Tutorial for identifying distal regulatory elements

calico / basenji

Sequential regulatory activity predictions with deep convolutional neural networks.

Apache License 2.0


Tutorial for identifying distal regulatory elements #59

Open SirFly opened 4 years ago

SirFly commented 4 years ago

Hi Basenji developers, thanks a lot for your project.

I would like to use your software to identify distal regulatory elements and outline their driving motifs in the K562 and PrEC cell lines. I see that part of this procedure is in silico saturation mutagenesis, and you posted a tutorial on that, but I cannot find one for identifying regulatory elements.

I would be grateful if you could provide a comprehensive tutorial for this task.

Thanks!

davek44 commented 4 years ago

Hi, our work on this task is in flux right now. We're working hard to develop more effective methods. If you're using the master branch, I'd recommend using basenji_sed.py to score nucleotide mutations for their gene influence. You could then summarize those scores to describe larger regions with mean or max. Alternatively, you can use basenji_map.py to replicate the gradient-based approach that I used in the initial publication.
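As a rough illustration of the "summarize with mean or max" suggestion, here is a minimal sketch. It assumes you have already collected one score per nucleotide position into a NumPy array; the array values, the region coordinates, and the `summarize_regions` helper are all hypothetical and do not reflect the actual output format of basenji_sed.py. Taking absolute values before summarizing is also an assumption made here, so that positive and negative effects do not cancel out in the mean.

```python
import numpy as np


def summarize_regions(scores, regions):
    """Return a (mean, max) pair of absolute scores for each region.

    scores  : 1D array with one score per nucleotide position (hypothetical input).
    regions : list of (start, end) index pairs, half-open like Python slices.
    """
    summaries = []
    for start, end in regions:
        # Use magnitudes so that opposite-signed effects do not cancel (an assumption).
        window = np.abs(scores[start:end])
        summaries.append((float(window.mean()), float(window.max())))
    return summaries


# Made-up per-nucleotide scores and two made-up regions, for illustration only.
scores = np.array([0.0, 0.1, -0.4, 0.2, 0.0, 0.0, 0.9, -0.1])
regions = [(0, 4), (4, 8)]

region_summaries = summarize_regions(scores, regions)
for (start, end), (mean_score, max_score) in zip(regions, region_summaries):
    print(f"{start}-{end}: mean={mean_score:.3f} max={max_score:.3f}")
```

The mean rewards regions with many moderately influential positions, while the max highlights regions containing a single strong position, so it may be worth looking at both.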

However, I'm hard at work on a new version of the code that will be TensorFlow 2 compatible, and I'm about to release it. It's currently on the tf2_hic branch. If you'd like to use that code, let me know and I can make different suggestions.

SirFly commented 4 years ago

Thanks for your quick reply! For now, I would like to use the gradient-based approach described in your publication.

Do I understand correctly that, in order to do that, I need to use bam_cov.py to get BigWig and HDF5 files from the FANTOM BAM files, train a model that predicts gene coverage using basenji_train.py, and then pass the trained model to basenji_map.py? And afterward, if I need to, I could follow the basenji_sed.py tutorial for the identified regions?

Another question I have is about the options for basenji_map.py. What does the targets_file parameter mean, and what should the data passed there look like?

davek44 commented 4 years ago

Yes, that's the pipeline you would follow. However, instead of training your own model on the FANTOM data, you can simply use my pre-trained model, e.g. here: https://github.com/calico/basenji/tree/tf1/manuscripts/biorxiv2019. Note that you'll now need to work from the basenji1 branch of the code. There's an example of the targets file in that directory, too. The purpose of that option is to let you filter that file so that it includes only the target data sets that you'd like to analyze.
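Filtering a targets file down to the data sets of interest could look something like the sketch below. It assumes the file is tab-separated with a header row and a `description` column; those column names, the example rows, and the `filter_targets` helper are hypothetical, so check the example targets file in the directory linked above for the real layout before adapting this.

```python
import csv
import io


def filter_targets(in_handle, out_handle, keyword, column="description"):
    """Copy only the rows whose `column` contains `keyword` (case-insensitive).

    Assumes a tab-separated table with a header row; the column name is an
    assumption and may differ in the real targets file.
    Returns the number of rows kept.
    """
    reader = csv.DictReader(in_handle, delimiter="\t")
    writer = csv.DictWriter(
        out_handle,
        fieldnames=reader.fieldnames,
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    kept = 0
    for row in reader:
        if keyword.lower() in row[column].lower():
            writer.writerow(row)
            kept += 1
    return kept


# Made-up targets table, for illustration only.
example = (
    "index\tidentifier\tdescription\n"
    "0\tT0\tCAGE:K562\n"
    "1\tT1\tCAGE:HepG2\n"
    "2\tT2\tDNASE:K562\n"
)

filtered = io.StringIO()
n_kept = filter_targets(io.StringIO(example), filtered, "K562")
print(n_kept)
print(filtered.getvalue(), end="")
```

With real files, the two `io.StringIO` objects would be replaced by handles from `open(...)`, and the filtered file would then be the one passed as the targets file.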