calico / basenji

Sequential regulatory activity predictions with deep convolutional neural networks.
Apache License 2.0

How do I replicate Basset's peak classification using Basenji? #68

Open doczmp opened 4 years ago

doczmp commented 4 years ago

Hi,

Firstly, thanks for making these tools open source. Much appreciated!

I have a few questions on the same topic. I have DHS data from various cell types that I would like to train a predictive model on. I initially intended to use Basset, but then saw this in the Basenji README: "Basenji makes predictions in bins across the sequences you provide. You could replicate Basset's peak classification by simply providing smaller sequences and binning the target for the entire sequence." 1) What does "binning the target for the entire sequence" mean? 2) What would be the process for replicating Basset's peak classification on DHS data using Basenji? It seems that Basenji only scores in 128 bp windows. 3) Can I score full DHS sequences (150 bp)?

In addition, I see that Basset was trained on 600 bp sequences, while Basenji is trained on much larger sequences. 4) If I want to train on new DHS data, would I be able to train the Basenji architecture on smaller DHS sequences (around 150-600 bp), or do I have to use Basset? I would definitely prefer Basenji, as it is based on TensorFlow, and Lua (which Basset uses) isn't compatible with the Power9 architecture.

I appreciate any help.

Best, Zain

davek44 commented 4 years ago

Hi Zain,

Matching the Basset training procedure in Basenji is doable, but writing the additional code it would require hasn't made it to the top of my to-do list. The easiest way for you to proceed would be to train on the BigWig files for your DHS data in the typical Basenji framework. If you prefer to work with the peaks, then we'll need a version of basenji_data.py that writes binary labels from BED files into the TFRecords, rather than continuous labels from BigWig files as it currently does.
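For illustration, the labeling step that such a BED-based variant would need can be sketched as below. This is only a minimal sketch, not code from basenji_data.py: the function name `binary_labels` and the example intervals are hypothetical, it assumes standard BED coordinates (0-based, half-open), and it treats any overlap between a sequence and a peak as a positive label. Writing the labels into TFRecords is left out.

```python
def binary_labels(seqs, peaks_by_target):
    """Label each sequence 1/0 per target by overlap with that target's peaks.

    seqs: list of (chrom, start, end) intervals, BED-style (0-based, half-open).
    peaks_by_target: one list of (chrom, start, end) peaks per target (e.g. cell type).
    Returns a list of rows, one per sequence, each with one 0/1 entry per target.
    """
    labels = []
    for chrom, start, end in seqs:
        row = []
        for peaks in peaks_by_target:
            # Half-open intervals overlap when each starts before the other ends.
            hit = any(
                p_chrom == chrom and p_start < end and p_end > start
                for p_chrom, p_start, p_end in peaks
            )
            row.append(1 if hit else 0)
        labels.append(row)
    return labels


# Hypothetical example: three 600 bp sequences and two targets.
example_seqs = [("chr1", 100, 700), ("chr1", 1000, 1600), ("chr2", 100, 700)]
example_peaks = [
    [("chr1", 650, 800)],                      # target 0
    [("chr1", 700, 900), ("chr2", 0, 101)],    # target 1
]
example_labels = binary_labels(example_seqs, example_peaks)
```

The linear scan over peaks is fine for a sketch; a real implementation over whole-genome peak sets would likely want sorted intervals or an interval tree.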

The input sequence length and bin size at which the model predicts annotations are easily modifiable parameters.
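To make "binning the target for the entire sequence" concrete: if the bin size is set equal to the input sequence length, each sequence gets a single target value, which is the Basset-style setup of one label per sequence. The sketch below is only an illustration of that idea, not Basenji's own code; the helper name `bin_coverage`, the toy coverage values, and the choice to sum coverage within each bin are assumptions.

```python
def bin_coverage(coverage, bin_size):
    """Sum per-base coverage into consecutive, non-overlapping bins.

    coverage: list of per-base signal values for one sequence.
    bin_size: number of bases per bin; must divide the sequence length evenly.
    """
    if bin_size <= 0 or len(coverage) % bin_size != 0:
        raise ValueError("bin_size must be positive and divide the sequence length")
    return [
        sum(coverage[i:i + bin_size])
        for i in range(0, len(coverage), bin_size)
    ]


# Toy 8 bp "sequence" with hypothetical coverage values.
toy_coverage = [0, 1, 2, 3, 4, 3, 2, 1]

# Bin size smaller than the sequence: several targets per sequence.
two_bins = bin_coverage(toy_coverage, 4)

# Bin size equal to the sequence length: one target for the entire sequence.
one_bin = bin_coverage(toy_coverage, len(toy_coverage))
```

With a 600 bp sequence and a 600 bp bin, the same call would return a single value per sequence.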

Best, Dave

doczmp commented 4 years ago

Hi Dave,

Thanks a lot for the response. I appreciate it!

It would be great to have the feature in basenji_data.py to work with peaks instead of BigWigs.

Best, Zain