calico / basenji

Sequential regulatory activity predictions with deep convolutional neural networks.
Apache License 2.0

Use basenji for motif discovery / motif scanning #116

Open moxgreen opened 2 years ago

moxgreen commented 2 years ago

Dear basenji developers, I would like to use basenji to model ChIP-seq data. Specifically, I would like to train a model on the ChIP-seq data for a single transcription factor (TF). The data can be peaks in BED format, or signal in BigWig format if that is more suitable for basenji. I do not expect an interpretable model such as a PWM; I only need a model that can predict whether a sequence is bound by the transcription factor of interest.

Once the model is trained, I would like to apply it to other sequences (e.g. from an ATAC-seq experiment, or the entire genome) and predict whether each sequence is expected to be bound by the TF.
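A sequence-based CNN generally takes fixed-length, one-hot encoded DNA as input, so peaks from another experiment usually have to be resized to the training length before prediction. The following is a minimal sketch of that preparation step in plain Python. The helper names `center_window` and `one_hot` are hypothetical and are not part of basenji, and the window length used in the example is an arbitrary placeholder.

```python
# Hypothetical helpers for illustration only; these are not basenji functions.

def center_window(start, end, seq_length):
    """Return a window of exactly seq_length bp centered on the interval midpoint.

    The window start is clamped at 0 so it never extends past the chromosome start.
    """
    mid = (start + end) // 2
    new_start = max(0, mid - seq_length // 2)
    return new_start, new_start + seq_length


def one_hot(seq):
    """Encode DNA as rows of [A, C, G, T]; unknown bases such as N become all zeros."""
    index = {"A": 0, "C": 1, "G": 2, "T": 3}
    encoded = []
    for base in seq.upper():
        row = [0, 0, 0, 0]
        if base in index:
            row[index[base]] = 1
        encoded.append(row)
    return encoded


# Example: resize a 100 bp peak to a 50 bp window (placeholder length).
print(center_window(100, 200, 50))
print(one_hot("AcN"))
```

The window length passed to `center_window` would need to match whatever sequence length the trained model expects.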

Is basenji suitable for this purpose? Should I use basset instead?

I was able to run basenji_train.py and basenji_test.py on the test data you provided. One of the difficult steps for me is designing a model.json suitable for my needs. I see that some parameters in the provided models (e.g. https://github.com/calico/basenji/blob/master/testdata/params.small.hd5.txt) clearly depend on the input (e.g. seq_length). I'm not an expert in CNNs, so I would like to use a "standard" architecture, but I still need to correctly set every parameter that depends on the input, and it is not clear to me which parameters those are.
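One way to catch a mismatch between an input-dependent parameter and the data is a small sanity check before training. The sketch below assumes a JSON parameter file with a top-level "model" object that contains "seq_length"; that layout is an assumption for illustration, and the file linked above may be organized differently. `check_seq_length` is a hypothetical helper, not a basenji function.

```python
import json


def check_seq_length(params, sequences):
    """Return the indices of sequences whose length differs from the configured seq_length.

    Assumes params["model"]["seq_length"] exists (an assumed layout, see the note above).
    """
    expected = params["model"]["seq_length"]
    return [i for i, seq in enumerate(sequences) if len(seq) != expected]


# Toy parameters and sequences; a real file would be read with json.load().
params = json.loads('{"model": {"seq_length": 8}}')
mismatched = check_seq_length(params, ["ACGTACGT", "ACGT"])
print(mismatched)
```

An empty list means every sequence already has the configured length.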

Thanks for any advice.

davek44 commented 2 years ago

Hi, I believe the basset path is more straightforward for your application. I would add your dataset to the DNase compendium, so that training also includes many tough negative examples. The procedure is described here: https://github.com/calico/basenji/blob/master/manuscripts/basset/make_dataset.sh
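To illustrate why adding a TF dataset to a larger compendium yields tough negatives, here is a conceptual sketch in plain Python. It is not the implementation used by make_dataset.sh. The function names, dataset names, and coordinates are all hypothetical. Each candidate site gets one binary label per dataset, and a site that is accessible in some cell type but has no TF peak serves as a hard negative for the TF.

```python
# Conceptual illustration only; all names and coordinates here are made up.

def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def label_matrix(sites, peaks_by_dataset):
    """Return (dataset names, rows), with one 0/1 label per dataset for every site."""
    names = sorted(peaks_by_dataset)
    rows = []
    for site in sites:
        rows.append(
            [int(any(overlaps(site, peak) for peak in peaks_by_dataset[name]))
             for name in names]
        )
    return names, rows


sites = [("chr1", 100, 200), ("chr1", 500, 600)]
peaks_by_dataset = {
    "my_tf": [("chr1", 150, 250)],
    "dnase_cellA": [("chr1", 90, 210), ("chr1", 480, 620)],
}
names, rows = label_matrix(sites, peaks_by_dataset)
print(names)
print(rows)
```

In this toy example the second site is accessible in the hypothetical DNase dataset but has no TF peak, so it is labeled 0 for the TF.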

Then you can use the parameters described here: https://github.com/calico/basenji/blob/master/manuscripts/basset/params_basset.json

Let me know if you encounter any issues!

moxgreen commented 2 years ago

Many thanks, I will try basset.