calico / basenji

Sequential regulatory activity predictions with deep convolutional neural networks.
Apache License 2.0

Basset style predictions - parameter tuning. #136

Open gouthamatla opened 2 years ago

gouthamatla commented 2 years ago

Hi David,

I am interested in Basset style predictions based on peaks. I would like to understand how to tune the parameters, especially peak length.

I did a test run with the default parameters and it ran fine, but the AUPRC is low:

index  auroc    auprc    identifier        description
0      0.73468  0.32834  HI_enhancers_I    Human_islet_enhancers_I
1      0.69157  0.39808  HI_enhancers_II   Human_islet_enhancers_II
2      0.70348  0.20886  HI_enhancers_III  Human_islet_enhancers_III

If I want to use the entire 2 kb sequence (+/- 1 kb) around the centre of my peaks, should I change the length (-l) to 2000 in basenji_data.py and 'seq_length' in the params file for training? What does the crop option in basenji_data.py do, and how does it affect the peak length that is used?

Thanks, Goutham A

davek44 commented 2 years ago

Hi Goutham, peak prediction is tough due to imbalance, and AUPRC will reflect that. Are those the only three datasets that you're training on? In that case, the negatives for one dataset will be the union of the peaks from the other two datasets. Since the descriptions look highly related, that might not be ideal. If you wanted some additional diverse negative examples, you could include more peak BED files from ENCODE. One path to obtaining and preparing them is described here: https://github.com/calico/basenji/tree/master/manuscripts/basset
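To make the point about negatives concrete, here is a minimal sketch of how a multi-target label matrix falls out of the union of peaks. It is an illustration only, not Basenji's preprocessing code: `build_label_matrix`, `positive_fraction`, and the region ids `r1`..`r5` are hypothetical, and peaks are assumed to have already been merged into shared region ids. The positive fraction matters because a random classifier's expected AUPRC is roughly equal to it.

```python
def build_label_matrix(peak_sets):
    """Build a regions x targets 0/1 matrix from per-target peak sets.

    peak_sets: dict mapping target name -> set of (already merged) region ids.
    A region is a negative for a target if it is a peak only in other targets.
    """
    targets = sorted(peak_sets)
    regions = sorted(set().union(*peak_sets.values()))
    labels = [[1 if r in peak_sets[t] else 0 for t in targets] for r in regions]
    return regions, targets, labels


def positive_fraction(labels, col):
    """Fraction of regions labeled positive for the target in column `col`."""
    return sum(row[col] for row in labels) / len(labels)


# Hypothetical toy peak sets for the three targets in this thread.
peak_sets = {
    "HI_enhancers_I": {"r1", "r2", "r3"},
    "HI_enhancers_II": {"r3", "r4"},
    "HI_enhancers_III": {"r5"},
}
regions, targets, labels = build_label_matrix(peak_sets)
for i, t in enumerate(targets):
    print(t, positive_fraction(labels, i))
```

Adding more peak sets to `peak_sets` (for example from ENCODE) enlarges the union, so each original target gets more, and more diverse, negatives.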

If you want to change the sequence length, then yes you would modify -l and set seq_length in params differently. You'll also want to consider the behavior of your model after the last convolution when the representation will be flattened across the length axis. If the length is still long, that will make the subsequent dense layer very large and probably impede learning. So you might want to add another convolution block with another pooling operation to get the sequence length down. I generally aim for a length of 5-10 before flattening.
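The length arithmetic can be checked before training with a few lines. This is a rough sketch under stated assumptions: each pooling layer with pool size p is assumed to reduce the length to ceil(L / p) (as with "same" padding; "valid" padding would floor instead), convolutions are assumed not to change the length, and the pool sizes below are hypothetical rather than taken from any Basenji params file.

```python
def length_before_flatten(seq_length, pool_sizes):
    """Length along the sequence axis after applying each pooling step in order.

    Assumes every pooling layer maps length L to ceil(L / pool_size).
    """
    length = seq_length
    for p in pool_sizes:
        length = -(-length // p)  # integer ceiling division
    return length


# Hypothetical pooling stack applied to a 2000 bp input.
three_blocks = length_before_flatten(2000, [3, 4, 4])
four_blocks = length_before_flatten(2000, [3, 4, 4, 5])
print(three_blocks, four_blocks)
```

With these assumed pool sizes, three blocks leave a length of 42, which would make the dense layer large; a fourth block with pool size 5 brings it to 9, inside the 5-10 range mentioned above.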

The crop option is more relevant for Basenji, where I'm making sequential predictions across the sequence. In that case, I tend to ignore the data on the far ends, since I'm missing information off the boundary of the sequence. You can ignore it for peak prediction.
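As a generic illustration of cropping for sequential predictions (not the basenji_data.py implementation), the idea is to drop a fixed number of positions from each end of the per-position values. The function name `crop_ends` and the choice of counting the crop in positions rather than bp are assumptions made for this sketch.

```python
def crop_ends(values, crop):
    """Return `values` with `crop` positions removed from each end."""
    if crop < 0 or 2 * crop > len(values):
        raise ValueError("crop must be non-negative and at most half the length")
    return list(values[crop:len(values) - crop])


# Hypothetical per-position values across one sequence.
per_position = [0.1, 0.4, 0.9, 0.8, 0.3, 0.2]
cropped = crop_ends(per_position, 1)
print(cropped)
```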

gouthamatla commented 2 years ago

Thanks, David, for the very detailed answer. Is there a way to explicitly provide a negative dataset (e.g. a BED file)? Even if I add data from ENCODE, would the peaks from my other two targets still be included in the negative set (the union of my other two datasets plus ENCODE)?

davek44 commented 2 years ago

Not easily. I just haven't been working in that sort of setup for a while. If you add a bunch of ENCODE BEDs, then the negatives from your other targets will make up a small proportion. Plus, if there's no peak observed in one of the datasets, then it is a negative. I'd think you'd want to include that.

gouthamatla commented 2 years ago

Thanks. This is very useful, and key to understanding how the negative data is selected. I could probably train a separate model for each of my targets, mixed with unrelated ENCODE tissues.