Open gouthamatla opened 2 years ago
Hi Goutham, peak prediction is tough due to imbalance, and AUPRC will reflect that. Are those the only three datasets that you're training on? In that case, the negatives for one dataset will be the union of the peaks from the other two datasets. Since the descriptions look highly related, that might not be ideal. If you wanted some additional diverse negative examples, you could include more peak BED files from ENCODE. One path to obtaining and preparing them is described here: https://github.com/calico/basenji/tree/master/manuscripts/basset
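To make the negative-set point concrete, here is a minimal sketch of how labels fall out when peaks from several datasets are pooled. It is an illustration only, not the actual Basenji/Basset preprocessing code: the function names, the tuple representation of peaks, and the dataset names are all hypothetical, and real pipelines typically also merge overlapping peaks into a single site.

```python
# Hypothetical illustration: peaks are (chrom, start, end) tuples.

def overlaps(a, b):
    """True if two half-open intervals on the same chromosome overlap."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def label_matrix(peaks_by_dataset):
    """Label every site in the union of all peaks, once per dataset.

    A site is positive (1) for a dataset if any of that dataset's peaks
    overlaps it, and negative (0) otherwise.
    """
    names = sorted(peaks_by_dataset)
    # The candidate sites are the union of peaks across all datasets.
    sites = sorted({p for peaks in peaks_by_dataset.values() for p in peaks})
    labels = {
        site: [int(any(overlaps(site, p) for p in peaks_by_dataset[n]))
               for n in names]
        for site in sites
    }
    return names, labels

# Made-up peaks for three datasets.
peaks = {
    "A": [("chr1", 100, 200)],
    "B": [("chr1", 150, 250), ("chr1", 1000, 1100)],
    "C": [("chr2", 50, 150)],
}
names, labels = label_matrix(peaks)
# Negatives for dataset "A" are the sites where only other datasets have peaks.
negatives_for_a = [s for s, row in labels.items() if row[names.index("A")] == 0]
```

With only a few closely related datasets, every negative for one target is a peak in a sibling target, which is why adding unrelated peak sets broadens the negatives.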
If you want to change the sequence length, then yes you would modify -l and set seq_length in params differently. You'll also want to consider the behavior of your model after the last convolution when the representation will be flattened across the length axis. If the length is still long, that will make the subsequent dense layer very large and probably impede learning. So you might want to add another convolution block with another pooling operation to get the sequence length down. I generally aim for a length of 5-10 before flattening.
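As a rough way to check the length before flattening, the arithmetic can be sketched as below. This assumes each pooling layer shrinks the length axis by ceiling division (as with "same" padding) and that the convolutions themselves preserve length; the pool sizes, filter count, and dense width are made-up values, not Basenji defaults.

```python
import math

def length_after_pooling(seq_length, pool_sizes):
    """Length of the sequence axis after applying each pooling step in turn.

    Assumes "same"-style pooling, so each step is a ceiling division.
    """
    length = seq_length
    for pool in pool_sizes:
        length = math.ceil(length / pool)
    return length

def dense_weight_count(seq_length, pool_sizes, filters, dense_units):
    """Weights in the first dense layer after flattening (biases ignored)."""
    flat = length_after_pooling(seq_length, pool_sizes) * filters
    return flat * dense_units

# Hypothetical architecture: 2000 bp input, 200 filters in the last conv,
# 1000 units in the dense layer.
three_blocks = length_after_pooling(2000, [4, 4, 4])
four_blocks = length_after_pooling(2000, [4, 4, 4, 4])
weights_three = dense_weight_count(2000, [4, 4, 4], 200, 1000)
weights_four = dense_weight_count(2000, [4, 4, 4, 4], 200, 1000)
```

Under these assumptions, three pooling steps of 4 leave a length of 32, while a fourth brings it to 8, inside the 5-10 target, and cuts the dense layer from 6.4 million to 1.6 million weights.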
The crop option is more relevant for Basenji, where I'm making sequential predictions across the sequence. In that case, I tend to ignore the data on the far ends since I'm missing information off the boundary of the sequence. You can ignore it for peak prediction.
Thanks, David, for the very detailed answer. Is there a way to explicitly provide a negative dataset (e.g. a BED file)? Even if I add data from ENCODE, wouldn't the other two datasets from my targets still be included in the negative set (the union of my two datasets plus ENCODE)?
Not easily. I just haven't been working in that sort of setup for a while. If you add a bunch of ENCODE BEDs, then the negatives from your other targets will make up a small proportion. Plus, if there's no peak observed in one of the datasets, then it is a negative. I'd think you'd want to include that.
Thanks. This is very useful and key to understanding how the negative data is selected. I can probably train a separate model for each of my targets, mixed with unrelated ENCODE tissues.
Hi David,
I am interested in Basset-style predictions based on peaks. I would like to understand how to tune the parameters, especially the peak length.
I did a test run with the default parameters and it worked, but the AUPRC was low.
If I want to use the entire 2 kb sequence (+/- 1 kb) around the centre of my peaks, should I change the length (-l) to 2000 in basenji_data.py and 'seq_length' in the params file for training? What is the crop option in basenji_data.py? How does it affect the peak length that is used?
Thanks, Goutham A