calico / basenji

Sequential regulatory activity predictions with deep convolutional neural networks.
Apache License 2.0

Question about Gene Expression Training Preprocessing #161

Open wkl1990 opened 1 year ago

wkl1990 commented 1 year ago

Hi Dave (@davek44),

I recently read your 2018 Basenji paper, in which you discuss cell-type-specific gene expression. In the paper, you mention that you made predictions in the 128-bp bin containing each transcription start site (TSS) and that, for each gene outside the training set, you summed the predictions across its TSSs to compute accuracy statistics.

I was wondering if you could clarify whether you filtered the bigwig data to remove signal outside the TSSs, or instead restricted the training set to TSS regions. I'm new to Basenji and would greatly appreciate your help in understanding this part of the preprocessing.

Thank you!

davek44 commented 1 year ago

I'm not sure what you mean by "filter the bigwig data". We train on the whole genome, other than highly repetitive and unmappable regions.

wkl1990 commented 1 year ago

Hello @davek44, thank you for your response. To clarify, do you mean that you train the model on the entire genome but only make predictions at the TSS regions? Additionally, I am curious how you generated the bigwig files for the expression data. Were they created in the same way as the DNase data, directly from the BAM file? If I use regular RNA-seq data, would I just keep the TSS reads to generate the bigwig signal?

davek44 commented 1 year ago

We train on the entire genome, and we make predictions across entire sequences. The model doesn't understand the concept of a TSS. You, the analyst, need to go in afterwards and pull out predictions at TSS if that's what you're interested in.
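As an illustration of that post-hoc step, here is a minimal sketch, not Basenji's own code. It assumes you already have a NumPy array of predictions for one sequence with shape `(num_bins, num_targets)`, the 0-based genomic coordinate of the first predicted base (`seq_start`, which should account for any cropping of the model output), and your own gene-to-TSS annotation. The function names `tss_bin_index` and `gene_predictions` and the toy data are hypothetical. It looks up the 128-bp bin containing each TSS and sums a gene's TSS bins, as described earlier in this thread.

```python
import numpy as np

BIN_SIZE = 128  # bp per prediction bin, as stated for the 2018 paper in this thread


def tss_bin_index(tss_pos, seq_start, bin_size=BIN_SIZE):
    """Index of the bin containing a TSS, relative to the predicted window."""
    return (tss_pos - seq_start) // bin_size


def gene_predictions(preds, seq_start, tss_by_gene, bin_size=BIN_SIZE):
    """Sum predictions over the TSS bins of each gene.

    preds       : array of shape (num_bins, num_targets) for one sequence
    seq_start   : 0-based genomic coordinate of the first predicted base
    tss_by_gene : dict mapping gene name -> list of 0-based TSS positions
    Genes with no TSS inside the predicted window are skipped.
    """
    num_bins = preds.shape[0]
    gene_values = {}
    for gene, tss_positions in tss_by_gene.items():
        total = np.zeros(preds.shape[1], dtype=preds.dtype)
        found = False
        for tss in tss_positions:
            b = tss_bin_index(tss, seq_start, bin_size)
            if 0 <= b < num_bins:  # ignore TSSs outside this window
                total += preds[b]
                found = True
        if found:
            gene_values[gene] = total
    return gene_values


# Toy example: 8 bins x 2 targets, window starting at position 1000.
toy_preds = np.arange(16, dtype=float).reshape(8, 2)
toy_tss = {
    "geneA": [1000, 1130],        # bins 0 and 1
    "geneB": [1000 + 128 * 7 + 5],  # bin 7
    "geneC": [500],               # outside the window, skipped
}
toy_result = gene_predictions(toy_preds, 1000, toy_tss)
```

The same lookup can be applied to the binned experimental coverage to get gene-level targets, so that predicted and measured values are compared on equal footing.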

All BigWig files were created using a similar workflow from BAM files.

You cannot use RNA-seq. Only 5' RNA sequencing techniques like CAGE, GRO-seq, or PRO-seq will work.