Danko-Lab / clipnet

A deep learning approach to predicting transcription initiation from sequence at single nucleotide resolution
MIT License

Can the standard reference genome be directly used as input? #1

Closed zhangjy859 closed 7 months ago

zhangjy859 commented 7 months ago

Hi,

I noticed that the documentation for predict_ensemble.py says: "This script takes a fasta file containing 1000 bp records and outputs an hdf5 file containing the predictions for each record." Is it possible to run predict_ensemble.py on a whole reference genome (such as hg38) for a genome-wide search, or is it limited to short records (like 1000 bp) in a fasta file?

Best,

Zhang

adamyhe commented 7 months ago

Hi Zhang,

No, the model itself only supports predicting on 1kb input sequences. In principle, you could fragment the entire genome and stitch the predictions together (keeping in mind that CLIPNET only outputs a prediction for the middle 500 bp), but I wouldn't really recommend it because:

1) It would be pretty compute intensive.
2) Most of the genome does not have active transcription initiation. It would be a lot more efficient to predict at, say, 1kb regions around cCREs from ENCODE or ATAC-seq peaks.
3) Most enhancers and promoters are cell-type specific, with their cell-type specific activity being driven by non-sequence factors like chromatin state, TF expression, etc. You'll end up with erroneous initiation predictions at many of these inactive regulatory elements.
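
For anyone who still wants to try the fragment-and-stitch route described above, here is a minimal sketch of the coordinate bookkeeping only. It assumes 1 kb windows stepped by 500 bp so that the central 500 bp of consecutive windows line up end to end. The function names `tile_windows` and `stitch_centers` are hypothetical and not part of CLIPNET, and the per-window predictions are assumed to be a flat list of 500 values, which may not match the actual hdf5 output layout.

```python
def tile_windows(seq_len, window=1000, center=500):
    """Return (start, end) windows, 0-based half-open, stepped by `center` bp.

    With this stride, the central `center` bp of consecutive windows
    are adjacent, so stitched predictions neither overlap nor leave gaps.
    """
    return [(s, s + window) for s in range(0, seq_len - window + 1, center)]


def stitch_centers(windows, preds, seq_len, window=1000, center=500):
    """Place each window's central prediction into one sequence-length track.

    `preds[i]` is assumed (hypothetically) to hold `center` values for the
    middle of `windows[i]`. Positions never covered by a window center,
    such as the first and last 250 bp, stay None.
    """
    flank = (window - center) // 2
    track = [None] * seq_len
    for (start, _end), pred in zip(windows, preds):
        pred = list(pred)
        if len(pred) != center:
            raise ValueError("each prediction must cover the central region")
        track[start + flank:start + flank + center] = pred
    return track


# Example: a 2 kb sequence gives three windows covering positions 250..1749.
windows = tile_windows(2000)
```

The sequence for each window would be written out as a 1000 bp fasta record before prediction. The caveats in the list above still apply: most windows will fall in regions without active initiation.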

If you have a specific use case in mind, please let me know and I can brainstorm some ideas.

Adam

zhangjy859 commented 7 months ago

Hi,

Thank you for your answer; I understand. I came across your article earlier and found it very interesting. I was confused by the description in the documentation that I quoted in this issue. I am also interested in whether it can work in non-model organisms with poorly annotated transcriptomes. You have answered my questions. Thank you!

adamyhe commented 7 months ago

Oh, yeah, I would not use this to call regulatory elements in poorly annotated genomes. You'll likely need some functional data to reliably call enhancers (promoters you can probably call from gene annotations).
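
As a rough illustration of the gene-annotation route for promoters, the sketch below builds a 1 kb window centered on an annotated transcription start site, which is the input length the model supports. It assumes BED-style 0-based half-open gene coordinates. The function name `tss_window` is hypothetical and not part of CLIPNET, and this only defines candidate input regions; it does not establish that a promoter is active.

```python
def tss_window(start, end, strand, chrom_len, size=1000):
    """Return a (start, end) window of `size` bp centered on a gene's TSS.

    `start` and `end` are 0-based half-open gene coordinates (BED-style).
    Returns None when the window would run off the chromosome, since the
    model needs full-length input.
    """
    if strand == "+":
        tss = start          # first base of the gene on the plus strand
    elif strand == "-":
        tss = end - 1        # last base of the interval on the minus strand
    else:
        raise ValueError("strand must be '+' or '-'")
    half = size // 2
    w_start, w_end = tss - half, tss + half
    if w_start < 0 or w_end > chrom_len:
        return None
    return (w_start, w_end)


# Example: a plus-strand gene starting at position 10000.
plus_window = tss_window(10000, 20000, "+", 100000)
```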