FunctionLab / selene

a framework for training sequence-level deep learning networks
BSD 3-Clause Clear License

Custom interval file #95

Closed gouthamatla closed 5 years ago

gouthamatla commented 5 years ago

Hi All,

I am interested in using Selene on my own data, and I would like to try a couple of things.

One is to train Selene on TF ChIP-seq data of interest to me and perform in silico mutagenesis (ISM). I think this is exactly case 1 of the Selene paper. My data is from hg19, so should I use the hg19 FASTA and an hg19 intervals file? Where can I find the hg19 intervals file?

On the other hand, I have enhancers from my tissue of interest, which may contain binding sites for multiple TFs rather than just one. Have you tested Selene's performance on diverse sequences like enhancers? Is there any way to get the top saliency features from Selene?

In any case, my understanding is that the intervals file is used to create the training, validation and test sets. Am I correct? Can I use non-enhancer open chromatin regions from my tissue of interest as the intervals file when running the CLI?

Thanks, Goutham A

kathyxchen commented 5 years ago

Hi Goutham,

You can use the intervals file we provide (regions where DeepSEA training/validation/testing data contains at least 1 TF) or come up with your own (non-enhancer open chromatin regions sounds fine if it fits with your use case). The intervals are just the regions that you want to sample from - you'd only use it if you think there should be restrictions on the regions from which Selene can generate samples.
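As a rough illustration of building your own intervals file, here is a minimal sketch that subtracts enhancer regions from open chromatin regions and formats what is left as tab-separated lines. The three-column `chrom`, `start`, `end` layout (0-based, half-open) and the function names `subtract_intervals` and `format_intervals` are assumptions made for this example and are not part of Selene; check the sampler documentation for the exact format it expects.

```python
from collections import defaultdict


def subtract_intervals(regions, excluded):
    """Return the parts of `regions` that are not covered by `excluded`.

    Both arguments are lists of (chrom, start, end) tuples using
    0-based, half-open coordinates (an assumption of this sketch).
    """
    excluded_by_chrom = defaultdict(list)
    for chrom, start, end in excluded:
        excluded_by_chrom[chrom].append((start, end))
    for spans in excluded_by_chrom.values():
        spans.sort()

    kept = []
    for chrom, start, end in sorted(regions):
        cursor = start  # leftmost position not yet handled
        for ex_start, ex_end in excluded_by_chrom.get(chrom, []):
            if ex_end <= cursor:
                continue  # excluded span lies entirely to the left
            if ex_start >= end:
                break  # excluded span lies entirely to the right
            if ex_start > cursor:
                kept.append((chrom, cursor, ex_start))
            cursor = max(cursor, ex_end)
            if cursor >= end:
                break
        if cursor < end:
            kept.append((chrom, cursor, end))
    return kept


def format_intervals(intervals):
    """Format intervals as tab-separated chrom/start/end lines."""
    return "".join(f"{c}\t{s}\t{e}\n" for c, s, e in intervals)


# Made-up coordinates, for illustration only.
open_chromatin = [("chr1", 100, 500), ("chr2", 0, 50)]
enhancers = [("chr1", 200, 300), ("chr1", 450, 600)]
non_enhancer_open = subtract_intervals(open_chromatin, enhancers)
intervals_text = format_intervals(non_enhancer_open)
```

Writing `intervals_text` to a file gives one sampling region per line. In practice a tool such as bedtools is the more usual way to do the subtraction; the pure-Python version is only meant to show the idea.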

Yes, you should use hg19 if your data is hg19.

What do you mean by top saliency features? (I don't think we provide that kind of functionality though.) Do you mean whether sequence-level deep learning models can identify TF binding in enhancer regions? We haven't looked specifically at enhancer regions but I don't think there would be a huge difference region-to-region in the genome for model prediction accuracy.
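One generic way to approximate a per-position importance ranking is the in silico mutagenesis (ISM) mentioned in the question: mutate each base in turn and record how much the model's prediction changes. The sketch below only illustrates that idea and does not use Selene's API. `in_silico_mutagenesis`, `top_positions` and `toy_predict` are hypothetical names, and `toy_predict` is a stand-in for a trained model that returns a single score per sequence.

```python
BASES = "ACGT"


def in_silico_mutagenesis(sequence, predict):
    """Score every single-base substitution relative to the reference.

    `predict` is any callable mapping a DNA string to a float
    (a hypothetical stand-in for a trained model).
    Returns a list of (position, ref_base, alt_base, score_change).
    """
    reference_score = predict(sequence)
    effects = []
    for pos, ref_base in enumerate(sequence):
        for alt_base in BASES:
            if alt_base == ref_base:
                continue
            mutated = sequence[:pos] + alt_base + sequence[pos + 1:]
            effects.append(
                (pos, ref_base, alt_base, predict(mutated) - reference_score)
            )
    return effects


def top_positions(effects, k=3):
    """Rank positions by the largest absolute score change seen there."""
    strongest = {}
    for pos, _ref, _alt, delta in effects:
        strongest[pos] = max(strongest.get(pos, 0.0), abs(delta))
    # Sort by effect size (descending), breaking ties by position.
    return sorted(strongest, key=lambda p: (-strongest[p], p))[:k]


def toy_predict(sequence):
    """Toy model: 1.0 if the made-up motif GATA is present, else 0.0."""
    return 1.0 if "GATA" in sequence else 0.0


effects = in_silico_mutagenesis("CCGATACC", toy_predict)
important = top_positions(effects, k=4)
```

With the toy model, the positions that matter are the ones inside the motif, since any substitution there removes it. With a real multi-task model, `predict` would have to pick out the single output of interest.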

gouthamatla commented 5 years ago

Thanks. Where can I find the hg19 intervals file? I guess it should be in this archive:

wget https://zenodo.org/record/1443558/files/selene_quickstart.tar.gz

kathyxchen commented 5 years ago

https://github.com/FunctionLab/selene/tree/master/manuscript/case1/data#additional-note

You can just liftOver from hg38 back to hg19 if that's faster. Otherwise, I don't know if we provide it (if we use IntervalsSampler in case 2, it's probably downloadable there). You might need to regenerate it from the DeepSEA data using the script linked above.
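For the liftOver route, a minimal sketch of driving the UCSC `liftOver` command-line tool from Python is shown below. It assumes the `liftOver` binary is installed and on `PATH`, and that you have downloaded the `hg38ToHg19.over.chain.gz` chain file from UCSC. The file names and the helper names `liftover_command` and `run_liftover` are placeholders for this example.

```python
import shutil
import subprocess


def liftover_command(in_bed, chain_file, out_bed, unmapped_bed):
    """Build the argument list: liftOver oldFile map.chain newFile unMapped."""
    return ["liftOver", in_bed, chain_file, out_bed, unmapped_bed]


def run_liftover(in_bed, chain_file, out_bed, unmapped_bed):
    """Run liftOver, failing early if the binary is not installed."""
    if shutil.which("liftOver") is None:
        raise FileNotFoundError("liftOver was not found on PATH")
    command = liftover_command(in_bed, chain_file, out_bed, unmapped_bed)
    subprocess.run(command, check=True)


# Placeholder file names, for illustration only.
command = liftover_command(
    "intervals_hg38.bed",
    "hg38ToHg19.over.chain.gz",
    "intervals_hg19.bed",
    "intervals_unmapped.bed",
)
```

Regions that cannot be mapped are written to the last file, so it is worth checking how many intervals end up there before training on the converted set.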