UcarLab / CoRE-ATAC

MIT License

how does it work? #2

Open AnjaliC4 opened 3 years ago

AnjaliC4 commented 3 years ago

Hi, thanks for this nice tool! I wanted to understand the usage of CoRE-ATAC conceptually. To create the training-set peaks, do we need peak regions identified from bulk ATAC-seq? I have scATAC-seq data and plan to run this on cell-type-specific peaks as test data for predicting enhancers, but I'm not sure what to train the model on.

In addition, you mention the use of SNPs. How do you suggest we get this information? Should we genotype the samples, or do you think calling SNPs on cell-type peaks using the 1000 Genomes reference would work? At which steps is this information integrated, and what is the result? Can it identify caQTLs or SNPs falling on enhancers/promoters?

Thanks for your time.

ajt986 commented 3 years ago

Hello,

To classify peaks as promoters, enhancers, insulators, or "other", you will need 1) a set of peaks and 2) an alignment file (.bam) from the ATAC-seq assay. For scATAC-seq, this means creating pseudo-bulk alignment files from the reads of the cells in the cluster of interest. The feature encoder steps parse the reads at the peaks and encode various features, such as DNA sequence and ATAC-seq read pileups/insert sizes. Pretrained models are provided in the releases, so you can run the feature encoder and then pass the encoded features to the model predictor to predict which peaks correspond to promoters, enhancers, insulators, or "other". Here, "other" is used for cis-regulatory elements not identified as promoter, enhancer, or insulator. We tried to predict more specific ChromHMM states, but the models we tried could not discriminate them.
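As a rough illustration of the pseudo-bulk step (not part of CoRE-ATAC itself), the sketch below groups reads by the cluster of their cell barcode. The function name, the `(barcode, record)` layout, and the cluster labels are all hypothetical; in practice the records would come from a BAM file, and each group would be written to its own .bam with a tool of your choice.

```python
from collections import defaultdict


def split_reads_by_cluster(reads, barcode_to_cluster):
    """Group reads into pseudo-bulk sets, one per cluster.

    reads: iterable of (cell_barcode, read_record) pairs (hypothetical layout).
    barcode_to_cluster: dict mapping cell barcode -> cluster label.
    Reads whose barcode has no cluster assignment are skipped.
    """
    pseudo_bulk = defaultdict(list)
    for barcode, record in reads:
        cluster = barcode_to_cluster.get(barcode)
        if cluster is not None:
            pseudo_bulk[cluster].append(record)
    return dict(pseudo_bulk)


# Toy example: four reads from three cells, two of which were clustered.
reads = [("AAAC", "read1"), ("AAAG", "read2"), ("AAAC", "read3"), ("TTTT", "read4")]
clusters = {"AAAC": "beta", "AAAG": "alpha"}
pseudo_bulk = split_reads_by_cluster(reads, clusters)
```

Each resulting group then plays the role of one bulk ATAC-seq sample: peaks plus the pseudo-bulk alignments for that cluster go into the feature encoder.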

Model training also requires promoter, enhancer, insulator, and other annotations for each peak (i.e., from reference ChromHMM/ChIP-seq data). You can refer to Roadmap/ENCODE for some reference cell types. You will need to annotate the peaks to let the model know which peaks correspond to each cis-RE class. The scripts weren't set up for transfer learning, so training a new model starting from the provided pre-trained models will require a little programming. Otherwise, the model will be trained from scratch and may perform very well.

SNPs are inferred from the ATAC-seq reads during the encoding steps, whenever 10 or more reads pile up at a location. When this occurs, the observed base frequencies at that position are used instead of a one-hot encoding of the reference. For example, if 3 reads observe A and 7 observe C, the encoding would use 0.3 for the A position and 0.7 for the C position in the 4x600 matrix encoding the DNA sequence.
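The encoding rule above can be sketched roughly as follows. This is a minimal illustration, not the CoRE-ATAC implementation: the function names, the A/C/G/T row order, and the all-zero column for non-ACGT reference bases are assumptions made for the example.

```python
import numpy as np

BASES = "ACGT"   # row order of the 4 x L matrix (assumed for this sketch)
MIN_READS = 10   # read-depth threshold described above


def encode_position(ref_base, base_counts, min_reads=MIN_READS):
    """Return a length-4 column for one position.

    With enough read coverage, the column holds observed base frequencies;
    otherwise it is a one-hot encoding of the reference base.
    """
    column = np.zeros(4)
    depth = sum(base_counts.get(b, 0) for b in BASES)
    if depth >= min_reads:
        for i, b in enumerate(BASES):
            column[i] = base_counts.get(b, 0) / depth
    elif ref_base in BASES:
        column[BASES.index(ref_base)] = 1.0
    # Non-ACGT reference bases (e.g. N) stay all zeros: an assumption here.
    return column


def encode_sequence(ref_seq, counts_per_position):
    """Stack per-position columns into a 4 x len(ref_seq) matrix."""
    columns = [encode_position(r, c) for r, c in zip(ref_seq, counts_per_position)]
    return np.stack(columns, axis=1)


# The example from the text: 3 reads observe A and 7 observe C.
snp_column = encode_position("A", {"A": 3, "C": 7})
# Only 2 reads here, so the column falls back to the one-hot reference base.
low_cov_column = encode_position("G", {"A": 2})
matrix = encode_sequence("AGT", [{"A": 3, "C": 7}, {"A": 2}, {}])
```

For a full peak, `ref_seq` would be the 600 bp reference window, giving the 4x600 matrix mentioned above.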

If the genotype data is known, tools such as GATK's FastaAlternateReferenceMaker (https://gatk.broadinstitute.org/hc/en-us/articles/360037594571-FastaAlternateReferenceMaker) can be used to generate FASTA files containing the genotyped SNPs. The resulting reference can then be used in place of the standard reference when encoding data with CoRE-ATAC.

CoRE-ATAC doesn't identify caQTLs, but we observed changes in enhancer/promoter annotation probabilities at caQTLs in islets. We used RASQUAL to identify the caQTLs: https://www.nature.com/articles/ng.3467