calico / basenji

Sequential regulatory activity predictions with deep convolutional neural networks.
Apache License 2.0

How are TSSs extracted from GENCODE files? #110

Open cristianregep opened 2 years ago

cristianregep commented 2 years ago

I'm trying to replicate a few results from the 2020 manuscript for benchmarking, and this requires a list of TSSs (for example, https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008050#pcbi.1008050.s013 ). The same applies to some analyses in the 2021 Enformer paper and to the Wang et al. 2021 expression-modifying fine-mapping paper ( https://www.nature.com/articles/s41467-021-23134-8 ), where Basenji was used.

I saw that the Basenji 2020 paper and the Enformer 2021 paper used GENCODE as a source (V28 and V32, respectively). The Wang et al. paper doesn't mention its source. GENCODE has many different transcripts per gene, so one can in theory extract multiple TSSs per gene. My question is: how were the TSSs identified? Was there more than one per gene, and was the same method used across the three papers mentioned (Basenji 2020, Enformer 2021, Wang et al. 2021)? Also, what counts as a unique gene: a unique Ensembl gene ID, a unique HGNC ID, or something else?

davek44 commented 2 years ago

In each paper, genes were defined uniquely based on their GENCODE gene_id values.
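For readers who want to reproduce a gene list keyed this way, here is a minimal sketch of collecting the distinct `gene_id` values from GTF lines. It is not the code used in any of the papers. The function name `unique_gene_ids`, the `strip_version` option, and the toy IDs are assumptions made for illustration; the only facts relied on are the tab-separated nine-column GTF layout and the `gene_id "..."` attribute.

```python
import re

# Matches the gene_id attribute in the ninth GTF column, e.g. gene_id "ENSG...";
GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')


def unique_gene_ids(gtf_lines, strip_version=False):
    """Return the set of distinct gene_id values found in GTF lines.

    strip_version is a hypothetical convenience: GENCODE IDs carry a
    version suffix (".N"), which some analyses drop before comparing.
    """
    gene_ids = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a well-formed GTF record
        match = GENE_ID_RE.search(fields[8])
        if match is None:
            continue
        gene_id = match.group(1)
        if strip_version:
            gene_id = gene_id.split(".")[0]
        gene_ids.add(gene_id)
    return gene_ids


# Toy records (made-up IDs and coordinates, not real annotations).
toy_gtf = [
    "##description: toy example\n",
    'chr1\tHAVANA\tgene\t100\t500\t.\t+\t.\tgene_id "ENSG00000000001.1"; gene_name "A";\n',
    'chr1\tHAVANA\ttranscript\t100\t500\t.\t+\t.\tgene_id "ENSG00000000001.1"; transcript_id "ENST00000000001.1";\n',
    'chr1\tHAVANA\tgene\t1000\t2000\t.\t-\t.\tgene_id "ENSG00000000002.3"; gene_name "B";\n',
]

ids = unique_gene_ids(toy_gtf)
ids_no_version = unique_gene_ids(toy_gtf, strip_version=True)
```

With a real file, the same function can be fed an open file handle (for a gzipped GTF, one opened with `gzip.open(path, "rt")`), since it only iterates over lines.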

For the Enformer paper, I'm not sure which analyses you are referring to. We benchmarked RNA-seq gene expression prediction against ExPecto using a dataset that those authors curated. They chose one TSS per gene using a strategy that involved FANTOM CAGE peaks. See the ExPecto paper and its GitHub repository for details and data.

We also filtered some of the analyses for distance to TSS. In that case, we would have computed the closest of any TSS in the GENCODE GTF file as the annotation. I believe that's what Qingbo would've done in his paper, too, but you'd have to email him to confirm.

In the 2020 cross-species paper, the link takes you to a supplementary figure where I filtered variants based on TSS distance. Again, I would've computed the closest of any TSS in the GENCODE GTF.
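A minimal sketch of what "the closest of any TSS in the GENCODE GTF" could look like in practice, for anyone attempting the replication. This is an illustration under stated assumptions, not the code used in the papers: it assumes a TSS is taken from every `transcript` record (the start coordinate on the `+` strand, the end coordinate on the `-` strand), that variant positions are 1-based like GTF coordinates, and that distance is unsigned. The function names and toy records are hypothetical.

```python
import bisect
from collections import defaultdict


def transcript_tss_by_chrom(gtf_lines):
    """Map each chromosome to a sorted list of distinct transcript TSS positions."""
    tss = defaultdict(set)
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue  # assumption: one TSS per transcript record
        chrom, strand = fields[0], fields[6]
        start, end = int(fields[3]), int(fields[4])
        # The TSS is the 5' end: start on '+', end on '-'.
        tss[chrom].add(start if strand == "+" else end)
    return {chrom: sorted(positions) for chrom, positions in tss.items()}


def distance_to_closest_tss(tss_by_chrom, chrom, pos):
    """Unsigned distance from pos to the nearest TSS on chrom, or None if none exist."""
    positions = tss_by_chrom.get(chrom)
    if not positions:
        return None
    i = bisect.bisect_left(positions, pos)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = positions[max(i - 1, 0): i + 1]
    return min(abs(pos - p) for p in candidates)


# Toy records (made-up IDs and coordinates, not real annotations).
toy_gtf = [
    'chr1\tHAVANA\tgene\t100\t500\t.\t+\t.\tgene_id "G1.1";\n',
    'chr1\tHAVANA\ttranscript\t100\t500\t.\t+\t.\tgene_id "G1.1"; transcript_id "T1.1";\n',
    'chr1\tHAVANA\ttranscript\t150\t500\t.\t+\t.\tgene_id "G1.1"; transcript_id "T2.1";\n',
    'chr1\tHAVANA\ttranscript\t1000\t2000\t.\t-\t.\tgene_id "G2.1"; transcript_id "T3.1";\n',
]

tss_index = transcript_tss_by_chrom(toy_gtf)
d_near_alt_tss = distance_to_closest_tss(tss_index, "chr1", 160)
d_minus_strand = distance_to_closest_tss(tss_index, "chr1", 1900)
```

Because every transcript contributes a TSS, a gene with several transcripts contributes several candidate positions, so a variant is measured against the nearest of them rather than against a single representative TSS per gene. Variants could then be filtered by comparing the returned distance to a chosen threshold.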