Open arq5x opened 4 years ago
Hi Aaron, thanks!
I used the following BioMart settings:
Dataset: Human genes (GRCh37.p13)
Filters: Gene type: protein_coding
Attributes: Transcript stable ID, Gene name, 5' UTR start, 5' UTR end, Transcription start site (TSS), Strand, CDS start, CDS end, Exon region start (bp), Exon region end (bp), cDNA coding start, cDNA coding end, Exon rank in transcript, Chromosome/scaffold name
I originally defined canonical by pulling from the Ensembl API, but I found this was missing info for some genes, so in the end I took those flagged as canonical in the LOEUF file from the gnomAD flagship paper. I hope this helps to recreate it!
Ah, that makes sense. Thanks much.
Could you specify exactly which LOEUF file you used, and how you handled genes with multiple canonical transcripts (APPRIS often denotes multiple transcripts as primary)?
It was the by-transcript TSV file here: https://gnomad.broadinstitute.org/downloads#v2-constraint
I believe I took any transcript that was flagged as canonical, so if there were multiple I would have used all possible uAUG-creating variants from any of them.
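For anyone trying to reproduce this, here is a minimal sketch of the filtering described above. It is not the original code. It assumes the gnomAD by-transcript file is tab-separated with columns named `transcript` and `canonical` (the latter holding `true`/`false`), and that the BioMart export is tab-separated with headers matching the attribute names listed earlier in this thread. The function names and the ID values in the demo are hypothetical.

```python
import csv
import io


def canonical_transcripts(constraint_tsv):
    """Collect every transcript ID flagged canonical in the constraint table.

    Assumes columns named 'transcript' and 'canonical'. If a gene has more
    than one canonical transcript, all of them are kept.
    """
    reader = csv.DictReader(constraint_tsv, delimiter="\t")
    return {
        row["transcript"]
        for row in reader
        if row["canonical"].strip().lower() == "true"
    }


def filter_biomart(biomart_tsv, canonical):
    """Keep BioMart rows on canonical transcripts that carry a 5' UTR interval.

    Rows with empty 5' UTR coordinates are skipped, so a transcript with no
    annotated 5' UTR contributes no rows at all.
    """
    reader = csv.DictReader(biomart_tsv, delimiter="\t")
    kept = []
    for row in reader:
        if row["Transcript stable ID"] not in canonical:
            continue
        if not row["5' UTR start"].strip() or not row["5' UTR end"].strip():
            continue
        kept.append(row)
    return kept


# Tiny in-memory demo with made-up IDs; real use would open the downloaded files.
demo_constraint = io.StringIO(
    "gene\ttranscript\tcanonical\n"
    "GENE_A\tENST_DEMO_1\ttrue\n"
    "GENE_A\tENST_DEMO_2\tfalse\n"
)
demo_biomart = io.StringIO(
    "Transcript stable ID\tGene name\t5' UTR start\t5' UTR end\n"
    "ENST_DEMO_1\tGENE_A\t100\t150\n"
    "ENST_DEMO_2\tGENE_A\t10\t20\n"
)
demo_rows = filter_biomart(demo_biomart, canonical_transcripts(demo_constraint))
```

The downloaded constraint file may be compressed, in which case it would need to be opened with `gzip.open(path, "rt")` rather than `open`; check the header line of your copy before relying on the column names assumed here.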
Great paper! The methods state "The start and end positions and sequence of the 5’UTRs of all protein-coding genes were downloaded from Ensembl biomart (Human genes GRCh37.p13) and filtered to only include canonical transcripts. Genes with no annotated 5’UTR on the canonical transcript were removed." I am having trouble recreating this from BioMart. How did you define "canonical"? Do you happen to have an example of the BioMart settings used to do this?