ImperialCardioGenetics / uORFs

5' UTR definition #2

Open arq5x opened 4 years ago

arq5x commented 4 years ago

Great paper! The methods state "The start and end positions and sequence of the 5’UTRs of all protein-coding genes were downloaded from Ensembl biomart (Human genes GRCh37.p13) and filtered to only include canonical transcripts. Genes with no annotated 5’UTR on the canonical transcript were removed." I am having trouble recreating this from BioMart. How did you define "canonical"? Do you happen to have an example of the BioMart settings to do this?

nickywhiff commented 4 years ago

Hi Aaron, thanks!

I used the following BioMart settings:

Dataset: Human genes (GRCh37.p13)

Filters: Gene type: protein_coding

Attributes:
- Transcript stable ID
- Gene name
- 5' UTR start
- 5' UTR end
- Transcription start site (TSS)
- Strand
- CDS start
- CDS end
- Exon region start (bp)
- Exon region end (bp)
- cDNA coding start
- cDNA coding end
- Exon rank in transcript
- Chromosome/scaffold name

I originally defined canonical by pulling from the Ensembl API, but I found this was missing info for some genes, so in the end I took those flagged as canonical in the LOEUF file from the gnomAD flagship paper. I hope this helps to recreate it!

arq5x commented 4 years ago

Ah, that makes sense. Thanks much.

arq5x commented 4 years ago

Could you specify exactly which LOEUF file you used, and how you handled multiple canonical transcripts where they exist (APPRIS often denotes multiple transcripts as primary)?

nickywhiff commented 4 years ago

It was the by-transcript TSV file here: https://gnomad.broadinstitute.org/downloads#v2-constraint

I believe I took every transcript that was flagged as canonical, so if a gene had more than one, I would have used all possible uAUG-creating variants from any of them.
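For anyone trying to reproduce this, the filtering described in the thread can be sketched roughly as below. This is not the authors' code. It assumes the constraint table has `transcript` and `canonical` columns (with `canonical` holding `true`/`false`), and that the BioMart export is tab-separated with the header names "Transcript stable ID", "5' UTR start" and "5' UTR end"; check both against your actual downloads. The toy rows and IDs such as `ENST_A1` are made up for illustration.

```python
import csv
import io


def load_canonical_ids(constraint_tsv):
    """Collect transcript IDs flagged canonical in a by-transcript constraint table.

    Assumed columns: 'transcript' and 'canonical' ('true'/'false').
    Every flagged transcript is kept, so a gene may contribute more than one.
    """
    reader = csv.DictReader(constraint_tsv, delimiter="\t")
    return {
        row["transcript"]
        for row in reader
        if row["canonical"].strip().lower() == "true"
    }


def filter_five_prime_utrs(biomart_tsv, canonical_ids):
    """Keep BioMart rows on canonical transcripts that carry 5' UTR coordinates.

    Rows with empty UTR fields are dropped, so a transcript with no
    annotated 5' UTR disappears from the output entirely.
    """
    reader = csv.DictReader(biomart_tsv, delimiter="\t")
    kept = []
    for row in reader:
        if row["Transcript stable ID"] not in canonical_ids:
            continue
        if not row["5' UTR start"] or not row["5' UTR end"]:
            continue
        kept.append(row)
    return kept


# Toy stand-ins for the two downloads (hypothetical IDs and coordinates).
TOY_CONSTRAINT = (
    "gene\ttranscript\tcanonical\n"
    "GENE_A\tENST_A1\ttrue\n"
    "GENE_A\tENST_A2\tfalse\n"
    "GENE_B\tENST_B1\ttrue\n"
)
TOY_BIOMART = (
    "Transcript stable ID\tGene name\t5' UTR start\t5' UTR end\n"
    "ENST_A1\tGENE_A\t100\t150\n"   # canonical, exon overlapping the 5' UTR
    "ENST_A1\tGENE_A\t\t\n"         # canonical, exon with no 5' UTR
    "ENST_A2\tGENE_A\t90\t150\n"    # not canonical
    "ENST_B1\tGENE_B\t\t\n"         # canonical, but no annotated 5' UTR
)

canonical_ids = load_canonical_ids(io.StringIO(TOY_CONSTRAINT))
utr_rows = filter_five_prime_utrs(io.StringIO(TOY_BIOMART), canonical_ids)

for row in utr_rows:
    print(row["Transcript stable ID"], row["5' UTR start"], row["5' UTR end"])
```

With real data, replace the `io.StringIO` objects with open file handles (the gnomAD file is bgzip-compressed, so it would need `gzip.open(path, "rt")` or prior decompression).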