gagneurlab / drop

Pipeline to find aberrant events in RNA-Seq data, useful for diagnosis of rare disorders
MIT License
128 stars 43 forks source link

Annotating results with Ensembl Id or HGNC Id #508

Closed Lucpen closed 3 weeks ago

Lucpen commented 5 months ago

Hi,

Thanks for such an amazing work with DROP, it is very useful and we are using it very often to look for aberrant expression and splicing in rare disease patients.

I wanted to ask you if there is a simple way to get FRASER output annotated with either Ensembl gene ID or HGNC gene ID other than converting the provided hgncSymbol after running DROP. I am unsure if this is default behaviour or if its caused by the files we provide it, so I have attached drop_config.txt and below you can find the initial lines of the gtf file used.

Thanks for your help,

Lucía

##description: evidence-based annotation of the human genome (GRCh38), version 37 (Ensembl 103)
##provider: GENCODE
##contact: gencode-help@ebi.ac.uk

##format: gtf
##date: 2020-12-07
chr1    HAVANA  gene    11869   14409   .   +   .   gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; hgnc_id "HGNC:37102"; havana_gene "OTTHUMG00000000961.2";
chr1    HAVANA  transcript  11869   14409   .   +   .   gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; level 2; transcript_support_level "1"; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1    HAVANA  exon    11869   12227   .   +   .   gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; transcript_support_level "1"; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1    HAVANA  exon    12613   12721   .   +   .   gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; exon_number 2; exon_id "ENSE00003582793.1"; level 2; transcript_support_level "1"; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1    HAVANA  exon    13221   14409   .   +   .   gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; exon_number 3; exon_id "ENSE00002312635.1"; level 2; transcript_support_level "1"; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
vyepez88 commented 5 months ago

Hi Lucía, Hope all is good on your side! Can you send one row of your FRASER results to see how the gene looks like? I always use the Gencode gtf file for my projects. The hgncSymbol corresponds to the HGNC gene ID, or what exactly are you looking for? To add Ensembl IDs, DROP outputs the file {root}/processed_data/preprocess/{gene_annotation}/gene_name_mapping_{gene_annotation}.tsv that contains both the hgncSymbol and the Ensembl IDs and can be used to merge with splicing and expression results.

Lucpen commented 5 months ago

Hi Vicente!

Thanks for the fast response!

All is good, I hope it is with you too! :)

This is the first line of our results: ''' sampleID seqnames start end width strand hgncSymbol type pValue psiValue deltaPsi counts totalCounts meanCounts meanTotalCounts nonsplitCounts nonsplitProportion nonsplitProportion_99quantile annotatedJunction pValueGene padjustGene PAIRED_END INDIVIDUAL_ID DNA_ID DROP_GROUP SPLICE_COUNTS_DIR HPO_TERMS GENE_COUNTS_FILE GENE_ANNOTATION GENOME SEX LIBPREP TISSUE AFFECTED isExternal potentialImpact causesFrameshift UTR_overlap blacklist ''' hgncSymbol would be something like NCF1C and we would like to get either the hgnc id 32523 or Ensembl id ENSG00000165178. Is it possible for this to happen inside DROP as it happens for OUTRIDER (we use the same gtf and the results come with the Ensembl id)? Is this something related to the reference files we are using?

Thanks again!

vyepez88 commented 3 weeks ago

Hi Lucia! So sorry, I thought I had answered this. As commented at the ESHG, the workarounds for this would be:

  1. DROP uses the 'gene_id' column of the provided gtf file. Therefore, if you provide a gtf file with the hgnc ids as the main ids, the OUTRIDER tables would have those ids as unique identifiers. See this script for more info: https://github.com/gagneurlab/drop/blob/master/drop/template/Scripts/Pipeline/preprocessGeneAnnotation.R However, I advice against having numbers as gene identifiers and keeping the ensemble ids
  2. You can edit the https://github.com/gagneurlab/drop/blob/master/drop/modules/aberrant-expression-pipeline/Counting/mergeCounts.R script to add a step to merge the colData of the ods object with hgnc ids
  3. My preferred one and maybe the easiest: you can leave DROP as is and then make follow-up scripts where you add your info of preference (e.g. hgnc ids, OMIM genes, Phenotype, Family ID, Affected status, etc.).
Lucpen commented 3 weeks ago

No worries. Thanks Vicente! :D