genome-nexus / genome-nexus-importer

Import data into MongoDB for use by https://github.com/genome-nexus/genome-nexus/
MIT License

Update grch38 ensembl95 canonical transcript #62

Closed leexgh closed 1 year ago

leexgh commented 2 years ago

Fix: https://github.com/genome-nexus/genome-nexus/issues/630

Reason to update transcripts: AMY1C canonical transcript differs in mskcc and ensembl. We found it is because /data/grch38_ensembl95/input/ensembl_biomart_geneids.txt is outdated, so we didn't call ensembl for the new transcript. See details: https://github.com/genome-nexus/genome-nexus/issues/630#issuecomment-1232492016

Changes in this pull request:

  1. Update /data/grch38_ensembl95/input/ensembl_biomart_geneids.txt. Run the following commands:
    export VERSION=grch38_ensembl95
    export SPECIES=homo_sapiens
    make $VERSION/input/ensembl_biomart_ccds.txt $VERSION/input/ensembl_biomart_geneids.txt $VERSION/input/ensembl_biomart_refseq.txt $VERSION/input/ensembl_biomart_pfam.txt
  2. Generate all files:
    make all VERSION=grch38_ensembl95 GFF3_URL=ftp://ftp.ensembl.org/pub/release-95/gff3/homo_sapiens/Homo_sapiens.GRCh38.95.gff3.gz QSIZE=1000 SPECIES=homo_sapiens

Todo: If this looks good, we probably need to rerun for grch37 as well.

leexgh commented 2 years ago

@inodb Do you think we should pull the latest ensembl files every time? For example, by adding

make $VERSION/input/ensembl_biomart_ccds.txt $VERSION/input/ensembl_biomart_geneids.txt $VERSION/input/ensembl_biomart_refseq.txt $VERSION/input/ensembl_biomart_pfam.txt

to make all? Or perhaps not every time, but at a regular interval.

leexgh commented 1 year ago

The values in the original hotspots file differ from the current ones, so I decided to remove all hotspots-related updates from this pull request.

https://raw.githubusercontent.com/cBioPortal/cancerhotspots/03b7523b2bc26178f8466f41ec943ad97b23c0cc/webapp/src/main/resources/data/v2_multi_type_residue.txt

(Download)

TP53    R213    R:256   213             P:2|Q:25|G:4|*:208|L:17 1.2909718119**355**e-203

vs (Current)

TP53    R213    R:256   213             P:2|Q:25|G:4|*:208|L:17 1.2909718119**354995**e-203 

pieterlukasse commented 1 year ago

@leexgh I've traced back the source of the differences and, from what I can see, they are most likely caused by a rounding issue when @sandertan processed everything back in 2019. The values in the current hotspots_v2_and_3d.txt already deviate from the original values in v2_multi_type_residue.txt. E.g.:

line from *hotspots_v2_and_3d.txt*   : >> EGFR  L858    ...0.0  3.411226984648378e-276
line from *v2_multi_type_residue.txt*: >> EGFR  L858    ...0    3.41122698464838e-276   
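
To tell cosmetic differences (such as 0 vs 0.0) apart from real numeric drift, the quoted values can be compared both as text and as parsed floats. Below is a minimal sketch using only the Python standard library; the function name classify_value_drift is hypothetical and not part of the repository:

```python
def classify_value_drift(original: str, derived: str) -> str:
    """Compare one field from the original file with the same field in the derived file."""
    if original == derived:
        return "identical text"
    # Equal numbers do not need equal text: "0" and "0.0" parse to the same float.
    if float(original) == float(derived):
        return "same number, different text"
    # The strings parse to different doubles, so the stored value itself changed.
    return "different number"


# Values quoted in this thread (EGFR L858 and TP53 R213 rows).
print(classify_value_drift("0", "0.0"))
print(classify_value_drift("3.41122698464838e-276", "3.411226984648378e-276"))
print(classify_value_drift("1.2909718119355e-203", "1.2909718119354995e-203"))
```

If a field is never parsed as a number in the first place, neither kind of difference can be introduced, which is the idea behind reading the columns as strings.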

Please apply these code changes to combine_2d_3d_add_mutation_type_counts_and_filter.py. This should ensure that the values in the final file correctly reflect the original values grabbed from 3d_hotspots.txt and v2_multi_type_residue.txt:

diff --git a/scripts/hotspots/combine_2d_3d_add_mutation_type_counts_and_filter.py b/scripts/hotspots/combine_2d_3d_add_mutation_type_counts_and_filter.py
index bd99760..370f7f0 100644
--- a/scripts/hotspots/combine_2d_3d_add_mutation_type_counts_and_filter.py
+++ b/scripts/hotspots/combine_2d_3d_add_mutation_type_counts_and_filter.py
@@ -62,12 +62,12 @@ if __name__ == "__main__":
     args = parser.parse_args()

-    hotspots_2d = pd.read_csv(args.hotspots_2d, sep="\t")
+    hotspots_2d = pd.read_csv(args.hotspots_2d, sep="\t", dtype=str)
     hotspots_2d.columns = [c.lower().replace("-","_") for c in hotspots_2d.columns]
-    hotspots_2d['type'] = hotspots_2d.indel_size.fillna(0).apply(lambda x: "in-frame indel" if x > 0 else "single residue")
+    hotspots_2d['type'] = hotspots_2d.indel_size.fillna(0).apply(lambda x: "in-frame indel" if int(x) > 0 else "single residue")
     hotspots_2d.loc[((hotspots_2d.type == "single residue") & hotspots_2d.residue.str.contains("X")), 'type'] = "splice site"

-    hotspots_3d = pd.read_csv(args.hotspots_3d, sep="\t")
+    hotspots_3d = pd.read_csv(args.hotspots_3d, sep="\t", dtype=str)
     hotspots_3d.columns = [c.lower().replace("-","_") for c in hotspots_3d.columns]
     # add type column
     hotspots_3d['type'] = '3d'
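
With dtype=str every cell arrives as a string, which is why the diff wraps the comparison in int(). The per-row logic can be sketched without pandas as follows. The function name hotspot_type and the X125 residue are hypothetical, and the sketch assumes indel_size is written as a plain integer (a value such as 3.0 would make int() raise ValueError):

```python
def hotspot_type(indel_size, residue):
    """Classify one 2D hotspot row whose fields were read as strings."""
    # A missing indel_size counts as 0, mirroring fillna(0) in the diff.
    size = int(indel_size) if indel_size not in (None, "") else 0
    if size > 0:
        return "in-frame indel"
    # Single-residue rows whose residue contains "X" are relabelled as splice sites.
    if "X" in residue:
        return "splice site"
    return "single residue"


print(hotspot_type("3", "A100"))   # indel_size given as a string
print(hotspot_type(None, "X125"))  # hypothetical splice-site residue
print(hotspot_type("0", "R213"))   # residue taken from the TP53 row above
```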

leexgh commented 1 year ago

@pieterlukasse Thanks for pointing it out! ensembl_biomart_pfam.txt is generated by retrieve_biomart_tables.R. I've checked retrieve_biomart_tables.R: it doesn't include an ENSP id column, which is why the new file is missing the whole column. According to the history, this file has never been modified since the initial commit. I'm not sure why the original file contains an ENSP column.

I also checked where ensembl_biomart_pfam.txt is used in the codebase; none of those places uses the "ENSP" values.