bcgsc / mavis

Merging, Annotation, Validation, and Illustration of Structural variants
http://mavis.bcgsc.ca
GNU General Public License v3.0

generate_ensembl_json.py #207

Closed mattdoug604 closed 4 years ago

mattdoug604 commented 4 years ago

Re-wrote 'generate_ensembl_json.pl' in Python to avoid having to deal with Perl dependency issues.

creisle commented 4 years ago

Some warnings are showing up that should be fixed:

tools/generate_ensembl_json.py:129: DeprecationWarning: time.clock has been deprecated in Python 3.3 and will be removed from Python 3.8: use time.perf_counter or time.process_time instead
  elapsed = time.clock() - last_time_called[0]
tools/generate_ensembl_json.py:136: DeprecationWarning: time.clock has been deprecated in Python 3.3 and will be removed from Python 3.8: use time.perf_counter or time.process_time instead
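
For reference, a minimal sketch of how a rate-limiting decorator could drop `time.clock`. The decorator name `rate_limited` and its structure are assumptions for illustration; only the `last_time_called[0]` pattern comes from the warning above. `time.perf_counter` is used rather than `time.process_time` because the latter does not count time spent sleeping, which would make the limiter wait longer than intended.

```python
import time
from functools import wraps


def rate_limited(max_per_second):
    """Space calls to the wrapped function at least 1/max_per_second apart.

    Hypothetical helper; not necessarily how the script is structured.
    """
    min_interval = 1.0 / max_per_second

    def decorate(func):
        last_time_called = [None]  # mutable cell shared between calls

        @wraps(func)
        def wrapper(*args, **kwargs):
            if last_time_called[0] is not None:
                # perf_counter includes time elapsed during sleep
                elapsed = time.perf_counter() - last_time_called[0]
                wait = min_interval - elapsed
                if wait > 0:
                    time.sleep(wait)
            result = func(*args, **kwargs)
            last_time_called[0] = time.perf_counter()
            return result

        return wrapper

    return decorate
```

Any function that queries a remote server could then be wrapped with, say, `@rate_limited(15)`; the right rate depends on what the server allows.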

codecov[bot] commented 4 years ago

Codecov Report

Merging #207 into develop will increase coverage by 10.69%. The diff coverage is n/a.


@@             Coverage Diff              @@
##           develop     #207       +/-   ##
============================================
+ Coverage    80.48%   91.18%   +10.69%     
============================================
  Files           52       52               
  Lines         9123     9123               
============================================
+ Hits          7343     8319      +976     
+ Misses        1780      804      -976     
Flag Coverage Δ
#unittests 91.18% <ø> (+10.69%) :arrow_up:


Impacted Files Coverage Δ
mavis/annotate/fusion.py 96.02% <0.00%> (+0.33%) :arrow_up:
mavis/bam/cigar.py 98.14% <0.00%> (+0.61%) :arrow_up:
mavis/schedule/job.py 78.26% <0.00%> (+0.72%) :arrow_up:
mavis/cluster/main.py 90.24% <0.00%> (+0.81%) :arrow_up:
mavis/annotate/genomic.py 93.12% <0.00%> (+1.03%) :arrow_up:
mavis/breakpoint.py 94.61% <0.00%> (+1.15%) :arrow_up:
mavis/illustrate/elements.py 96.37% <0.00%> (+1.69%) :arrow_up:
mavis/blat.py 90.27% <0.00%> (+1.85%) :arrow_up:
mavis/align.py 92.65% <0.00%> (+2.23%) :arrow_up:
mavis/annotate/file_io.py 88.73% <0.00%> (+2.25%) :arrow_up:
... and 15 more


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 5f26d29...da4a7a6.

mattdoug604 commented 4 years ago

I realized that "hugo alias file" is a misnomer. The HUGO name is the 'gene_name' taken from the pyensembl annotations. We've only used the "hugo file" to add other aliases, such as RefSeq IDs.

Therefore I think it makes more sense to rename the "hugo file" to a generic "gene alias" file.

Nitin123-4 commented 4 years ago

generate_ensembl_json.py is running but it is very slow.

mattdoug604 commented 4 years ago

The old Ensembl Perl API had a 'canonical transcript' value for each gene that we can't get using the pyensembl package. The reason is that the Perl API queries Ensembl directly, whereas pyensembl only uses downloaded GTF files, which carry less information.

There are different lists of canonical transcripts floating around (UCSC, MANE, APPRIS), but getting those programmatically and mapping them back to Ensembl IDs is going to be a bit of a headache.

What we can do is choose a canonical transcript using the same rules Ensembl uses for choosing a canonical transcript (see http://uswest.ensembl.org/Help/Glossary?id=346):

The canonical transcript is used in the gene tree analysis in Ensembl and does not necessarily reflect the most biologically relevant transcript of a gene. For human, the canonical transcript for a gene is set according to the following hierarchy:

1. Longest CCDS translation with no stop codons.
2. If no (1), choose the longest Ensembl/Havana merged translation with no stop codons.
3. If no (2), choose the longest translation with no stop codons.
4. If no translation, choose the longest non-protein-coding transcript.
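
A rough sketch of how that hierarchy could be applied in Python. The `Transcript` record and its field names are hypothetical, not pyensembl's API, and the final fallback (longest transcript overall when every translation contains a stop codon) is an assumption, since the glossary text does not cover that case.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Transcript:
    # Hypothetical record; field names are illustrative only.
    transcript_id: str
    length: int                    # transcript length in bases
    translation_length: int = 0    # 0 when there is no translation
    has_internal_stop: bool = False
    is_ccds: bool = False
    is_merged: bool = False        # Ensembl/Havana merged annotation


def pick_canonical(transcripts: List[Transcript]) -> Optional[Transcript]:
    """Pick one transcript per gene following the four-step hierarchy."""
    if not transcripts:
        return None
    # Translations with no stop codons are candidates for rules 1 to 3.
    clean = [
        t for t in transcripts
        if t.translation_length > 0 and not t.has_internal_stop
    ]
    tiers = [
        [t for t in clean if t.is_ccds],    # rule 1
        [t for t in clean if t.is_merged],  # rule 2
        clean,                              # rule 3
    ]
    for tier in tiers:
        if tier:
            return max(tier, key=lambda t: t.translation_length)
    # Rule 4: longest non-coding transcript. Falling back to all
    # transcripts when none are non-coding is an assumption.
    noncoding = [t for t in transcripts if t.translation_length == 0]
    return max(noncoding or transcripts, key=lambda t: t.length)
```

Ties are broken by input order here, because `max` returns the first maximal element; a real implementation would probably want a deterministic tie-breaker such as the transcript ID.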

mattdoug604 commented 4 years ago

generate_ensembl_json.py is running but it is very slow.

Yes, unfortunately it is very slow the first time, as it has to query Ensembl many times to download the protein domains for each protein.
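
To illustrate why only the first run is slow, here is a minimal sketch of an on-disk cache around a slow lookup. How the script actually caches results is not shown in this thread; `cached_lookup`, the `fetch` callable and the JSON cache file are all assumptions made for the example.

```python
import json
import os


def cached_lookup(protein_id, fetch, cache_path):
    """Return the domains for protein_id, calling fetch() only on a miss.

    fetch is any callable taking a protein ID and returning JSON-serializable
    data; in practice it would be the slow remote query.
    """
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as fh:
            cache = json.load(fh)
    if protein_id not in cache:
        cache[protein_id] = fetch(protein_id)
        # Rewriting the whole file on every miss is simple but not efficient.
        with open(cache_path, "w") as fh:
            json.dump(cache, fh)
    return cache[protein_id]
```

On a second run every protein already in the cache file is served locally, so no remote request is made for it.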

Nitin123-4 commented 4 years ago

Thanks for your reply.

For the human and mouse genomes, how long should it take to complete?

Regards

On Thu, 9 Jul 2020 at 21:49, Matt Douglas notifications@github.com wrote:

generate_ensembl_json.py is running but it is very slow.

Yes unfortunately it is very slow the first time as it has to query Ensembl a bunch of times to download the protein domains for each protein.


-- Regards

Nitin Mandloi Bioinformatics Associate Scientist

mattdoug604 commented 4 years ago

For the human and mouse genomes, how long should it take to complete?

I recall it taking ~1 day. It's heavily dependent on how fast Ensembl's server responds.

That said, I just increased the request rate in the script so hopefully that speeds things up somewhat.