Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0

Missed sorting of Consequence field #1066

Status: Closed (Stikus closed this 1 year ago)

Stikus commented 2 years ago

Description

We found differences similar to those described here, now in the Consequence field.

System

Full VEP command line

/soft/ensembl-vep-104.3/vep --fork 96 --input_file /inputs/770500172011_S11_manta.diploidSV.vcf --format vcf --cache --dir_cache /ref/.vep --dir_plugins /soft/ensembl-vep-104.3/Plugins --assembly GRCh38 --offline --output_file /inputs/vep_104_1/770500172011_S11_manta-diploidSV.annot-vep.vcf --vcf --force_overwrite --symbol --check_existing --terms SO --tsl --hgvs --hgvsg --transcript_version --failed 1 --af --max_af --af_1kg --af_esp --af_gnomad --pubmed --sift b --polyphen b --variant_class --gene_phenotype --regulatory --numbers --domains --protein --canonical --ccds --uniprot --biotype --shift_hgvs 1 --xref_refseq --flag_pick_allele --no_escape --minimal --allele_number --total_length     --no_stats

Data files (if applicable)

chr1  209761990       MantaDEL:8732:1:2:0:0:0 AGCAGGAGAAGCCTGTGTGGCTCTGAAAGCTTGAGTGGGCTACTTTGGTTTTGTGGATCTGAAACATTGTCTTCGTCCTGTAATGAATTACCACAGACTGGGTGGCTTAAATAACAGAAATTTATTTCTTACAATTCTGGAGTCTGGAGTTCTAAACCCATGGATTCTGCAGATCTGGTGTCTGATGAGAGCACTCTTCCCACTATGCAGATGGCCTTCTTCTCATTGTGTCCTCACAAGGCAGAGAGCAGAGAGAGAAAGCAAGCTCTCTCGTGTCTCTTCTAATAATGGCACTAATCCCATTCATGAGGGCTACCTGCTCTGACCTAATTATTTCTCAAAGTCTCCACCTCCTAATACTATCACATTGGGAGATAGGATTTCAACACACAGATTTGGGCGGGGGGACAATAAACACTCAGTCCATAACAAACATGATGGTTAGTTCTTCCTTCCGAAATCATCAGGAGAGTCTTTCAAGAGGCTCCAGGCAGTGGTTTTCAGCATTCCCCACTTGCCTTCCAGGTGACAGGCACCAGCTCTGAAGTCTTTCCAGCCCAGCATCCTCCTCCCTCAGGCATCTGCAGGGATCTGTCTGACCACCTCTCCTCACAGGCTGGGGGCCTTCCTCCACAGGACACTCCCATCAAGAAGCCACCCAAACACCACCGTGGTAAGAGCAGAGCCTCCCTCACTCCACAGGGCCTGCAGAGAATCCTGAGACAACTGTCCCAGCTCACTG  A       141     PASS    END=209762731;SVTYPE=DEL;SVLEN=-741;CIGAR=1M741D;CIPOS=0,3;HOMLEN=3;HOMSEQ=GCA;CSQ=-|splice_donor_variant&splice_acceptor_variant&coding_sequence_variant&intron_variant|HIGH|TRAF3IP3|ENSG00000009790|Transcript|ENST00000367024.5|protein_coding|4/17|3-4/16|ENST00000367024.5:c.346-521_493+72del||-/2331|-/1656|-/551||||1||1|||deletion|1|HGNC|HGNC:30766||2|CCDS1490.2|ENSP00000355991|Q9Y228.146||UPI00005190E1|Q9Y228-1|NM_001320143.2||||||chr1:g.209761994_209762734del||||||||||||||||||||||||||||,-|splice_donor_variant&splice_acceptor_variant&coding_sequence_variant&intron_variant|HIGH|TRAF3IP3|ENSG00000009790|Transcript|ENST00000367025.8|protein_coding|4/17|3-4/16|ENST00000367025.8:c.346-521_493+72del||-/2234|-/1656|-/551||||1||1||1|deletion|1|HGNC|HGNC:30766|YES|1|CCDS1490.2|ENSP00000355992|Q9Y228.146||UPI00005190E1|Q9Y228-1|NM_025228.4||||||chr1:g.209761994_209762734del||||||||||||||||||||||||||||,-|splice_donor_variant&splice_acceptor_variant&coding_sequence_variant&intron_variant|HIGH|TRAF3IP3|ENSG00000009790|Transcript|ENST00000367026.7|protein_coding|4/17|3-4/16|ENST00000367026.7:c.286-521_433+72del||-/2058|-/1596|-/531||||1||1|||deletion|1|HGNC|HGNC:30766||1|CCDS81422
.1|ENSP00000355993|Q9Y228.146||UPI000006E12F|Q9Y228-2|NM_001320144.2||||||chr1:g.209761994_209762734del||||||||||||||||||||||||||||,-|splice_donor_variant&splice_acceptor_variant&coding_sequence_variant&intron_variant|HIGH|TRAF3IP3|ENSG00000009790|Transcript|ENST00000400959.7|protein_coding|4/14|3-4/13|ENST00000400959.7:c.286-521_433+72del||-/1854|-/1215|-/404||||1||1|||deletion|1|HGNC|HGNC:30766||5||ENSP00000383743||E2QRE5.62|UPI0000D4D397||||||||chr1:g.209761994_209762734del||||||||||||||||||||||||||||,-|splice_acceptor_variant&non_coding_transcript_exon_variant&intron_variant|HIGH|TRAF3IP3|ENSG00000009790|Transcript|ENST00000468672.5|processed_transcript|4/4|3/3|||-/570||||||1||1|||deletion|1|HGNC|HGNC:30766||4|||||||||||||chr1:g.209761994_209762734del||||||||||||||||||||||||||||,-|splice_donor_variant&splice_acceptor_variant&coding_sequence_variant&intron_variant&NMD_transcript_variant|HIGH|TRAF3IP3|ENSG00000009790|Transcript|ENST00000478359.5|nonsense_mediated_decay|4/13|3-4/12|ENST00000478359.5:c.346-521_493+72del||-/1860|-/1062|-/353||||1||1|||deletion|1|HGNC|HGNC:30766||1||ENSP00000417665|Q9Y228.146||UPI00005EE230|Q9Y228-3|||||||chr1:g.209761994_209762734del||||||||||||||||||||||||||||,-|downstream_gene_variant|MODIFIER|TRAF3IP3|ENSG00000009790|Transcript|ENST00000479796.5|protein_coding|||||||||||1|1874|1|cds_end_NF||deletion|1|HGNC|HGNC:30766||4||ENSP00000419180||C9JXB3.54|UPI0001B79901||||||||chr1:g.209761994_209762734del||||||||||||||||||||||||||||  GT:FT:GQ:PL:PR:SR       0/1:PASS:141:191,0,719:29,2:42,6
chr1  209761990       MantaDEL:8732:1:2:0:0:0 AGCAGGAGAAGCCTGTGTGGCTCTGAAAGCTTGAGTGGGCTACTTTGGTTTTGTGGATCTGAAACATTGTCTTCGTCCTGTAATGAATTACCACAGACTGGGTGGCTTAAATAACAGAAATTTATTTCTTACAATTCTGGAGTCTGGAGTTCTAAACCCATGGATTCTGCAGATCTGGTGTCTGATGAGAGCACTCTTCCCACTATGCAGATGGCCTTCTTCTCATTGTGTCCTCACAAGGCAGAGAGCAGAGAGAGAAAGCAAGCTCTCTCGTGTCTCTTCTAATAATGGCACTAATCCCATTCATGAGGGCTACCTGCTCTGACCTAATTATTTCTCAAAGTCTCCACCTCCTAATACTATCACATTGGGAGATAGGATTTCAACACACAGATTTGGGCGGGGGGACAATAAACACTCAGTCCATAACAAACATGATGGTTAGTTCTTCCTTCCGAAATCATCAGGAGAGTCTTTCAAGAGGCTCCAGGCAGTGGTTTTCAGCATTCCCCACTTGCCTTCCAGGTGACAGGCACCAGCTCTGAAGTCTTTCCAGCCCAGCATCCTCCTCCCTCAGGCATCTGCAGGGATCTGTCTGACCACCTCTCCTCACAGGCTGGGGGCCTTCCTCCACAGGACACTCCCATCAAGAAGCCACCCAAACACCACCGTGGTAAGAGCAGAGCCTCCCTCACTCCACAGGGCCTGCAGAGAATCCTGAGACAACTGTCCCAGCTCACTG  A       141     PASS    END=209762731;SVTYPE=DEL;SVLEN=-741;CIGAR=1M741D;CIPOS=0,3;HOMLEN=3;HOMSEQ=GCA;CSQ=-|splice_acceptor_variant&splice_donor_variant&coding_sequence_variant&intron_variant|HIGH|TRAF3IP3|ENSG00000009790|Transcript|ENST00000367024.5|protein_coding|4/17|3-4/16|ENST00000367024.5:c.346-521_493+72del||-/2331|-/1656|-/551||||1||1|||deletion|1|HGNC|HGNC:30766||2|CCDS1490.2|ENSP00000355991|Q9Y228.146||UPI00005190E1|Q9Y228-1|NM_001320143.2||||||chr1:g.209761994_209762734del||||||||||||||||||||||||||||,-|splice_acceptor_variant&splice_donor_variant&coding_sequence_variant&intron_variant|HIGH|TRAF3IP3|ENSG00000009790|Transcript|ENST00000367025.8|protein_coding|4/17|3-4/16|ENST00000367025.8:c.346-521_493+72del||-/2234|-/1656|-/551||||1||1||1|deletion|1|HGNC|HGNC:30766|YES|1|CCDS1490.2|ENSP00000355992|Q9Y228.146||UPI00005190E1|Q9Y228-1|NM_025228.4||||||chr1:g.209761994_209762734del||||||||||||||||||||||||||||,-|splice_acceptor_variant&splice_donor_variant&coding_sequence_variant&intron_variant|HIGH|TRAF3IP3|ENSG00000009790|Transcript|ENST00000367026.7|protein_coding|4/17|3-4/16|ENST00000367026.7:c.286-521_433+72del||-/2058|-/1596|-/531||||1||1|||deletion|1|HGNC|HGNC:30766||1|CCDS81422
.1|ENSP00000355993|Q9Y228.146||UPI000006E12F|Q9Y228-2|NM_001320144.2||||||chr1:g.209761994_209762734del||||||||||||||||||||||||||||,-|splice_acceptor_variant&splice_donor_variant&coding_sequence_variant&intron_variant|HIGH|TRAF3IP3|ENSG00000009790|Transcript|ENST00000400959.7|protein_coding|4/14|3-4/13|ENST00000400959.7:c.286-521_433+72del||-/1854|-/1215|-/404||||1||1|||deletion|1|HGNC|HGNC:30766||5||ENSP00000383743||E2QRE5.62|UPI0000D4D397||||||||chr1:g.209761994_209762734del||||||||||||||||||||||||||||,-|splice_acceptor_variant&non_coding_transcript_exon_variant&intron_variant|HIGH|TRAF3IP3|ENSG00000009790|Transcript|ENST00000468672.5|processed_transcript|4/4|3/3|||-/570||||||1||1|||deletion|1|HGNC|HGNC:30766||4|||||||||||||chr1:g.209761994_209762734del||||||||||||||||||||||||||||,-|splice_acceptor_variant&splice_donor_variant&coding_sequence_variant&intron_variant&NMD_transcript_variant|HIGH|TRAF3IP3|ENSG00000009790|Transcript|ENST00000478359.5|nonsense_mediated_decay|4/13|3-4/12|ENST00000478359.5:c.346-521_493+72del||-/1860|-/1062|-/353||||1||1|||deletion|1|HGNC|HGNC:30766||1||ENSP00000417665|Q9Y228.146||UPI00005EE230|Q9Y228-3|||||||chr1:g.209761994_209762734del||||||||||||||||||||||||||||,-|downstream_gene_variant|MODIFIER|TRAF3IP3|ENSG00000009790|Transcript|ENST00000479796.5|protein_coding|||||||||||1|1874|1|cds_end_NF||deletion|1|HGNC|HGNC:30766||4||ENSP00000419180||C9JXB3.54|UPI0001B79901||||||||chr1:g.209761994_209762734del||||||||||||||||||||||||||||  GT:FT:GQ:PL:PR:SR       0/1:PASS:141:191,0,719:29,2:42,6

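Here are the differences: between the two runs above, the order of splice_donor_variant and splice_acceptor_variant in the Consequence field is swapped.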

Can you fix this please?

helensch commented 2 years ago

Hi @stikus

Thank you for this report and the example input variant. I have been able to reproduce the issue.

I will let you know when we have a fix in place that reports the consequences in the same order across repeated runs.

Regards Helen

Stikus commented 2 years ago

Hello, any progress on this? Our internal tests sometimes fail because of the missing sorting, even on VEP 105.

Maybe we can help somehow? It looks like sorting should be added somewhere here, but I'm bad with Perl: https://github.com/Ensembl/ensembl-vep/blob/release/105/modules/Bio/EnsEMBL/VEP/OutputFactory.pm#L1251
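Until a fix lands, one way to keep regression tests from failing on this is to compare CSQ values with the consequence terms put into a canonical order first. The following is only a minimal sketch in Python, not part of VEP: `normalize_csq` is a hypothetical helper, and it assumes that Consequence is the second pipe-separated field (as in the example records above) and that no field contains an unescaped comma or pipe.

```python
def normalize_csq(csq_value, consequence_index=1):
    """Return the CSQ value with each entry's consequence terms sorted by name.

    Assumptions (not taken from VEP itself): entries are comma-separated,
    fields are pipe-separated, and the Consequence field sits at
    consequence_index.
    """
    entries = []
    for entry in csq_value.split(","):
        fields = entry.split("|")
        terms = fields[consequence_index].split("&")
        fields[consequence_index] = "&".join(sorted(terms))
        entries.append("|".join(fields))
    return ",".join(entries)


# Shortened versions of the two records from the report.
run_1 = "-|splice_donor_variant&splice_acceptor_variant&intron_variant|HIGH|TRAF3IP3"
run_2 = "-|splice_acceptor_variant&splice_donor_variant&intron_variant|HIGH|TRAF3IP3"

print(run_1 == run_2)                                # False
print(normalize_csq(run_1) == normalize_csq(run_2))  # True
```

This only hides the difference on the test side; the annotated VCF files themselves still differ.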

jamie-m-a commented 2 years ago

Hi @Stikus,

This one is still being investigated; we will get back to you when a fix is in place.

Cheers, Jamie.

serge2016 commented 2 years ago

Hello! I have run into the same issue. Could you give us an ETA, please?

jamie-m-a commented 2 years ago

Hi @serge2016

Sorry for the delay on this. We have a couple of instances where sorting has been affecting the VEP output, generally under specific circumstances. We believe it's related to the version of Perl being used, and we are attempting to resolve all these sorting bugs together. I hope to have a more useful update on this for you soon. In the meantime, thanks for your patience.

While I cannot guarantee it, using Perl 5.14 may be a temporary workaround for the sorting bugs.

ntm commented 1 year ago

Hi,

isn't this simply because Perl's sort is not necessarily stable? As explained in perldoc -f sort, to guarantee stability you need to use sort 'stable'. In the OP's example the order varies between splice_donor_variant and splice_acceptor_variant, and as can be seen these two consequences both have rank = 3: https://www.ensembl.org/info/docs/Doxygen/variation-api/Utils_2Constants_8pm_source.html (lines 456-492)

Therefore sorting by rank can list the consequences in varying order from run to run, as in: https://github.com/Ensembl/ensembl-vep/blob/5aa4f5c8bdcdeb4dc27ba114497601dacdc66e73/modules/Bio/EnsEMBL/VEP/OutputFactory.pm#L1247

Does adding use sort 'stable'; at the top of OutputFactory.pm solve the issue?
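To make the tie concrete, here is a small sketch in Python rather than Perl, so it is an illustration of the idea and not VEP code. Only the shared rank of 3 for the two splice terms comes from this thread; the other rank values and both function names are hypothetical. Python's `sorted` is a stable sort, so tied terms keep their input order, and the name-based tie-breaker is an alternative technique shown for comparison, not something proposed in the thread.

```python
# Rank 3 for both splice terms is taken from the discussion above;
# the other two values are placeholders, not copied from VEP.
RANK = {
    "splice_acceptor_variant": 3,
    "splice_donor_variant": 3,
    "coding_sequence_variant": 20,
    "intron_variant": 21,
}


def sort_by_rank_only(terms):
    # Stable sort on rank alone: tied terms stay in their arrival order.
    return sorted(terms, key=lambda term: RANK[term])


def sort_by_rank_then_name(terms):
    # The term name breaks ties, so the result is independent of input order.
    return sorted(terms, key=lambda term: (RANK[term], term))


# The same four terms arriving in two different orders (hypothetical runs).
run_a = ["intron_variant", "splice_donor_variant",
         "splice_acceptor_variant", "coding_sequence_variant"]
run_b = ["splice_acceptor_variant", "intron_variant",
         "coding_sequence_variant", "splice_donor_variant"]

print("&".join(sort_by_rank_only(run_a)))
print("&".join(sort_by_rank_only(run_b)))
print("&".join(sort_by_rank_then_name(run_a)))
print("&".join(sort_by_rank_then_name(run_b)))
```

The first two printed lines differ in the order of the two splice terms, while the last two are identical. A stable sort on its own gives a repeatable result only if the terms arrive in the same order on every run; whether that holds inside VEP is not established in this thread.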

ikarus97 commented 1 year ago

Hi,

I've experienced the same issue, especially for variants that overlap both an acceptor site and a donor site (e.g. a variant deleting an entire short exon). I haven't tested it extensively, but adding use sort 'stable'; at the top of OutputFactory.pm seemed to work in my case (thanks @ntm!). I hope future VEP releases resolve this issue.

Cheers, In-Hee

Stikus commented 1 year ago

Nice to see that a solution has been found. I hope the Ensembl team can add this fix to the upcoming 110 release.

serge2016 commented 1 year ago

Hello! Any progress here?

Is there an MR with this fix?

ntm commented 1 year ago

I didn't take the time to clone the repo and submit a PR with https://github.com/Ensembl/ensembl-vep/issues/1066#issuecomment-1351132873 since it's a one-liner fix. But up to now only @ikarus97 has reported that it seems to fix the issue for him; @nuno-agostinho may be waiting for additional feedback. @serge2016 and @Stikus, does it also fix it for you?

nuno-agostinho commented 1 year ago

Hey @Stikus, @ntm, @serge2016 and @ikarus97,

Sorry for the inconvenience.

We have decided to improve the consistency of the internal ranking for different consequences: each consequence now has its own rank, so the order can no longer vary between runs. We also created some unit tests to help us adhere to this in the future.
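A check of this kind can be sketched as follows. This is a Python illustration, not the actual VEP unit tests: `find_rank_collisions` is a hypothetical helper, and the rank values in `NEW_RANKS` are made up, since the thread does not say which rank each consequence received.

```python
from collections import defaultdict


def find_rank_collisions(ranks):
    """Return {rank: [terms]} for every rank shared by more than one term."""
    terms_by_rank = defaultdict(list)
    for term, rank in ranks.items():
        terms_by_rank[rank].append(term)
    return {
        rank: sorted(terms)
        for rank, terms in terms_by_rank.items()
        if len(terms) > 1
    }


# Before: the two splice terms share rank 3, as reported earlier in the thread.
OLD_RANKS = {"splice_acceptor_variant": 3, "splice_donor_variant": 3, "intron_variant": 21}
# After: hypothetical unique ranks, for illustration only.
NEW_RANKS = {"splice_acceptor_variant": 3, "splice_donor_variant": 4, "intron_variant": 21}

print(find_rank_collisions(OLD_RANKS))  # {3: ['splice_acceptor_variant', 'splice_donor_variant']}
print(find_rank_collisions(NEW_RANKS))  # {}
```

A test that asserts the collision map is empty fails as soon as two consequences are given the same rank again.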

Those changes will be included in release 110, which is coming soon.

Thanks for reporting the issue and for the proposed solutions!

Kind regards, Nuno