Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0

haplosaurus tool switched into uninterruptible sleep #214

Closed vivekruhela closed 6 years ago

vivekruhela commented 6 years ago

Hi,

I am interested in obtaining protein sequences from patient data. Following the suggestion in this question, I tried this tool to get the protein sequences, but I have run into a new problem. I ran the tool with the command shown below:

./haplo -i /mnt/storage/MM_Data/SM_5_WES/Variant-Calling/SM_5.updated_dbsnp.vcf.gz -o /mnt/storage/MM_Data/SM_5_WES/Variant-Calling/SM_5.haplosauras.txt -offline --dir_cache /home/ensembl-vep/homo_sapiens/

The ./haplo script runs well at first, but after some time the process goes into uninterruptible sleep mode. I have confirmed this twice using htop on my server. The first time, I assumed the system was just slow because many other jobs (GATK etc.) were running in parallel, so I interrupted the run after 48 hours. I then stopped all other jobs and ran only this command. When I checked after 24 hours, it was stuck at the same warnings and was in uninterruptible sleep mode again. I don't know what causes this: as far as I know, a process switches into uninterruptible sleep only when there is a problem with data I/O, yet the other processes on my server are working fine. Does this mean the operation is complete, or is there some other problem in the module?

During the run, the tool also gives many warnings like this one:

WARNING: genomic coord 51239295-51239309 possibly maps across coding/non-coding boundary in ENST00000375992

Any suggestions?
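For readers who want to check this state without htop: on Linux, the third field of /proc/<pid>/stat holds the process state, and the letter D means uninterruptible sleep. The snippet below is a minimal sketch of reading that field. The function names are hypothetical, and it assumes a Linux /proc filesystem; it is not part of Haplosaurus or VEP.

```python
def parse_proc_state(stat_line: str) -> str:
    """Return the one-letter state from the contents of /proc/<pid>/stat.

    The second field (the command name) is wrapped in parentheses and may
    itself contain spaces or ')', so we split after the LAST ')' instead of
    splitting the whole line on whitespace.
    """
    after_comm = stat_line[stat_line.rindex(")") + 1:].split()
    return after_comm[0]


def is_uninterruptible(stat_line: str) -> bool:
    """True if the state letter is 'D' (uninterruptible sleep)."""
    return parse_proc_state(stat_line) == "D"


def read_state(pid: int) -> str:
    """Read the state of a live process (Linux only; hypothetical helper)."""
    with open(f"/proc/{pid}/stat") as handle:
        return parse_proc_state(handle.read())
```

A process that stays in state D for hours is usually waiting on I/O inside the kernel, so this check tells you that the process is blocked but not why.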

vivekruhela commented 6 years ago

Please read "uninterruptible sleep" as "uninterruptible sleep mode" throughout: the process goes into uninterruptible sleep mode. Sorry about that. Thanks.

vivekruhela commented 6 years ago

Hello, is anyone here? I hope the team has seen my issue. Any suggestions? Haplosaurus, with or without the json option, goes into uninterruptible sleep mode after some time. I have tried and confirmed this many times. What could be the possible reasons for this? Thanks.

at7 commented 6 years ago

Hello, we will be looking into the problem. Could you please give us some more information about your input data? What types of variants does it contain, how many variants are there, and how many genotypes or individuals are stored in the input VCF file? Thank you

sarahhunt commented 6 years ago

Hi @vivekruhela.

Your stackexchange query says your objective is to annotate variants with SIFT. You can do this with VEP, without calculating protein sequences. Do you have it installed? VEP takes a variant list (in VCF or other formats) as input and provides SIFT results alongside similar tools such as PolyPhen2, REVEL, CADD, etc. Have a look at the 'Pathogenicity predictions' section of the web tool here: http://www.ensembl.org/Multi/Tools/VEP. You need to enable dbNSFP to see a fuller list.

The error message you are seeing from haplo suggests your input file contains long variants overlapping multiple exons. These should be skipped rather than cause the process to hang, so it is not clear what is going wrong. As Anja said, any further information you can provide on your input data would be helpful.

vivekruhela commented 6 years ago

@at7: Thanks for the reply. The command line shown in my issue uses a VCF file for one patient containing 456840 variants, including several variant types such as nsSNV, exonic, UTR etc. @sarahhunt: In my stackexchange post, I was interested in the protein sequence of each patient, and I was advised to use Haplosaurus with json to get it. I have used ANNOVAR to get functional significance scores and VEP to determine the effects of variants. I have not tried VEP to get scores. Let me know if I missed anything. Thanks.

at7 commented 6 years ago

If you want protein sequences and functional significance scores you can use VEP and add options from https://www.ensembl.org/info/docs/tools/vep/script/vep_options.html#output:

Using Haplosaurus: Haplosaurus takes phased genotypes from a VCF and constructs a pair of haplotype sequences for each overlapped transcript; these sequences are also translated into predicted protein haplotype sequences. Each variant haplotype sequence is aligned and compared to the reference, and an HGVS-like name is constructed representing its differences to the reference.
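To illustrate why phasing matters for this step, here is a minimal sketch of building a pair of haplotype sequences from phased genotypes. It is not Haplosaurus's implementation: the function name, the toy reference, and the restriction to single-base substitutions are all assumptions made for the example, and it stops before translation to protein.

```python
def build_haplotypes(reference: str, start: int, variants):
    """Apply phased SNVs to a reference and return the two haplotype sequences.

    reference: sequence of the region (toy example, not a real transcript)
    start:     1-based genomic coordinate of reference[0]
    variants:  iterable of (pos, ref, alt, gt) tuples, gt like '0|1'
    """
    haplotypes = [list(reference), list(reference)]
    for pos, ref, alt, gt in variants:
        if "|" not in gt:
            # Without phasing we cannot tell which copy carries the allele.
            raise ValueError(f"genotype {gt!r} at {pos} is not phased")
        if len(ref) != 1 or len(alt) != 1:
            raise ValueError("this sketch handles single-base substitutions only")
        index = pos - start
        if reference[index] != ref:
            raise ValueError(f"REF allele mismatch at {pos}")
        # The left allele belongs to haplotype 0, the right to haplotype 1.
        for hap, allele in enumerate(gt.split("|")):
            if allele == "1":
                haplotypes[hap][index] = alt
    return "".join(haplotypes[0]), "".join(haplotypes[1])


hap_a, hap_b = build_haplotypes(
    "ATGGCC", 100, [(101, "T", "C", "0|1"), (104, "C", "G", "1|0")]
)
```

With an unphased genotype such as 0/1, the two heterozygous sites above could sit on the same copy or on different copies, which is why the sketch refuses to guess.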

Can you please confirm that your input VCF file contains phased genotypes? I couldn't reproduce the error you get when running Haplosaurus. It could also be related to the operating system you are using. Could you please also give us details about the type and version of your operating system?

vivekruhela commented 6 years ago

@at7: I think so, yes; my VCF file is phased. I am posting some lines from my VCF file so that you can confirm:

chrM 150 . T C 1668.77 PASS ABHom=1;AC=2;AF=1;AN=2;DP=58;ExcessHet=3.0103;FS=0;GQ_MEAN=172;HRun=1;MLEAC=2;MLEAF=1;MQ=60;NCC=0;OND=0;QD=28.77;SOR=1.278;VariantType=SNP GT:AD:DP:GQ:MLPSAC:MLPSAF:PL 1/1:0,58:58:99:2:1:1697,172,0
chrM 195 . C T 2528.77 PASS ABHom=1;AC=2;AF=1;AN=2;DP=57;ExcessHet=3.0103;FS=0;GQ_MEAN=181;HRun=1;MLEAC=2;MLEAF=1;MQ=60;NCC=0;OND=0;QD=34.24;SOR=0.765;VariantType=SNP GT:AD:DP:GQ:MLPSAC:MLPSAF:PGT:PID:PL 1/1:0,56:56:99:2:1:1|1:195_C_T:2557,181,0
chrM 199 rs72619362 T C 2462.77 PASS ABHom=1;AC=2;AF=1;AN=2;DB;DP=58;ExcessHet=3.0103;FS=0;GQ_MEAN=175;HRun=1;MLEAC=2;MLEAF=1;MQ=60;NCC=0;OND=0;QD=30.63;SOR=0.877;VariantType=SNP GT:AD:DP:GQ:MLPSAC:MLPSAF:PGT:PID:PL 1/1:0,57:57:99:2:1:1|1:195_C_T:2491,175,0
chrM 302 . AC A 673.9 PASS AC=1;AF=0.5;AN=2;BaseQRankSum=-3.369;ClippingRankSum=0;DP=53;ExcessHet=3.0103;FS=37.781;GQ_MEAN=14;HRun=8;MLEAC=1;MLEAF=0.5;MQ=60;MQRankSum=0;NCC=0;QD=18.21;ReadPosRankSum=2.49;SOR=2.547;VariantType=DELETION.NumRepetitions_8.EventLength_1.RepeatExpansion_C GT:AD:DP:GQ:MLPSAC:MLPSAF:PL 0/1:7,30:37:14:1:0.5:711,0,14
chrM 410 . A T 2225.77 PASS ABHom=1;AC=2;AF=1;AN=2;DP=76;ExcessHet=3.0103;FS=0;GQ_MEAN=228;HRun=3;MLEAC=2;MLEAF=1;MQ=60;NCC=0;OND=0;QD=29.29;SOR=0.693;VariantType=SNP GT:AD:DP:GQ:MLPSAC:MLPSAF:PL 1/1:0,76:76:99:2:1:2254,228,0
chrM 491 . T C 698.77 PASS ABHom=1;AC=2;AF=1;AN=2;DP=26;ExcessHet=3.0103;FS=0;GQ_MEAN=77;HRun=0;MLEAC=2;MLEAF=1;MQ=60;NCC=0;OND=0;QD=26.88;SOR=2.67;VariantType=SNP GT:AD:DP:GQ:MLPSAC:MLPSAF:PL 1/1:0,26:26:77:2:1:727,77,0
chrM 2354 . C T 1336.77 PASS ABHom=1;AC=2;AF=1;AN=2;DP=53;ExcessHet=3.0103;FS=0;GQ_MEAN=153;HRun=1;MLEAC=2;MLEAF=1;MQ=59.91;NCC=0;OND=0;QD=25.71;SOR=1.358;VariantType=SNP GT:AD:DP:GQ:MLPSAC:MLPSAF:PL 1/1:0,52:52:99:2:1:1365,153,0
chrM 2485 . C T 1114.77 PASS ABHom=1;AC=2;AF=1;AN=2;DP=42;ExcessHet=3.0103;FS=0;GQ_MEAN=124;HRun=0;MLEAC=2;MLEAF=1;MQ=43.82;NCC=0;OND=0;QD=26.54;SOR=1.127;VariantType=SNP GT:AD:DP:GQ:MLPSAC:MLPSAF:PL 1/1:0,42:42:99:2:1:1143,124,0
chrM 5581 . C T 603.77 PASS ABHom=1;AC=2;AF=1;AN=2;DP=22;ExcessHet=3.0103;FS=0;GQ_MEAN=65;HRun=0;MLEAC=2;MLEAF=1;MQ=34.26;NCC=0;OND=0;QD=27.44;SOR=6.273;VariantType=SNP GT:AD:DP:GQ:MLPSAC:MLPSAF:PL 1/1:0,22:22:65:2:1:632,65,0
chrM 9378 . G A 185.8 PASS ABHom=1;AC=2;AF=1;AN=2;DP=7;ExcessHet=3.0103;FS=0;GQ_MEAN=21;HRun=0;MLEAC=2;MLEAF=1;MQ=42.61;NCC=0;OND=0;QD=26.54;SOR=0.941;VariantType=SNP GT:AD:DP:GQ:MLPSAC:MLPSAF:PL 1/1:0,7:7:21:2:1:214,21,0
chrM 10401 . C T 1832.77 PASS ABHom=1;AC=2;AF=1;AN=2;DP=63;ExcessHet=3.0103;FS=0;GQ_MEAN=190;HRun=0;MLEAC=2;MLEAF=1;MQ=57.98;NCC=0;OND=0;QD=29.09;SOR=1.573;VariantType=SNP GT:AD:DP:GQ:MLPSAC:MLPSAF:PL 1/1:0,63:63:99:2:1:1861,190,0

I am currently working on a Linux server with 40 threads and dual core, 96GB RAM, 1 TB internal and 4 TB external storage, running the Ubuntu operating system. Thanks.

at7 commented 6 years ago

Your input data is not phased. Phased genotypes use '|' as the separator, not '/'. For example, 1|1 is phased and 1/1 is unphased. You can find more information here: https://samtools.github.io/hts-specs/VCFv4.1.pdf

You could try running a phasing algorithm and then rerun Haplosaurus on your phased data.
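A quick way to check this before a long run is to count how many records have a phased GT. The snippet below is a minimal sketch with hypothetical function names. It assumes a single-sample VCF whose fields contain no spaces, and it inspects only the GT key, so a '|' in another key such as PGT does not count.

```python
def genotype_phase(sample_field: str, format_field: str) -> str:
    """Classify the GT entry of one sample column."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    if "GT" not in keys:
        return "haploid_or_missing"
    genotype = values[keys.index("GT")]
    if "|" in genotype:
        return "phased"
    if "/" in genotype:
        return "unphased"
    return "haploid_or_missing"


def count_phasing(vcf_lines):
    """Tally phased and unphased genotypes for the first sample column."""
    counts = {"phased": 0, "unphased": 0, "haploid_or_missing": 0}
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        columns = line.split()  # splits on tabs or spaces
        if len(columns) < 10:
            continue  # no FORMAT/sample columns on this line
        # Column 9 is FORMAT, column 10 is the first sample (1-based).
        counts[genotype_phase(columns[9], columns[8])] += 1
    return counts


example = [
    "chrM 195 . C T 2528.77 PASS DP=57 GT:AD:DP:GQ:MLPSAC:MLPSAF:PGT:PID:PL "
    "1/1:0,56:56:99:2:1:1|1:195_C_T:2557,181,0",
]
result = count_phasing(example)
```

The example record is abbreviated from the excerpt above (INFO shortened to DP=57). Its GT is 1/1, so it is counted as unphased even though its PGT value contains '|'.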

vivekruhela commented 6 years ago

@at7: My bad. Sorry for the confusion. I thought '/' stood for phased. I'll definitely try a phasing algorithm and then run Haplosaurus again. Thanks.

at7 commented 6 years ago

No worries. I will close the ticket for now.