BilkentCompGen / hercules

Profile HMM-based hybrid error correction algorithm for long reads
BSD 3-Clause "New" or "Revised" License
20 stars 4 forks source link

Error running Hercules #10

Open sagnikbanerjee15 opened 5 years ago

sagnikbanerjee15 commented 5 years ago

Hello,

I mapped all my paired-ended short reads to the erroneous long reads using BWA aligner. I used markdup from samtools to remove duplicates from the alignment file. I provided the new bam file as input to hercules but the program kept failing. This is the error I keep getting. Please help me with this error.

Long Read: /work/LAS/rpwise-lab/sagnik/PacBio_SBR/data/7-Dai_S1-ns_S4-ns_053017_40/A01_1/Analysis_Results/A01_1_fungal_subreads.fasta Short read: /work/LAS/rpwise-lab/sagnik/PacBio_SBR/data/7-Dai_S1-ns_S4-ns_053017_40/A01_1/Analysis_Results/hercules_output_directory/preprocessing/short.fasta Alignment File: /work/LAS/rpwise-lab/sagnik/PacBio_SBR/data/7-Dai_S1-ns_S4-ns_053017_40/A01_1/Analysis_Results/hercules_output_directory/output_alignment.bam Output file: /work/LAS/rpwise-lab/sagnik/PacBio_SBR/data/7-Dai_S1-ns_S4-ns_053017_40/A01_1/Analysis_Results/hercules_output_directory/corrected_long.fasta Will output coverage?: No Min mapping quality: 0 Max Coverage: 1 Filter size: 100 Maximum insertion: 3 Maximum deletion: 10 Match transition probability: 0.7 Insertion transition probability: 0.25 Deletion transition probability: 0.05 Deletion transition factor: 2.5 Match emission probability: 0.97 Mismatch emission probability: 0.01 Insertion emission probability: 0.333333 Max thread: 16 Correction has begun... terminate called after throwing an instance of 'std::invalid_argument' what(): stoi

Here are the first few alignments from the file which I provided hercules with.

M01479:193:000000000-B8T8G:1:1103:14560:9768 2145 m170530_201923_42275_c101193402550000001823271609291700_s1_p0/10/7993_9349 1240 0 236H21M1D16M1I18M8H m170530_201923_42275_c101193402550000001823271609291700_s1_p0/10/10133_10782 543 0 AACACCATAGTGAGGGAAACCACCCATTGGGGTAATGTTTTTCTCGGTCCTATCAA FF=EE8CGGGGDCAFGEG56=CEGF=:0:>CFCFG+;C5C+;<C>=BGF9: NM:i:2 MD:Z:21^A34 MC:Z:12H57M231H AS:i:41 XS:i:27 SA:Z:m170530_201923_42275_c101193402550000001823271609291700_s1_p0/116808/5334_5835,284,+,45S52M1I52M2D7M2D19M1D15M2D43M66S,0,15;m170530_201923_42275_c101193402550000001823271609291700_s1_p0/27722/18865_19096,30,-,217S71M1D12M,0,2; M01479:213:000000000-BHVWB:1:1106:27562:8160 2121 m170530_201923_42275_c101193402550000001823271609291700_s1_p0/10/7993_9349 1240 21 244H21M1D17M1I17M = 1240 0 AACACCATAGTGACGGAAACCACCCATTGGGGTAATGTCTATCTCGGTCCTATCAA ,,4=EG=,:,4,=+/1==8::5/:C7<+1)CCDFG0:>:>71CD5CF46/C NM:i:4 MD:Z:13G7^A18T15 AS:i:31 XS:i:18 SA:Z:m170530_201923_42275_c101193402550000001823271609291700_s1_p0/116808/3121_3648,250,+,10S15M1D11M5D11M1D8M1D7M1D23M1I52M1D3M2D24M1D10M1D24M1I28M1D14M58S,21,25; HWI-ST412:228:C0E8MACXX:7:2216:11095:94480 177 m170530_201923_42275_c101193402550000001823271609291700_s1_p0/10/7993_9349 1240 0 6S16M1D22M1I21M34S m170530_201923_42275_c101193402550000001823271609291700_s1_p0/10/17791_18435 107 0 CTTAACAACACCATAGTGAGGGAACCCACCCATTGGGGTAATGTCTTTCTCGGTCCTATCAACCTCAGTAAGAGCAGAGTGAGATTCATCTCCAGCTTTG ADC?9?82>4AC>@@?3@53983'976;;E?AAD@E@E@F<GGBHDIIGHFD??AGD?)FDIIJJIIIGIFFFCHHFAHEEGB?BHAF??D=FB@8 NM:i:3 MD:Z:16^A4A38 AS:i:40 XS:i:40 MQ:i:21 MC:Z:11S46M1I17M2D25M ms:i:3378 HWI-ST412:228:C0E8MACXX:7:2312:1797:95693 128 m170530_201923_42275_c101193402550000001823271609291700_s1_p0/10/9398_10080 88 25 4M1D50M1I26M19S 0 0 ATTACCCCAATAGGTGGTTTCCCTCACTATGGTGTTGTTAAGGATGATTATATCATGGTAAAGGGATGCTGCGTTGGTCCCAAGAAAAGGGTTATTACAT ?B@DBDDFHHHFHGGII@CFHIHGIDGEGEGICFHFHE?CFDGFH@3?BFFGIIJCAF)=@GGH=;CGHG7=?=8?CCE@?A5(,9>@82<9::>4:> NM:i:7 MD:Z:4^C7G37T0A8G10T9 MC:Z:41S33M1D13M13S AS:i:44 XS:i:40 XA:Z:m170530_201923_42275_c101193402550000001823271609291700_s1_p0/10/12892_13532,-492,29S43M2D5M1I22M,6;m170530_201923_42275_c101193402550000001823271609291700_s1_p0/10/17791_18435,+107,29S46M1I16M8S,4; M01479:193:000000000-B8T8G:1:2102:27010:8568 2179 m170530_201923_42275_c101193402550000001823271609291700_s1_p0/10/9398_10080 93 6 5H50M1I26M1D9M2D7M1D8M1D33M1D13M146H m170530_201923_42275_c101193402550000001823271609291700_s1_p0/23529/11583_12033 304 0CCCCAATAGGTGGTTTCCCTCACTATGGTGTTGTTAAGGATGATTATATCATGGTCAAGGGATGCTGCGTTGGTCCCAAGAAAAGGGTTATTACATTGCGCCAATCCCTCCTTAAGCAGACTTCTCGTGTTGCTCTGGAGGAGATCA GFGGGGGGGGAFGGGGGGGGGGGEFGGGGGGGGGGGGGFFGGGGGGGGAFGCFFGGGGFFCCEFFDAFGGFC,CDFFCGF?FGFGDCGGGGGFGGGGF9=FGCCFGFCGFDGGD<EEGFDEF<BEFEG?FFGGGGGAFF=FEEGGGG NM:i:17 MD:Z:7G37T0A7A0G10T9^G5G3^GT2C4^G2G5^T33^G2T10 MC:Z:52M1I10M1I44M109H AS:i:53 XS:i:46 SA:Z:m170530_201923_42275_c101193402550000001823271609291700_s1_p0/23529/11583_12033,303,-,80S53M1I10M1I44M109S,19,5; XA:Z:m170530_201923_42275_c101193402550000001823271609291700_s1_p0/10/12892_13532,-492,226S43M2D5M1I23M,5; M01479:193:000000000-B8T8G:1:2113:28847:9646 161 m170530_201923_42275_c101193402550000001823271609291700_s1_p0/10/9398_10080 93 2 11S49M1I27M1D9M2D7M1D7M1D34M1D12M14S m170530_201923_42275_c101193402550000001823271609291700_s1_p0/10/17791_18435 107 0GAAAGACATTACCCCAATGGGTGGTTTACCTCCCTATGGTGTTGTTAAGGATGCTTTTATCCTGGTCAAGGGATGCTGTGTTGGTCCCCAGTAGAGGGTCATTACATTGCGTCAATCCCTCCTTAAGCAGACTTCTCGTGTTGCTCTTGAGGAGATCCAACTCACGTTTAT A8A<-C-=;C-=CFF@@<6,6;6@,C<C6C,,,8;,<C<6;CFGF,,CEC@@;,6,C<<,C,C6,CFGAG8:,,B@F<C<CE,6@,9,,:B6,,,,9@BFC6F9,6C,9,BC+BB,54,,=C@5,5E,C4<,,95,C?F8FC8,E<FGGG8,,4EF,,8<:,A,,448A@F NM:i:16 MD:Z:16C4A20A3A7A0G20^G0A2A5^GT7^G2G4^C34^G12 AS:i:57 XS:i:53 XA:Z:m170530_201923_42275_c101193402550000001823271609291700_s1_p0/116808/3121_3648,-364,22M3D27M1D10M1D23M1I26M62S,11;m170530_201923_42275_c101193402550000001823271609291700_s1_p0/10/12892_13532,-490,91S45M2D7M1I27M,6; MQ:i:19 MC:Z:36S44M1I19M2D25M3I9M1D34M ms:i:3921 M01479:213:000000000-BHVWB:1:2106:9239:13766 2177 m170530_201923_42275_c101193402550000001823271609291700_s1_p0/10/9398_10080 93 5 235H47M19H m170530_201923_42275_c101193402550000001823271609291700_s1_p0/116808/3121_3648 298 0 CCCCAATAGGGGGTTTCCCTCACTGTGGTGTTGTTAAGGATGATTTA E9@0910)=CD+31>:/7+2+)2.)877)A@***)+ NM:i:3 MD:Z:7G2T13A22 MC:Z:4M1D7M1D23M1I52M3D27M1D10M1D24M1I28M1D14M110H AS:i:32 XS:i:28 SA:Z:m170530_201923_42275_c101193402550000001823271609291700_s1_p0/45364/1448_2680,552,+,48S8M1D45M1I7M1I18M1I2M1I5M1I14M1I3M1I8M1I3M1I18M113S,5,14; XA:Z:m170530_201923_42275_c101193402550000001823271609291700_s1_p0/10/12892_13532,-493,40M2D7M1I35M218S,11; M01479:193:000000000-B8T8G:1:2102:27010:8568 2163 m170530_201923_42275_c101193402550000001823271609291700_s1_p0/10/9398_10080 93 6 5H50M1I26M1D9M2D7M1D8M1D33M1D13M65H m170530_201923_42275_c101193402550000001823271609291700_s1_p0/23529/11583_12033 303 0CCCCAATAGGTGGTTTCCCTCACTATGGTGTTGTTAAGGATGATTATATCATGGTCAAGGGATGCTGCGTTGGTCCCAAGAAAAGGGTTATTACATTGCGCCAATCCCTCCTTAAGCAGACTTCTCGTGTTGCTCTGGAGGAGATCA GGFGFGGGGGGGGFGGEF@FFFGFGGGGGGGGGGGGGGGGGGGGGGGGGGFDGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGDGGGCFGGCGGGGGGFGGGGGGGGFFDGEGEFGGFGGFFGFGGGF<GCGGG NM:i:17 MD:Z:7G37T0A7A0G10T9^G5G3^GT2C4^G2G5^T33^G2T10 MC:Z:80H53M1I10M1I44M109H AS:i:53 XS:i:46 SA:Z:m170530_201923_42275_c101193402550000001823271609291700_s1_p0/23529/11583_12033,304,+,52M1I10M1I44M109S,18,5; XA:Z:m170530_201923_42275_c101193402550000001823271609291700_s1_p0/10/12892_13532,+492,145S43M2D5M1I23M,5; M01479:193:000000000-B8T8G:1:2105:18189:15115 2161 m170530_201923_42275_c101193402550000001823271609291700_s1_p0/10/9398_10080 93 0 75H50M1I26M1D7M2D9M1D8M1D33M1D13M78H m170530_201923_42275_c101193402550000001823271609291700_s1_p0/137563/28414_29073 358 0 CCCCAATAGGTGGTTTCCCTCACTATGGTGTTGTTAAGGATGATTATATCATGGTCAAGGGATGCTGCGTTGGTCCCAAGAAAAGTGTTATTACATTGCGCCAATCCCTCCTTAAGCAGACTTCTCGTGTTGCTCTTGAGGAGATCA E89E6,:2@FGD,CGGGFB9@,CF:B>+D7EGGFF;FC,FE=C<9F=E<GGFC<FDCFFFFDCBGFF=GFGGFCFFFCFCF9GEE?GFFCGFDGGGGECB,=E<F77EF=E?C,GFFC,EEGGEGGGGFGGGFC@C,FCEDEGGFE< NM:i:16 MD:Z:7G37T0A7A0G10T9^G5G1^GG4C4^G2G5^T33^G13 MC:Z:85H26M1D11M2I71M1I40M1D65M AS:i:58 XS:i:57 SA:Z:m170530_201923_42275_c101193402550000001823271609291700_s1_p0/23529/11583_12033,300,+,9S56M1I10M1I44M179S,24,4; XA:Z:m170530_201923_42275_c101193402550000001823271609291700_s1_p0/10/10133_10782,+504,163S38M1D56M43S,7;m170530_201923_42275_c101193402550000001823271609291700_s1_p0/10/12892_13532,+492,158S43M2D5M1I35M58S,6; M01479:193:000000000-B8T8G:1:2116:16308:25105 2209 m170530_201923_42275_c101193402550000001823271609291700_s1_p0/10/9398_10080 94 0 38M263H m170530_201923_42275_c101193402550000001823271609291700_s1_p0/116808/2746_3077 2 0 CTCAATGGGTGGTTTCCCTCACTATGGTGTTGTTAAGG --6A-CC9-CF78DF9,;@C8EE<FA<EFGG,EFCEF? NM:i:1 MD:Z:1C36 MC:Z:18H38M1I39M1D26M1D9M1D15M1D15M1D5M1D15M1D7M1I7M1I27M3I58M15H AS:i:36 XS:i:29 SA:Z:m170530_201923_42275_c101193402550000001823271609291700_s1_p0/116808/5334_5835,284,-,79S49M1I54M118S,0,9;m170530_201923_42275_c101193402550000001823271609291700_s1_p0/10/17791_18435,107,+,24S46M1I17M2D25M3I9M1D40M136S,0,19; XA:Z:m170530_201923_42275_c101193402550000001823271609291700_s1_p0/10/17086_17740,-534,263S29M9S,0;

Thank you.

canfirtina commented 5 years ago

Dear Sagnik,

Thanks for using Hercules. You should align compressedshort.fasta to compressed A01_1_fungal_subreads.fasta . Both of these files should be present in your preprocessing folder unless you specified "-noCompression" option. We do not suggest running hercules with an uncompressed data as it will dramatically reduce the error correction performance of hercules. noCompression option is usually for experimental purposes. Do you have specific reasons to align the uncompressed set of short reads to the long reads? (e.g., insufficient disk space etc.)

Thanks,

sagnikbanerjee15 commented 5 years ago

Hello,

Thanks for your reply. I wish to skip the mapping step since I already have mapped the short reads to erroneous long reads. This step takes quite a while since there are a very large number of short reads. So I tried running the second step of hercules providing it with the bam file from a previous alignment.

Thank you.

canfirtina commented 5 years ago

Unfortunately that will not work without modifying your alignment file, which may be done in two steps. First, regardless of whether you choose noCompression option or not, what Hercules does in preprocessing step is basically it changes the read ids in fasta/fastq files such that the read ids contain only some digits, not any alphabetical/special characters, and it saves the compressed/uncompressed sequences along with these numeric read ids to the preprocessing folder. These numbers represent the order of the reads as they appear in the actual long read file. For example the new read id of the first read in your long read fasta/fastq file will be "0", the id of the second read will be "1" and so on... Thus, you should change the read ids present in "REF" field so that REF field will only contain the numbers that will "point" to the actual long read in your original fasta file. Pointing means writing the order of these reads instead of their actual read ids. Hercules will find the actual read by looking at that number. That is why you are getting "stoi" error as it tries to convert the read id to an integer (this is one of the assumptions that Hercules makes for some optimization reasons). Second, are these reads paired-end? If so, you should merge the paired-end short reads into one file. Then you should rename the ids again of this file so that the read ids will only contain digits based on their order, again. You should also change the "QNAME"s in the alignment file, similarly. You should be careful about paired alignment as you used BWA-MEM (it writes the paired-end alignment information to "FLAG" to tell you that if it was the secondary alignment or not). I would suggest you to generate the preprocessing files using a very small set of reads to see how the fasta files that Hercules generates look like to get the general idea about this renaming step.

Hercules handles all these steps that I explained above by itself in the preprocessing step and ensures that everything looks fine after the alignment step as well. I've already got some feedbacks about the flexibility of the preprocessing step and I intend to make the preprocessing step more flexible so that when one runs into a situation like you are experiencing right now, Hercules should still be able to correct the reads as much as it can...

sagnikbanerjee15 commented 5 years ago

Thank you very much. I will try it out. It would be great if you could provide a method to convert bam files to a valid format in a future release.