cfe-lab / MiCall

Pipeline for processing FASTQ data from an Illumina MiSeq to genotype human RNA viruses like HIV and hepatitis C
https://cfe-lab.github.io/MiCall
GNU Affero General Public License v3.0
14 stars 9 forks source link

Sample breaks csf2counts #86

Closed donkirkby closed 10 years ago

donkirkby commented 10 years ago

50791A-RT3-only-pos084-V3LOOP-2_S86 in run 130916. Here's the stack trace:

Traceback (most recent call last):
  File "/opt/eclipse44/plugins/org.python.pydev_3.6.0.201406232321/pysrc/pydevd.py", line 1845, in <module>
    debugger.run(setup['file'], None, None)
  File "/opt/eclipse44/plugins/org.python.pydev_3.6.0.201406232321/pysrc/pydevd.py", line 1373, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/mnt/data/don/git/MiseqPipeline/csf2counts.py", line 491, in <module>
    main()
  File "/mnt/data/don/git/MiseqPipeline/csf2counts.py", line 374, in main
    inserts)
  File "/mnt/data/don/git/MiseqPipeline/csf2counts.py", line 192, in write
    query_end = max(qindex_to_refcoord.keys())
ValueError: max() arg is an empty sequence

The error occurs when trying to process the INT region.

Update

Reopened because error still occurs when processing region HIV1B-gag in sample 46892B-3515-HLA-B-E99605CLIMX-PR-RT_S71 on 22 May 2014.

ArtPoon commented 10 years ago

No repro on my workstation.

ArtPoon commented 10 years ago

The problem seems to stem from using different versions of bowtie2. Here are excerpts from *.prelim.csv, with seq and qual fields omitted:

Art (bowtie 2.0.6)
M01841:22:000000000-A3RCA:1:1110:16290:10899,81,INT,743,0,151S100M,=,1625,0
M01841:22:000000000-A3RCA:1:1110:16290:10899,161,INT,1625,44,97M154S,=,743,0
M01841:22:000000000-A3RCA:1:2103:12283:26425,97,INT,523,0,62M189S,=,1397,0
M01841:22:000000000-A3RCA:1:2103:12283:26425,145,INT,1397,44,183S5M1I62M,=,523,0
M01841:22:000000000-A3RCA:1:2110:27229:19935,99,INT,865,11,110S141M,=,865,361
M01841:22:000000000-A3RCA:1:2110:27229:19935,147,INT,865,11,33S218M,=,865,-361
M01841:22:000000000-A3RCA:1:2111:12490:3895,83,INT,1498,0,115S136M,=,1498,366
M01841:22:000000000-A3RCA:1:2111:12490:3895,163,INT,1498,0,136M115S,=,1498,-366

Don (bowtie 2.2.3)
M01841:22:000000000-A3RCA:1:1110:16290:10899,81,INT,1622,0,151S100M,=,746,0
M01841:22:000000000-A3RCA:1:1110:16290:10899,161,INT,746,0,97M154S,=,1622,0
M01841:22:000000000-A3RCA:1:2103:12283:26425,97,INT,1402,0,62M189S,=,1397,0
M01841:22:000000000-A3RCA:1:2103:12283:26425,145,INT,1397,0,183S5M1I62M,=,1402,0
M01841:22:000000000-A3RCA:1:2110:27229:19935,99,INT,865,11,110S141M,=,865,361
M01841:22:000000000-A3RCA:1:2110:27229:19935,147,INT,865,11,33S218M,=,865,-361
M01841:22:000000000-A3RCA:1:2111:12490:3895,83,INT,1498,0,115S136M,=,1498,366
M01841:22:000000000-A3RCA:1:2111:12490:3895,163,INT,1498,0,136M115S,=,1498,-366

Note that the first three reads are being mapped to different positions in the INT ref. What is also strange is that several positions exceed the length of INT - and the answer is that the reference sequence from the lab has dollar signs in the sequence, and in fact is a concatenation of two INT sequences..

donkirkby commented 10 years ago

We seem to have the same stack trace when processing sample 46892B-3515-HLA-B-E99605CLIMX-PR-RT_S71 on 22 May 2014, so I'm reopening the issue. When I look at the dumped query sequence and the reference sequence for HIV1B-gag, the query sequence has a lot of question marks in it, and is shifted all the way past the end of the reference sequence. If I try to manually align it with the reference sequence, it looks like it should start after the seventh amino acid in the reference sequence.

ArtPoon commented 10 years ago

Fixed with commit 6849d56f09346fbc64e5f88a14b8b7324d8f292f