broadinstitute / warp

WDL Analysis Research Pipelines
https://broadinstitute.github.io/warp
BSD 3-Clause "New" or "Revised" License
200 stars 93 forks source link

parsing error in WGS single sample workflow #1331

Open ekiernan opened 1 month ago

ekiernan commented 1 month ago

This was posted by an external user:

I am harmonizing WGS data for the GREGoR consortium using the WGS single sample workflow (in dragen_mode) on AnVIL. I've hit a parsing error when reprocessing a subset of the consortium data (see attached log file).

In brief, the data, which throws the error, was similarly pre-processed and uploaded to AnVIL in CRAM format. I successfully converted (passed ValidateSam) these crams to ubams and used this as input for the WGS single sample workflow. As you can see in the log, the error comes up during alignment with DRAGMAP. It first prints "When maskLen < 15, the function ssw_align doesn't return 2nd best alignment information." and then throws a parsing error.

Could you please take a look and let me know if this looks familiar or have any additional insights on how to troubleshoot.

Update: Are there processing steps that are incompatible with REprocessing with WARP. For instance, I know that for these samples, reads were trimmed and duplicates dropped in the initial processing. Could this be causing the error...

ekiernan commented 1 month ago

We asked the User to:

Focus on is the line "[W::sam_read1] Parse error at line 204289". That's an error that comes from samtools and seems to me that there is something wrong with the input file. It's impossible to say without having the data available, but given that it prints out the line number it shouldn't be too hard to figure out what's odd about that particular line. If they don't want to run the entire file then they can just run it on a subset of the reads, with that problematic line included of course.

The User replied and noticed:

"Read pairs are different lengths which is why I asked about trimming. I can dig to see if this happens consistently across all failures.

Parsing error was for read -

A01139:184:HK5MGDSX7:1:1101:28013:4131 A01139:184:HK5MGDSX7:1:1101:28013:4131 77 * 0 0 * * 0 0 CTTCTCCATCCAAAGGAATGTTCAGCTCTGTGTGTTAAACTCAATCATCACAAAGTATTTTCTGAGAATGCTTCTGTCTAGATTTTATGTGAAGCTCTTCCCTTTACTACCATAGGCCTCAAAGCGCTCCAAATCTCCACTAGCCGATTCT ?BDBC@?AA@?ABBAAABA?@B@AA@A@A@@@A@ACABC>BABCBABBAB?BCCBA;BCCCABABBBCBAABCABAAABABBBCCCABAAABCBABBBCAAACDCA@CA?@CCBBBBACACDDCB;BCBACDDCBCBAC@CBCBA;DDDBB RG:Z:RGL001 XS:i:151

A01139:184:HK5MGDSX7:1:1101:28013:4131 141 * 0 0 * * 0 0 GTAGAATCGGCTAGTGGAGATTTGGAGCGCTTTGAGGCCTATGGTAGTAAAGGGAAGAGCTTCACATAAAATCTAGACAGAAGCATTCTCAGAAAATACTTTGTGATGATTGAGTTTAACACACAGAGCTGAACATTCCTTTGGATGG >@ABC;@@9A?@?@@?AAA@@AA?@@A@9@AB;@AAA@?A@A@A@@A@@BBAAAA9AAA@AB@A>AA@BBBA@A@AA>AAABA@AAB@A@AAABBBA@=ABB@@@A8@@AA?AA@BB@B>@=@=@@@6?@?@A=@AB@>@BB@5@@@@ RG:Z:RGL001 XS:i:148 "

The User also mentioned that the mismatch in read pairs isn't consistent across parsing errors.

michaelgatzen commented 1 month ago

Are we sure that this is the line that causes issues? I don't know if "line 204289" refers to the line number with or without header lines. What happens if they only pass that read pair into the aligner? Does it fail?

If not, does it fail if they pass plus/minus 10,000 read pairs around line 204289 into the aligner?

kachulis commented 1 month ago

the other thing I'd suggest is to split apart the individual parts in the dragmap --> samtools --> mergebamalignments pipe in SamToFastqAndDragmapAndMba and run each piece separately. So just run dragmap on the whole file first, and then if that completes successfully, run the resulting file through samtools view. Presumably that will hit this error, given the samtools view error message mentioned in the logs you emailed. At that point it should be easier to look at the read causing the issue and see if there is some obvious problem. Note that the reads mentioned above are pre-alignment, as in the reads that went into dragmap, not what actually went into samtools, so probably not exactly the data in the form it was when it caused the error.

If there's nothing obvious there, the fastest way to find the solution is probably to run the data through samtools in a debugger. Depending on the user's familiarity with the samtools source code and/or software debugging in general, that may or may not be feasible. If they get to the debugging step but don't feel that's something they can do, we could try to take a look if we can find the time, if they are able to share the dragmap output bam with us (that's a big "if").

kachulis commented 1 month ago

ultimately, this will probably end up pointing us to some bug in dragmap, though, so will likely either need a workaround, or illumina to fix

mmwheel commented 1 month ago

Thanks for the comments/suggestions. I've passed only the read pair corresponding to line 204289 to the workflow as well as slices of the ubam (including one with X read pairs around line 204289). All these attempts aligned successfully. After looking at log files across ubam shards and across samples the parsing error comes up around line 200K which feels like resource limitation to me.