amplab / snap

Scalable Nucleotide Alignment Program -- a fast and accurate read aligner for high-throughput sequencing data
https://www.microsoft.com/en-us/research/project/snap/
Apache License 2.0
287 stars 66 forks source link

feature request: option to move fastq comments to sam tags #146

Closed eboyden closed 1 year ago

eboyden commented 2 years ago

Hi, I was wondering if an option to move the fastq comment (everything after the first whitespace, including additional whitespaces) to a field in the output sam file could be added. If the fastq comments were properly formatted as tab-delimited sam tags, then this would result in a properly formatted sam file with all metadata preserved. This is similar functionality to the BWA MEM -C and Bowtie2 --sam-append-comment options, and is useful for e.g. transferring UMI information from fastqs to sam files without relying on additional files and formatting steps. Thanks!

matthdsm commented 2 years ago

Hi,

is there an update for this issue? We've got some fastq's with UMI data in sam comments we'd like to be able to align.

Thanks

bolosky commented 2 years ago

I haven't done any work on it, but it's probably not all that hard to do. It may have a performance effect because it will most likely require dynamic memory allocation, which we studiously avoid in the main loop, but since it's optional that's OK. I imagine that any degradation would be far less than a post-processing stage.

I'll think about it more.

matthdsm commented 2 years ago

Thanks for the reply! Great to know you're keeping it in mind. Looking forward to what the next release might bring 😄

bolosky commented 2 years ago

How important is it for you to have this supported for BAM output? With SAM I can just copy the string, but for BAM I'd have to parse it to break it down into individual tags, which is a pain.

I'm planning on having it rely on the input being correct like BWA-mem does, so if you feed it a FASTQ file with comments that don't look like SAM optional fields it will just blindly copy them and generate broken SAM. I presume that's also OK.

eboyden commented 2 years ago

Definitely okay to blindly copy like BWA and Bowtie2 do; user beware if they misformat the fastq comment. It would be preferable if it could output BAM; if you just blindly copy the entire string whitespace and all, requiring the user to have formatted it properly, do you still have to parse the individual tags?

bolosky commented 2 years ago

Yeah, I have to parse for BAM. BAM isn't just compressed SAM, it's got a little more structure, and part of that is that each tag has its own element.

Not that big of a deal, I'd just hoped to avoid doing it.

eboyden commented 2 years ago

Well beggars can't be choosers. I'd love BAM output but I'm not the one doing the work, and there are workarounds (SAM>BAM) if you decide it's too much effort and not high enough priority. Much appreciate whatever you're willing to do.

bolosky commented 2 years ago

I've already put in the effort to plumb it through the parsing is only a few hours to implement. And using samtools for SAM->BAM is often slower than FASTQ->BAM with SNAP, so not having it kind of defeats the point.

bolosky commented 1 year ago

How much do you need the "B" (binary array) tag type? It's kind of a pain to convert to BAM, so I'm inclined to leave it unimplemented unless you need it.

eboyden commented 1 year ago

Not only have I never needed it, I'm not sure I've ever even seen a tag of that format. So I say don't bother. If someone needs it down the road, it can be addressed then.

matthdsm commented 1 year ago

Same here!

bolosky commented 1 year ago

I just pushed the code for this into the dev branch. Check it out and let me know if it works for you.

To get the behavior, you'll need to include the new -pfc (preserve FASTQ comments) switch. Otherwise, you get the old behavior.

It's in 2.0.2.dev.4.

bolosky commented 1 year ago

Please let me know if this works. We're thinking about pushing it to master, but it would be nice to have feedback before we do.

eboyden commented 1 year ago

Yes, sorry - will do so by Monday

eboyden commented 1 year ago

It seems to work great when outputting to SAM. When outputting to BAM, however, it runs to completion without error but produces a corrupt file; I get this error message when attempting to read the file with Samtools:

[E::sam_format1_append] Corrupted aux data for read MN01688:12:000H3MG5N:1:11102:4839:1176
samtools view: writing to standard output failed: Invalid argument
bolosky commented 1 year ago

It works on my machine with my input/index (not that the index is likely to matter).

Would you be willing to send me an example fastq to try?

eboyden commented 1 year ago

Interleaved fastq plus a (good) sam and a (corrupt) bam (which I had to zip to get it to upload) NTC.pairedInterleaved.fastq.gz NTC.pairedInterleaved.sam.gz NTC.pairedInterleaved.bam.gz

bolosky commented 1 year ago

Try 2.0.2.dev.7. Turns out I didn't fix the used buffer size properly for paired read output. I'd only tested it for single, which worked fine.

eboyden commented 1 year ago

Interestingly, that small example I uploaded ran fine; however a substantially larger fastq did not, and samtools threw a similar error as before:

Error reading input file
[E::bam_aux_get] Corrupted aux data for read MN01688:12:000H3MG5N:1:12108:23198:14782
[E::bam_aux_get] Corrupted aux data for read MN01688:12:000H3MG5N:1:12108:23198:14782
eboyden commented 1 year ago

It seems like in certain situations, especially on larger files, SNAP is just silently producing a truncated bam (regardless of whether either -so or -pfc are turned on or not). Often this bam apparently includes an EOF marker as the bam passes samtools quickcheck. I have anecdotally heard from another user that he's encountered this phenomenon as well (on version 2.0.0), especially on larger BAMs (but for whatever reason it does not seem to be an issue when outputting SAMs, again regardless of either -so or -pfc).

bolosky commented 1 year ago

Well, that sounds bad.

If someone can send me an example input, I'll look and see what's happening.

Of course, I didn't write the BAM code and the person who did doesn't work here anymore. But I mostly understand it, I think.

From: eboyden @.> Sent: Tuesday, October 4, 2022 5:47 PM To: amplab/snap @.> Cc: Bill Bolosky @.>; Comment @.> Subject: Re: [amplab/snap] feature request: option to move fastq comments to sam tags (Issue #146)

It seems like in certain situations, especially on larger files, SNAP is just silently producing a truncated bam (regardless of whether either -so or -pfc are turned on or not). Often this bam apparently includes an EOF marker as the bam passes samtools quickcheck. I have anecdotally heard from another user that he's encountered this phenomenon as well (on version 2.0.0), especially on larger BAMs (but for whatever reason it does not seem to be an issue when outputting SAMs, again regardless of either -so or -pfc).

- Reply to this email directly, view it on GitHubhttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Famplab%2Fsnap%2Fissues%2F146%23issuecomment-1267774094&data=05%7C01%7Cbolosky%40microsoft.com%7Ca7e5f1b6887f4071229708daa66b202f%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C638005276280297972%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=VYqHbrqDH9QG1JQYACZ9A9XlFZtIIzmtE9W0%2FGvUWKE%3D&reserved=0, or unsubscribehttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAAHPTWLUGVA7RTUPBETY6PTWBTFYTANCNFSM5LLLNQ4Q&data=05%7C01%7Cbolosky%40microsoft.com%7Ca7e5f1b6887f4071229708daa66b202f%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C638005276280297972%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=S0APzcFZcQiGmdjncQ95T4lbCeDOvDl5pcXjuQmo234%3D&reserved=0. You are receiving this because you commented.Message ID: @.**@.>>

eboyden commented 1 year ago

My input file that initially failed to produce bam output with -pfc and 2.0.2.dev.6, but worked with 2.0.2.dev.7, was only 1154 read pairs. My file that continued to fail (but worked with sam output) was 10M read pairs (20M reads). Truncating this to 5M, 2.5M, or 1.25M read pairs all failed with or without -pfc (regardless of whether they were from the top or bottom of the file), but 1M read pairs (and fewer) worked (again regardless of whether it was the first or last 1M). Aligning a fastq with 10M SE reads also worked. So insofar as I'm able to test, it seems like -pfc is working properly, and this other apparent bam output bug is likely unrelated.

bolosky commented 1 year ago

There was a bug in the -pfc code that didn't deal properly with tags crossing the end of the output buffer, which I fixed.

Much more interestingly, in looking for it I discovered a bug in BAM output that's existed in SNAP forever that would corrupt output when a compression block was larger than the uncompressed data. Presumably, we didn't see this very often because even a single SAM line is pretty compressible and so wouldn't tickle the bug. I only hit it because I was using a reference with only one contig, which sometimes generated a header that didn't compress (depending on the command line, which made this kind of tricky to debug).

Check it out this time and see if it works for you.

eboyden commented 1 year ago

It did indeed - I am impressed and grateful. I tried recapitulating the bug using publicly available GIAB fastqs and was unable to do so, and unfortunately I wasn't permitted to share my fastqs that repeatedly failed, so I wasn't sure how to better enable you to find it yourself. It had been suggested to me that I specify the hardware I was using, especially that it's a Linux server, in case it only affected the Linux distribution, but I guess that wasn't the case?

In any case, as far as I can tell, bam output is now working fine with or without -so, as is -pfc.

bolosky commented 1 year ago

You probably couldn't get it to happen with GIAB. I have aligned all of the GIAB samples many, many times and never saw it. I also aligned (pretty much) all of TCGA several times and also didn't see it.

It would only happen when a block compressed to a size larger than the input. I got it to happen with the header when I aligned with a reference with only one contig (which I made because my dev machine doesn't have enough memory). Even then, it sometimes happened and sometime didn't depending on what the command line was (because the @PG is in the header and affects compression). This was somewhat frustrating.

The data blocks will always include at least one whole read (or two for paired-end), and since SEQ is drawn from an alphabet of (usually) four characters, it will always compress pretty well. So, unless you have a really unusual read data blocks will never have this problem.

And it has nothing to do with Linux/Windows, it was in the machine independent code.

bolosky commented 1 year ago

In 2.0.2