amplab / snap

Scalable Nucleotide Alignment Program -- a fast and accurate read aligner for high-throughput sequencing data
https://www.microsoft.com/en-us/research/project/snap/
Apache License 2.0

why is -ku not default #39

Closed: wm75 closed this issue 9 years ago

wm75 commented 9 years ago

The latest behavior change, which drops reads that "appear" unmapped, does not seem logical to me. Reads with unspecified RNEXT and PNEXT are not strange; they occur in unmapped SAM files all the time.

You recommended using unmapped SAM files not long ago in https://github.com/amplab/snap/issues/11 and we prefer them generally over fastq format since it allows us to store metadata about sequencing runs in the file itself.

If anything, SNAP should inspect the FLAG fields of reads to tell whether they're paired or not.
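To illustrate the suggestion, here is a minimal sketch of reading pairing information from the FLAG field instead of from RNEXT/PNEXT. The bit values are the ones defined in the SAM specification; the function name `describe_read` and the example record are hypothetical, and this is not SNAP's code.

```python
# FLAG bits as defined in the SAM specification.
FLAG_PAIRED = 0x1          # template has multiple segments (read is paired)
FLAG_UNMAPPED = 0x4        # this read is unmapped
FLAG_MATE_UNMAPPED = 0x8   # the mate is unmapped
FLAG_FIRST_IN_PAIR = 0x40  # first segment in the template
FLAG_SECOND_IN_PAIR = 0x80 # last segment in the template


def describe_read(sam_line):
    """Summarize the pairing state of one tab-separated SAM record.

    Hypothetical helper for illustration only.
    """
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    return {
        "name": fields[0],
        "paired": bool(flag & FLAG_PAIRED),
        "unmapped": bool(flag & FLAG_UNMAPPED),
        "mate_unmapped": bool(flag & FLAG_MATE_UNMAPPED),
        "first_in_pair": bool(flag & FLAG_FIRST_IN_PAIR),
        "second_in_pair": bool(flag & FLAG_SECOND_IN_PAIR),
        "rnext": fields[6],       # "*" when unspecified
        "pnext": int(fields[7]),  # 0 when unspecified
    }


# Made-up record from an unmapped SAM file: FLAG 77 = 0x1 + 0x4 + 0x8 + 0x40,
# with RNEXT "*" and PNEXT 0, yet the FLAG still says the read is paired.
example = "read1\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII"
info = describe_read(example)
print(info["paired"], info["rnext"], info["pnext"])
```

The point of the sketch is only that a read can carry unspecified RNEXT/PNEXT and still be marked as paired in its FLAG.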

bolosky commented 9 years ago

-ku isn’t the default because for files with unmatched reads, or with reads whose mates are very far apart, using it can cause memory use to explode. I think that it’s probably less problematic to make people using unmapped SAM files specify -ku than to have folks run out of memory and have SNAP crash, so I’m inclined to leave it the way it is. I realize that neither solution is all that good.

As to your problems with MacOS/OSX, I’ll try to find a Mac I can use to test it and figure out what’s going on. We don’t have many of them here at Microsoft, and unlike Linux machines you can’t just spin up a VM in the cloud.

--Bill

From: Wolfgang Maier [mailto:notifications@github.com] Sent: Monday, December 8, 2014 7:06 AM To: amplab/snap Subject: [snap] why is -ku not default (#39)

wm75 commented 9 years ago

I see. I guess I'll have to get used to -ku then, but that's fine. Just out of curiosity: why wasn't that a problem with older versions? Or was it, and I just never ran into it?

We don't have that many Macs either and only recently upgraded two to Yosemite, which is why it took us a long time to see this. Hope you can reproduce it.

bolosky commented 9 years ago

The problem is that when you’re trying to match read ends and you don’t see a match, you have to hang on to the first end until the end of the alignment run before you can give up. This is fine if there aren’t unmatched reads (as in your unaligned SAM file case), but some input files have lots of them. If PNEXT/RNEXT are filled in and the input file is sorted, then once you get past the place where the read should have its mate and the mate isn’t there, you can drop the first end and free its memory.
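The memory argument above can be sketched with a toy mate-matching loop. This is an illustration under stated assumptions, not SNAP's implementation: the function `match_mates`, the `(name, pos, pnext)` record layout, and the `use_pnext` switch are all hypothetical, and the input is assumed to be sorted by position with PNEXT equal to 0 when unspecified.

```python
def match_mates(records, use_pnext):
    """Pair reads by name while tracking how many first ends are held.

    records: iterable of (name, pos, pnext) tuples sorted by pos.
    use_pnext: if True, drop a held read once the scan has passed the
               position where its mate should have appeared.
    Returns (pairs, dropped, leftover, peak_pending).
    """
    pending = {}   # name -> pnext of the end we are holding
    pairs = []     # names whose two ends were both seen
    dropped = []   # names given up on early thanks to PNEXT
    peak = 0       # largest number of ends held at once

    for name, pos, pnext in records:
        if use_pnext:
            # A held read with a known mate position (pnext > 0) that is
            # already behind the current position will never be matched.
            stale = [n for n, pn in pending.items() if 0 < pn < pos]
            for n in stale:
                del pending[n]
                dropped.append(n)
        if name in pending:
            del pending[name]
            pairs.append(name)
        else:
            pending[name] = pnext
            peak = max(peak, len(pending))

    # Whatever is still held could only be given up on at end of input.
    return pairs, dropped, sorted(pending), peak


# 1000 made-up reads whose mates are missing from the file, plus one real pair.
orphans = [("orphan%d" % i, i * 10 + 1, i * 10 + 5) for i in range(1000)]
pair = [("p", 20001, 20101), ("p", 20101, 20001)]
records = orphans + pair

_, _, _, peak_without = match_mates(records, use_pnext=False)
_, _, _, peak_with = match_mates(records, use_pnext=True)
print(peak_without, peak_with)
```

In this toy, the unmatched ends pile up when PNEXT is ignored, while using PNEXT on sorted input lets each one be released as soon as the scan moves past its expected mate position. Reads with PNEXT 0 (as in unmapped SAM files) can never be released early, which is why they all have to be kept until the end.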

I’m sure I’ll eventually come up with a Mac somewhere that I can use. They probably have one at Berkeley.

--Bill

From: Wolfgang Maier [mailto:notifications@github.com] Sent: Monday, December 8, 2014 9:29 AM To: amplab/snap Cc: Bill Bolosky Subject: Re: [snap] why is -ku not default (#39)
