amplab / snap

Scalable Nucleotide Alignment Program -- a fast and accurate read aligner for high-throughput sequencing data
https://www.microsoft.com/en-us/research/project/snap/
Apache License 2.0
288 stars 66 forks source link

Don't require /1 /2 headers #142

Closed bwlang closed 2 years ago

bwlang commented 2 years ago

read headers from our instruments look like this:

@M01873:M01873:K27HF:1:1101:10000:22589 1:N:0:TTCCCGAA+TTGCGTTA
@M01873:M01873:K27HF:1:1101:10000:22589 2:N:0:TTCCCGAA+TTGCGTTA

I had to get things working like this: seqtk mergepe omicron_lh01_F12.1.fastq.gz omicron_lh01_F12.2.fastq.gz| gsed -r 's! ([12]):!/\1 \1:!' | ~/src/snap/snap-aligner paired index -pairedInterleavedFastq - -so -o lh01_snap.bam see also #132

bolosky commented 2 years ago

I think this is just an issue with paired, interleaved FASTQ. Since your input files are separate, have you tried:

~/src/snap/snap-aligner paired index omicron_lh01_F12.1.fastq.gz omicron_lh01_F12.2.fastq.gz -so -o lh01_snap.bam

Fewer steps and it should work.

I should also fix this. The paired interleaved FASTQ reader is set up to run multi-threaded to avoid bottlenecking on input. When it does that, it just seeks into the middle of the input file and needs to be able to determine whether the first read it sees is the first of second end of a pair, which is why it's looking for the /1 or /2. However, it does this even when it's running single threaded (because the input is compressed or from stdin) and there's no need there.

I should probably just turn off the multi-threaded mode altogether. At the very least, it shouldn't do the check for /1 /2 when it's not in multi-threaded mode.

From: Brad Langhorst @.> Sent: Monday, December 27, 2021 1:32 PM To: amplab/snap @.> Cc: Subscribed @.***> Subject: [amplab/snap] Don't require /1 /2 headers (Issue #142)

read headers from our instruments look like this:

@M01873:M01873:K27HF:1:1101:10000:22589 1:N:0:TTCCCGAA+TTGCGTTA

@M01873:M01873:K27HF:1:1101:10000:22589 2:N:0:TTCCCGAA+TTGCGTTA

I had to get things working like this: seqtk mergepe omicron_lh01_F12.1.fastq.gz omicron_lh01_F12.2.fastq.gz| gsed -r 's! ([12]):!/\1 \1:!' | ~/src/snap/snap-aligner paired index -pairedInterleavedFastq - -so -o lh01_snap.bam see also #132https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Famplab%2Fsnap%2Fissues%2F132&data=04%7C01%7Cbolosky%40microsoft.com%7C0708efb4123c4f3ebae408d9c9804629%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637762375053094759%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=AT81FabKFMp%2B3BOgH0GmWA45TUDV%2Fpj%2F5pVtZaQjYWg%3D&reserved=0

- Reply to this email directly, view it on GitHubhttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Famplab%2Fsnap%2Fissues%2F142&data=04%7C01%7Cbolosky%40microsoft.com%7C0708efb4123c4f3ebae408d9c9804629%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637762375053094759%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=sih403eCCM2KIgi%2BsFQUQQRy05ain%2B7dCBAbT%2BkVQGY%3D&reserved=0, or unsubscribehttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAAHPTWNLRB3CUSYP72CTUNTUTDLL3ANCNFSM5K272DWA&data=04%7C01%7Cbolosky%40microsoft.com%7C0708efb4123c4f3ebae408d9c9804629%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637762375053094759%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=r5jNXZrbhOkU%2Bsl56dlxZOjG6oLYR2pBwgYNiq1Yqc8%3D&reserved=0. Triage notifications on the go with GitHub Mobile for iOShttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fapps.apple.com%2Fapp%2Fapple-store%2Fid1477376905%3Fct%3Dnotification-email%26mt%3D8%26pt%3D524675&data=04%7C01%7Cbolosky%40microsoft.com%7C0708efb4123c4f3ebae408d9c9804629%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637762375053094759%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=udqP5mvqI04gN1P7Oo1Jx314CIfGNVWyuFDB%2FzSZxJg%3D&reserved=0 or Androidhttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fplay.google.com%2Fstore%2Fapps%2Fdetails%3Fid%3Dcom.github.android%26referrer%3Dutm_campaign%253Dnotification-email%2526utm_medium%253Demail%2526utm_source%253Dgithub&data=04%7C01%7Cbolosky%40microsoft.com%7C0708efb4123c4f3ebae408d9c9804629%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637762375053094759%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=dRkjK3f7OutqxQOgRjXAcK%2FWel0UaRFri7f32RP0K9o%3D&reserved=0. You are receiving this because you are subscribed to this thread.Message ID: @.**@.>>

bwlang commented 2 years ago

I tried that first (without interleaving) but I get this on stdout:

Welcome to SNAP version 2.0.0.

GzipDataReader: inflate failed with -3
-3 is Z_DATA_ERROR, indicating a corrupt input.  zstream->availIn is 0x3feffe:
0x0      30 31 38 37 33 3a 4d 30 31 38 37 33 3a 4b 32 37 48 46 3a 31 3a 31 31 30 31 3a 31 30 30 30 30 3a
...

I assumed that was related to #132, but maybe not... the fastq.gz files are symlinks in case that's relevant.

I think the files are ok since seqtk can consume them.

You could look for 1: and 2: instead of /1 and /2 maybe

I can share the fastq.gz files if you want

bolosky commented 2 years ago

That's not related to #132. That's just that we don't take compressed FASTA for index building (which we should, we just never implemented it).

I've never seen the error you're reporting unless the input file was corrupt. I'd be happy to look at your file and see if I can repro it.

If you also share the reference then we can look at your other issue (with the soft clipping), too.

--Bill

From: Brad Langhorst @.> Sent: Monday, December 27, 2021 6:12 PM To: amplab/snap @.> Cc: Bill Bolosky @.>; Comment @.> Subject: Re: [amplab/snap] Don't require /1 /2 headers (Issue #142)

I tried that first (without interleaving) but I get this on stdout:

Welcome to SNAP version 2.0.0.

GzipDataReader: inflate failed with -3

-3 is Z_DATA_ERROR, indicating a corrupt input. zstream->availIn is 0x3feffe:

0x0 30 31 38 37 33 3a 4d 30 31 38 37 33 3a 4b 32 37 48 46 3a 31 3a 31 31 30 31 3a 31 30 30 30 30 3a

...

I assumed that was related to #132https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Famplab%2Fsnap%2Fissues%2F132&data=04%7C01%7Cbolosky%40microsoft.com%7C1256a1864b524933536f08d9c9a779ea%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637762543416527798%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=ObYctreSEQP%2BVrBxqyBVZsPNJacIddeYeOcLy0uqow0%3D&reserved=0, but maybe not... the fastq.gz files are symlinks in case that's relevant.

I think the files are ok since seqtk can consume them.

You could look for 1: and 2: instead of /1 and /2 maybe

I can share the fastq.gz files if you want

- Reply to this email directly, view it on GitHubhttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Famplab%2Fsnap%2Fissues%2F142%23issuecomment-1001832560&data=04%7C01%7Cbolosky%40microsoft.com%7C1256a1864b524933536f08d9c9a779ea%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637762543416527798%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=jzyNn%2FYcZ%2F8k6Xo%2F3H1bezMmf%2Fd%2FfT9wuKO1hPP3Thg%3D&reserved=0, or unsubscribehttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAAHPTWIHIPAYNMISRS4GEDLUTEMIFANCNFSM5K272DWA&data=04%7C01%7Cbolosky%40microsoft.com%7C1256a1864b524933536f08d9c9a779ea%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637762543416527798%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=VApOuhUQMgDkQA96V%2F8RcbWR6KPSNhyOBPjy2%2F9zcCs%3D&reserved=0. Triage notifications on the go with GitHub Mobile for iOShttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fapps.apple.com%2Fapp%2Fapple-store%2Fid1477376905%3Fct%3Dnotification-email%26mt%3D8%26pt%3D524675&data=04%7C01%7Cbolosky%40microsoft.com%7C1256a1864b524933536f08d9c9a779ea%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637762543416577787%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=uk54jMAxCzFbzHMJHs12XsELi%2BQjudRTvAhPvaeywls%3D&reserved=0 or Androidhttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fplay.google.com%2Fstore%2Fapps%2Fdetails%3Fid%3Dcom.github.android%26referrer%3Dutm_campaign%253Dnotification-email%2526utm_medium%253Demail%2526utm_source%253Dgithub&data=04%7C01%7Cbolosky%40microsoft.com%7C1256a1864b524933536f08d9c9a779ea%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637762543416577787%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=WxY4cLUnogs9bUJmNnyyefj7TPX1nU1wJWIk9cJcHTg%3D&reserved=0. You are receiving this because you commented.Message ID: @.**@.>>

bwlang commented 2 years ago

sure - here's the data https://neb-collaborations.s3.amazonaws.com/lh01_snap_test.tar.gz

bolosky commented 2 years ago

Your input files, while they have a .gz extension, aren't compressed. They're just plain ASCII FASTQ files. Hence SNAP's confusion.

If for some reason your pipeline makes uncompressed files with .gz extensions (which it shouldn't), you can always use -fastq to override SNAP's parsing of the file extension to determine the type.

--Bill

From: Brad Langhorst @.> Sent: Monday, December 27, 2021 6:36 PM To: amplab/snap @.> Cc: Bill Bolosky @.>; Comment @.> Subject: Re: [amplab/snap] Don't require /1 /2 headers (Issue #142)

sure - here's the data https://neb-collaborations.s3.amazonaws.com/lh01_snap_test.tar.gzhttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fneb-collaborations.s3.amazonaws.com%2Flh01_snap_test.tar.gz&data=04%7C01%7Cbolosky%40microsoft.com%7Cb2d125a86c6b4779510208d9c9aad339%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637762558398592584%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=NUYxVDlhHRn8Hbrq%2BoAldfkoP1pgDiTPWGiq2Lsn8Hg%3D&reserved=0

- Reply to this email directly, view it on GitHubhttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Famplab%2Fsnap%2Fissues%2F142%23issuecomment-1001837989&data=04%7C01%7Cbolosky%40microsoft.com%7Cb2d125a86c6b4779510208d9c9aad339%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637762558398592584%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=itAhdMQR9NTEh44UovoIiGGrI8jD4T6IEHzir%2FRzMvc%3D&reserved=0, or unsubscribehttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAAHPTWKENKXF2ZZXJDZZGI3UTEPCBANCNFSM5K272DWA&data=04%7C01%7Cbolosky%40microsoft.com%7Cb2d125a86c6b4779510208d9c9aad339%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637762558398592584%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=G%2FRMMlivQE1T%2B0psI3GSwMWyKKasfPkE4Afrbs6wpaY%3D&reserved=0. Triage notifications on the go with GitHub Mobile for iOShttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fapps.apple.com%2Fapp%2Fapple-store%2Fid1477376905%3Fct%3Dnotification-email%26mt%3D8%26pt%3D524675&data=04%7C01%7Cbolosky%40microsoft.com%7Cb2d125a86c6b4779510208d9c9aad339%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637762558398592584%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=AwBCmcJLdsOahC1kkYcKtYWQRA795Qsd6dFsbJtLXbI%3D&reserved=0 or Androidhttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fplay.google.com%2Fstore%2Fapps%2Fdetails%3Fid%3Dcom.github.android%26referrer%3Dutm_campaign%253Dnotification-email%2526utm_medium%253Demail%2526utm_source%253Dgithub&data=04%7C01%7Cbolosky%40microsoft.com%7Cb2d125a86c6b4779510208d9c9aad339%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637762558398642532%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=O6zPkbzz%2F5AxrRzICiu61RVPWpPprX68Hi0TYEuuXDo%3D&reserved=0. You are receiving this because you commented.Message ID: @.**@.>>

bwlang commented 2 years ago

Ah - so they are! I can confirm that things work as 2 arguments when I recompress them. Thanks!

Back to the original topic... I suspect it would be good not to assume a specific read header format for the interleaved input case. Read from SRA and other sources could cause trouble.

bolosky commented 2 years ago

I pushed a fix for it to the dev branch.

Do:

git checkout dev make clean make

And you should get a SNAP binary with version number 2.0.1.dev.0. It should no longer care about the read name format for paired interleaved FASTQ (though in some situations where you have lots of cores and/or very fast per-read alignments it may be slower).

Let me know if it solves your problem.

--Bill

From: Brad Langhorst @.> Sent: Monday, December 27, 2021 7:25 PM To: amplab/snap @.> Cc: Bill Bolosky @.>; Comment @.> Subject: Re: [amplab/snap] Don't require /1 /2 headers (Issue #142)

Ah - so they are! I can confirm that things work as 2 arguments when I recompress them. Thanks!

Back to the original topic... I suspect it would be good not to assume a specific read header format for the interleaved input case. Read from SRA and other sources could cause trouble.

- Reply to this email directly, view it on GitHubhttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Famplab%2Fsnap%2Fissues%2F142%23issuecomment-1001849151&data=04%7C01%7Cbolosky%40microsoft.com%7Cc9db173b4b0542024b0a08d9c9b19d59%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637762586960643242%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=v5sh8R0EaAelR5YbNaiCuXj1bvKESotCi%2BUYhKFg%2F08%3D&reserved=0, or unsubscribehttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAAHPTWLWFOQAZSYKXCOO4EDUTEUYLANCNFSM5K272DWA&data=04%7C01%7Cbolosky%40microsoft.com%7Cc9db173b4b0542024b0a08d9c9b19d59%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637762586960693681%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=HxobnWJaIXVm3%2FbNtmgs6OdT2zYrlfRMEJ3z7821vTw%3D&reserved=0. Triage notifications on the go with GitHub Mobile for iOShttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fapps.apple.com%2Fapp%2Fapple-store%2Fid1477376905%3Fct%3Dnotification-email%26mt%3D8%26pt%3D524675&data=04%7C01%7Cbolosky%40microsoft.com%7Cc9db173b4b0542024b0a08d9c9b19d59%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637762586960693681%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=jWIVZ6lzpgTLGe3%2FDpjh8abLUUlyJHCkUcBGBuih%2BzM%3D&reserved=0 or Androidhttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fplay.google.com%2Fstore%2Fapps%2Fdetails%3Fid%3Dcom.github.android%26referrer%3Dutm_campaign%253Dnotification-email%2526utm_medium%253Demail%2526utm_source%253Dgithub&data=04%7C01%7Cbolosky%40microsoft.com%7Cc9db173b4b0542024b0a08d9c9b19d59%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637762586960693681%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=EnXHFCMvhuOMqLCy9R0YBdEAv1ufJSP%2FKgbGunt4Qeo%3D&reserved=0. You are receiving this because you commented.Message ID: @.**@.>>

bwlang commented 2 years ago

I can confirm that the dev branch version is able to align interleaved reads on a pipe without the /1 /2 sed hack. Thanks for the quick action and sorry for the gz/non-gz distraction!

fixed by a4ee7154bb76afa0fc7aa2aa6718e84293dc837b