geronimp / graftM

GraftM - Rapid community profiles from metagenomes
http://geronimp.github.io/graftM/
GNU General Public License v3.0
44 stars 16 forks source link

accept interleaved input #263

Closed jmeppley closed 4 years ago

jmeppley commented 5 years ago

I tweaked to code to allow interleaved input:

geronimp commented 5 years ago

Hey John thanks for this, its very helpful and makes graftM more useful. It all looks fine to me. Do you intend on adding more or can I merge.

wwood commented 5 years ago

Hi,

Thanks a lot for this John, good stuff.

I've added a few more tests in https://github.com/wwood/graftM/tree/jmeppley-interleaved and noticed that one actually fails legitimately. The search_otu_table.txt output file reports that 2 reads match using interleaved input, but 1 read when run with forward+reverse. @geronimp not sure if this is easily fixed?

There is also some added /1 and /2 in places that make the output different for interleaved output, but I think that probably makes sense because otherwise the two reads from a pair are put in a file together and therefore have the same name, which is badness.

I also had a quick and dirty benchmark, and it seems that the changes don't affect forward+reverse performance, though running interleaved is ~5% slower.

Thanks, ben

jmeppley commented 5 years ago

Thanks for filling out the test. I hadn't had a chance to put anything in beyond making sure it didn't spit out an error code.

As for the the search_otu_table.txt discrepancy, the /1 and /2 suffixes allow the reads to all go through the search step together in one file. They get removed in the merge step. If the interleaved is reporting extra hits than the forward/reverse run, then there is somewhere else where the suffixes need to be removed...I'll see if I can find it.

On Sun, May 19, 2019 at 11:05 PM Ben J Woodcroft notifications@github.com wrote:

Hi,

Thanks a lot for this John, good stuff.

I've added a few more tests in https://github.com/wwood/graftM/tree/jmeppley-interleaved and noticed that one actually fails legitimately. The search_otu_table.txt output file reports that 2 reads match using interleaved input, but 1 read when run with forward+reverse. @geronimp https://github.com/geronimp not sure if this is easily fixed?

There is also some added /1 and /2 in places that make the output different for interleaved output, but I think that probably makes sense because otherwise the two reads from a pair are put in a file together and therefore have the same name, which is badness.

I also had a quick and dirty benchmark, and it seems that the changes don't affect forward+reverse performance, though running interleaved is ~5% slower.

Thanks, ben

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/geronimp/graftM/pull/263?email_source=notifications&email_token=AANDP3DKIUNPVY4YM5JV7HDPWI5SNA5CNFSM4HNJOSKKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODVXY3MI#issuecomment-493850033, or mute the thread https://github.com/notifications/unsubscribe-auth/AANDP3ANW2RBB3T4ACA6QIDPWI5SNANCNFSM4HNJOSKA .

wwood commented 5 years ago

Hi @jmeppley thanks for following up on this. Do you imagine this is ready to merge now?

jmeppley commented 5 years ago

I think it's fairly solid, but I haven't had had a chance to look at the counting discrepancy yet.

On Thu, May 30, 2019, 10:16 PM Ben J Woodcroft notifications@github.com wrote:

Hi @jmeppley https://github.com/jmeppley thanks for following up on this. Do you imagine this is ready to merge now?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/geronimp/graftM/pull/263?email_source=notifications&email_token=AANDP3H6O7TYJ525BVZPE23PYCYEDA5CNFSM4HNJOSKKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODWUHGIY#issuecomment-497578787, or mute the thread https://github.com/notifications/unsubscribe-auth/AANDP3E6Z4RD2UZO35S6C5LPYCYEDANCNFSM4HNJOSKA .

jmeppley commented 5 years ago

I'm looking into the counting discrepancy in search_otu_table.txt and I find a bug that is not affected by this pull request.

When I run a pair of test files through graftM using --forward and --reverse, I get these counts (whether I use my code or the original code):

(/data/jmeppley/tmp/graftm/env) [jmeppley@aspartate gt_paired]$ find . -type f -name "*fa" -exec grep -H -c "^>" {} \;
./CHBIR00-20a-S2C09-0025_S12_R1_001/forward/CHBIR00-20a-S2C09-0025_S12_R1_001_forward_hits.fa:125
./CHBIR00-20a-S2C09-0025_S12_R1_001/forward/CHBIR00-20a-S2C09-0025_S12_R1_001_forward_orf.fa:125
./CHBIR00-20a-S2C09-0025_S12_R1_001/forward/CHBIR00-20a-S2C09-0025_S12_R1_001_forward_hits.aln.fa:108
./CHBIR00-20a-S2C09-0025_S12_R1_001/reverse/CHBIR00-20a-S2C09-0025_S12_R1_001_reverse_hits.fa:100
./CHBIR00-20a-S2C09-0025_S12_R1_001/reverse/CHBIR00-20a-S2C09-0025_S12_R1_001_reverse_orf.fa:100
./CHBIR00-20a-S2C09-0025_S12_R1_001/reverse/CHBIR00-20a-S2C09-0025_S12_R1_001_reverse_hits.aln.fa:92
./CHBIR00-20a-S2C09-0025_S12_R1_001/CHBIR00-20a-S2C09-0025_S12_R1_001_hits.aln.fa:117
(/data/jmeppley/tmp/graftm/env) [jmeppley@aspartate gt_paired]$ cat search_otu_table.txt 
#ID     CHBIR00-20a-S2C09-0025_S12_R1_001
graftmyIPssM.aln        100

The search_otu_table.txt is reporting only the number of raw reverse hits, but labelling them "R1". I don't know why. When run with --interleaved, it reports the raw hit count from the interleaved reads:

(/data/jmeppley/tmp/graftm/env) [jmeppley@aspartate gt_interl]$ find . -type f -name "*fa" -exec grep -H -c "^>" {} \;
./CHBIR00-20a-S2C09-0025_S12/interleaved/CHBIR00-20a-S2C09-0025_S12_interleaved_hits.fa:221
./CHBIR00-20a-S2C09-0025_S12/interleaved/CHBIR00-20a-S2C09-0025_S12_interleaved_orf.fa:221
./CHBIR00-20a-S2C09-0025_S12/interleaved/CHBIR00-20a-S2C09-0025_S12_interleaved_hits.aln.fa:198
./CHBIR00-20a-S2C09-0025_S12/CHBIR00-20a-S2C09-0025_S12_hits.aln.fa:115
(/data/jmeppley/tmp/graftm/env) [jmeppley@aspartate gt_interl]$ cat search_otu_table.txt 
#ID     CHBIR00-20a-S2C09-0025_S12
graftmyIPssM.aln        221

I think you can merge this PR.