Closed jmeppley closed 4 years ago
Hey John thanks for this, its very helpful and makes graftM more useful. It all looks fine to me. Do you intend on adding more or can I merge.
Hi,
Thanks a lot for this John, good stuff.
I've added a few more tests in https://github.com/wwood/graftM/tree/jmeppley-interleaved and noticed that one actually fails legitimately. The search_otu_table.txt
output file reports that 2 reads match using interleaved input, but 1 read when run with forward+reverse. @geronimp not sure if this is easily fixed?
There is also some added /1
and /2
in places that make the output different for interleaved output, but I think that probably makes sense because otherwise the two reads from a pair are put in a file together and therefore have the same name, which is badness.
I also had a quick and dirty benchmark, and it seems that the changes don't affect forward+reverse performance, though running interleaved is ~5% slower.
Thanks, ben
Thanks for filling out the test. I hadn't had a chance to put anything in beyond making sure it didn't spit out an error code.
As for the the search_otu_table.txt discrepancy, the /1 and /2 suffixes allow the reads to all go through the search step together in one file. They get removed in the merge step. If the interleaved is reporting extra hits than the forward/reverse run, then there is somewhere else where the suffixes need to be removed...I'll see if I can find it.
On Sun, May 19, 2019 at 11:05 PM Ben J Woodcroft notifications@github.com wrote:
Hi,
Thanks a lot for this John, good stuff.
I've added a few more tests in https://github.com/wwood/graftM/tree/jmeppley-interleaved and noticed that one actually fails legitimately. The search_otu_table.txt output file reports that 2 reads match using interleaved input, but 1 read when run with forward+reverse. @geronimp https://github.com/geronimp not sure if this is easily fixed?
There is also some added /1 and /2 in places that make the output different for interleaved output, but I think that probably makes sense because otherwise the two reads from a pair are put in a file together and therefore have the same name, which is badness.
I also had a quick and dirty benchmark, and it seems that the changes don't affect forward+reverse performance, though running interleaved is ~5% slower.
Thanks, ben
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/geronimp/graftM/pull/263?email_source=notifications&email_token=AANDP3DKIUNPVY4YM5JV7HDPWI5SNA5CNFSM4HNJOSKKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODVXY3MI#issuecomment-493850033, or mute the thread https://github.com/notifications/unsubscribe-auth/AANDP3ANW2RBB3T4ACA6QIDPWI5SNANCNFSM4HNJOSKA .
Hi @jmeppley thanks for following up on this. Do you imagine this is ready to merge now?
I think it's fairly solid, but I haven't had had a chance to look at the counting discrepancy yet.
On Thu, May 30, 2019, 10:16 PM Ben J Woodcroft notifications@github.com wrote:
Hi @jmeppley https://github.com/jmeppley thanks for following up on this. Do you imagine this is ready to merge now?
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/geronimp/graftM/pull/263?email_source=notifications&email_token=AANDP3H6O7TYJ525BVZPE23PYCYEDA5CNFSM4HNJOSKKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODWUHGIY#issuecomment-497578787, or mute the thread https://github.com/notifications/unsubscribe-auth/AANDP3E6Z4RD2UZO35S6C5LPYCYEDANCNFSM4HNJOSKA .
I'm looking into the counting discrepancy in search_otu_table.txt
and I find a bug that is not affected by this pull request.
When I run a pair of test files through graftM using --forward and --reverse, I get these counts (whether I use my code or the original code):
(/data/jmeppley/tmp/graftm/env) [jmeppley@aspartate gt_paired]$ find . -type f -name "*fa" -exec grep -H -c "^>" {} \;
./CHBIR00-20a-S2C09-0025_S12_R1_001/forward/CHBIR00-20a-S2C09-0025_S12_R1_001_forward_hits.fa:125
./CHBIR00-20a-S2C09-0025_S12_R1_001/forward/CHBIR00-20a-S2C09-0025_S12_R1_001_forward_orf.fa:125
./CHBIR00-20a-S2C09-0025_S12_R1_001/forward/CHBIR00-20a-S2C09-0025_S12_R1_001_forward_hits.aln.fa:108
./CHBIR00-20a-S2C09-0025_S12_R1_001/reverse/CHBIR00-20a-S2C09-0025_S12_R1_001_reverse_hits.fa:100
./CHBIR00-20a-S2C09-0025_S12_R1_001/reverse/CHBIR00-20a-S2C09-0025_S12_R1_001_reverse_orf.fa:100
./CHBIR00-20a-S2C09-0025_S12_R1_001/reverse/CHBIR00-20a-S2C09-0025_S12_R1_001_reverse_hits.aln.fa:92
./CHBIR00-20a-S2C09-0025_S12_R1_001/CHBIR00-20a-S2C09-0025_S12_R1_001_hits.aln.fa:117
(/data/jmeppley/tmp/graftm/env) [jmeppley@aspartate gt_paired]$ cat search_otu_table.txt
#ID CHBIR00-20a-S2C09-0025_S12_R1_001
graftmyIPssM.aln 100
The search_otu_table.txt
is reporting only the number of raw reverse hits, but labelling them "R1". I don't know why. When run with --interleaved, it reports the raw hit count from the interleaved reads:
(/data/jmeppley/tmp/graftm/env) [jmeppley@aspartate gt_interl]$ find . -type f -name "*fa" -exec grep -H -c "^>" {} \;
./CHBIR00-20a-S2C09-0025_S12/interleaved/CHBIR00-20a-S2C09-0025_S12_interleaved_hits.fa:221
./CHBIR00-20a-S2C09-0025_S12/interleaved/CHBIR00-20a-S2C09-0025_S12_interleaved_orf.fa:221
./CHBIR00-20a-S2C09-0025_S12/interleaved/CHBIR00-20a-S2C09-0025_S12_interleaved_hits.aln.fa:198
./CHBIR00-20a-S2C09-0025_S12/CHBIR00-20a-S2C09-0025_S12_hits.aln.fa:115
(/data/jmeppley/tmp/graftm/env) [jmeppley@aspartate gt_interl]$ cat search_otu_table.txt
#ID CHBIR00-20a-S2C09-0025_S12
graftmyIPssM.aln 221
I think you can merge this PR.
I tweaked to code to allow interleaved input: