geronimp / graftM

GraftM - Rapid community profiles from metagenomes
http://geronimp.github.io/graftM/
GNU General Public License v3.0

Combined alignment file intermittently produced #254

Closed babakshaban closed 5 years ago

babakshaban commented 6 years ago

Hi,

I have run two paired-end samples in GraftM but am only occasionally able to produce a combined alignment file.

GraftM recognises the files as paired end but doesn't perform the combined aligning stage.

I don't use the --no_merge_reads flag even though I am using paired end reads.

Here is the command I am using

graftM graft --forward ${fastq1} --reverse ${fastq2} --graftm_package ${refPackage} --output_directory eg.graftm --threads ${graftMThreads}

I am using WDL to create a pipeline, and the reads are passed in through the variables shown above. The forward and reverse reads are aligned individually, but they don't seem to be merged.

babakshaban commented 6 years ago

Also, is there any way to obtain the results of OrfM?

maglau commented 6 years ago

Has this issue been solved? I also have the same problem. I have multiple sets of paired files, and only some of them work. I checked the original F and R read files, and they look ordinary.

wwood commented 6 years ago

Hi there, Thanks for your interest in GraftM, and apologies for the slow reply. Is there some set of reads that we can use to repeat this error for us locally? ta, ben

maglau commented 6 years ago

Hi Ben,

Thank you for following up.

I am doing some batch analysis, and sometimes an IO error occurs when analyzing paired files. For some of them, I confirmed that graftM graft was able to generate results from the F reads and the R reads alone (using the “--forward” option in both cases), but not when using “--forward” and “--reverse” to analyze the F and R reads in a single command.

I am sharing with you the input files, the output directory, and the standard output containing the error message: IOError: [Errno 2] No such file or directory: 'mgm4536383.3.postQC_pairs_graftM/mgm4536383.3.postQC_pairs_F/mgm4536383.3.postQC_pairs_F_hits.aln.fa'

Please find the following files at http://tigress-web.princeton.edu/~maglau

-rw-r--r--. 1 maglau geo 2239449105 Nov 26 00:10 mgm4536383.3.postQC_pairs_F.fastq
drwxr-sr-x. 3 maglau geo 512 Nov 26 00:14 mgm4536383.3.postQC_pairs_graftM/
-rw-r--r--. 1 maglau geo 2238048355 Nov 26 00:10 mgm4536383.3.postQC_pairs_R.fastq
-rw-r--r--. 1 maglau geo 2186 Nov 26 00:16 std_output_errormsg

Cheers, Maggie

Maggie C.Y. Lau, PhD Visiting Collaborator Department of Geosciences B80 Guyot Hall Princeton University Princeton, NJ 08544, US

Professor Laboratory of Extraterrestrial Ocean Systems Institute of Deep-Sea Science and Engineering, Chinese Academy of Sciences No. 28, Luhuitou Road, Sanya 572000, Hainan Province, P.R. China Cell: +86 13034986006 (China) +1 609-356-8145 (WhatsApp & WeChat)

wwood commented 6 years ago

Hi again, Sorry, would you also be able to share or point to the GraftM package that was used? ta

maglau commented 6 years ago

Do you think it was specific to our GraftM package? It ran fine on >100 datasets. How could I share the package with you privately?

wwood commented 6 years ago

Hi, Sorry to be a pain. I'm not sure whether the error is specific to your package or not, but running GraftM as closely as possible to the way you ran it is probably a good way to debug the underlying issue.

You could try sharing it with me via email or, if it is too big for that, via Google Drive or Dropbox? Try the email address you get by putting uq.edu.au after b.woodcroft

Thanks.

wwood commented 6 years ago

Hi again, I just tried it with an McrA graftm package, and got this:

11/30/2018 08:46:25 AM INFO: Working on mgm4536383.3.postQC_pairs_F
11/30/2018 08:46:25 AM INFO: Working on forward reads
11/30/2018 08:46:52 AM INFO: 1 read(s) detected
11/30/2018 08:46:52 AM INFO: aligning reads to reference package database
11/30/2018 08:46:52 AM INFO: Filtered 1 short sequences from the alignment
11/30/2018 08:46:52 AM INFO: 0 sequences remaining
11/30/2018 08:46:52 AM INFO: No more aligned sequences to place!
11/30/2018 08:46:52 AM INFO: Working on reverse reads
11/30/2018 08:47:19 AM INFO: 1 read(s) detected
11/30/2018 08:47:19 AM INFO: aligning reads to reference package database
11/30/2018 08:47:19 AM INFO: Filtered 0 short sequences from the alignment
11/30/2018 08:47:19 AM INFO: 1 sequences remaining
Traceback (most recent call last):
  File "/gnu/store/qcbyp8ixqhb72y4ykjqrn3xwr9g5i0v2-graftm-0.11.1/bin/..graftM-real-real", line 409, in <module>
    Run(args).main()
  File "/gnu/store/qcbyp8ixqhb72y4ykjqrn3xwr9g5i0v2-graftm-0.11.1/lib/python2.7/site-packages/graftm/run.py", line 588, in main
    self.graft()
  File "/gnu/store/qcbyp8ixqhb72y4ykjqrn3xwr9g5i0v2-graftm-0.11.1/lib/python2.7/site-packages/graftm/run.py", line 483, in graft
    seqs_list=clusterer.cluster(seqs_list, REVERSE_PIPE)
  File "/gnu/store/qcbyp8ixqhb72y4ykjqrn3xwr9g5i0v2-graftm-0.11.1/lib/python2.7/site-packages/graftm/clusterer.py", line 82, in cluster
    reads=self.seqio.read_fasta_file(input_fasta) # Read in FASTA records
  File "/gnu/store/qcbyp8ixqhb72y4ykjqrn3xwr9g5i0v2-graftm-0.11.1/lib/python2.7/site-packages/graftm/sequence_io.py", line 49, in read_fasta_file
    for name, seq, _ in self.each(open(path_to_fasta_file)):
IOError: [Errno 2] No such file or directory: 'paired_bug_test/mgm4536383.3.postQC_pairs_F/mgm4536383.3.postQC_pairs_F_hits.aln.fa'

So, what is happening here is that a single read is detected in both directions, but then one is being filtered out, which causes the error. This is a bug in GraftM and should be fixed. Are you observing something similar?

Thanks.

maglau commented 5 years ago

Hi Ben,

Yes, I got a similar error:

Traceback (most recent call last):
  File "/tigress/zgarvin/bin/graftM", line 409, in <module>
    Run(args).main()
  File "/home/zgarvin/local/lib/python2.7/site-packages/graftm/run.py", line 588, in main
    self.graft()
  File "/home/zgarvin/local/lib/python2.7/site-packages/graftm/run.py", line 483, in graft
    seqs_list=clusterer.cluster(seqs_list, REVERSE_PIPE)
  File "/home/zgarvin/local/lib/python2.7/site-packages/graftm/clusterer.py", line 82, in cluster
    reads=self.seqio.read_fasta_file(input_fasta) # Read in FASTA records
  File "/home/zgarvin/local/lib/python2.7/site-packages/graftm/sequence_io.py", line 49, in read_fasta_file
    for name, seq, _ in self.each(open(path_to_fasta_file)):
IOError: [Errno 2] No such file or directory: 'mgm4536383.3.postQC_pairs_graftM/mgm4536383.3.postQC_pairs_F/mgm4536383.3.postQC_pairs_F_hits.aln.fa'

It’s great that you have identified the problem. I hope this bug will be fixed in a later version of GraftM. In the meantime, is there a way to work around this problem at the user’s end?

Thanks, Maggie
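
A possible stopgap at the user's end, sketched under assumptions: earlier in this thread the F and R files each ran fine on their own when passed via --forward, so a wrapper could try the paired run first and fall back to two single-file runs if it exits with an error. The function names and the `_F_only` / `_R_only` directory suffixes are hypothetical and not part of GraftM; only flags that appear in the original command are used.

```python
import subprocess
from typing import Callable, Dict, List, Sequence


def build_commands(forward: str, reverse: str, package: str, outdir: str) -> Dict[str, list]:
    """Build the paired command plus two single-file fallback commands."""
    base = ["graftM", "graft", "--graftm_package", package]
    return {
        "paired": base + ["--forward", forward, "--reverse", reverse,
                          "--output_directory", outdir],
        # Fallbacks: each file is passed via --forward, as reported in this thread.
        # The directory suffixes are illustrative choices, not GraftM conventions.
        "fallback": [
            base + ["--forward", forward, "--output_directory", outdir + "_F_only"],
            base + ["--forward", reverse, "--output_directory", outdir + "_R_only"],
        ],
    }


def run_with_fallback(commands: Dict[str, list],
                      runner: Callable[[Sequence[str]], int] = subprocess.call) -> str:
    """Try the paired run; if it exits non-zero, run each read file on its own."""
    if runner(commands["paired"]) == 0:
        return "paired"
    for cmd in commands["fallback"]:
        if runner(cmd) != 0:
            raise RuntimeError("fallback run failed: " + " ".join(cmd))
    return "fallback"
```

The fallback produces two independent single-file runs, so the results are not combined the way a successful paired run would be.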

wwood commented 5 years ago

Hi, I believe I fixed this in ba6f11f (https://github.com/geronimp/graftM/commit/ba6f11f2fc58f47ea512eff336aa93169e056fe3), which you can download from here: https://github.com/wwood/graftM/tree/fixes_dec2018

GraftM doesn't actually need to be installed to run it - it is possible to simply run the "graftM" script from the bin directory. This or something like it will hopefully be released in mainline GraftM soon. I'm closing this now since hopefully it is fixed - I can reopen if there is some problem you come across.

Hope that helps - happy new year. ben

maglau commented 5 years ago

Thank you Ben. I will give it a try later.

I have also noticed another thing. Forward and reverse reads with sequence titles >abc.1 and >abc.2 were counted as 2 hits, but if they were named >abc 1:N:0 and >abc 2:N:0, they were counted as 1 hit. I have worked around this by fixing the sequence titles in the input file. You may consider modifying the parser scripts in GraftM or adding a note to alert users.
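
A minimal sketch of the kind of title fix described above, not the script actually used. It assumes plain 4-line FASTQ records (no wrapped sequence lines) and titles that consist only of a name ending in `.1` or `.2`; the function names are hypothetical, and the `1:N:0` suffix simply copies the title style reported above as being counted as one hit.

```python
import re
from typing import Iterable, Iterator

# Matches a title that is a single token ending in ".1" or ".2" (e.g. "abc.1").
_SUFFIX = re.compile(r"^(?P<name>\S+)\.(?P<mate>[12])$")


def normalize_title(title: str) -> str:
    """Turn 'abc.1' into 'abc 1:N:0'; leave any other title unchanged."""
    stripped = title.strip()
    match = _SUFFIX.match(stripped)
    if match is None:
        return stripped
    return "{} {}:N:0".format(match.group("name"), match.group("mate"))


def rewrite_fastq(lines: Iterable[str]) -> Iterator[str]:
    """Rewrite only the header line (every 4th line) of each FASTQ record."""
    for index, line in enumerate(lines):
        line = line.rstrip("\n")
        # Quality lines may also start with '@', so rely on record position.
        if index % 4 == 0 and line.startswith("@"):
            yield "@" + normalize_title(line[1:])
        else:
            yield line
```

To use it, read the input file line by line, pass the lines to `rewrite_fastq`, and write each yielded line followed by a newline to a new file. Titles that carry extra description text after the name are left untouched by this sketch.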

Happy holidays~~~

wwood commented 5 years ago

Hi, yes you are right, GraftM does not recognize >abc.1 and >abc.2 as being paired. I'm curious, where do such names come from? From a particular sequencing machine, perhaps?

Unfortunately the framework of GraftM does not make it easy to associate reads together as pairs, because each file of the pair is provided to HMMER (or DIAMOND) separately.

maglau commented 5 years ago

They came from some files I found on MG-RAST. A note to users about the format of sequence titles would be useful. You may already have given such tips and I overlooked them.
