Dear Young-Ho,
Indeed, in that tutorial I only showcase the simpler crossMapSeries subworkflow. For the crossMapParallel subworkflow you will indeed need to run kallistoIndex and kallisto2concoct. In short, the kallistoIndex rule generates an index file for each assembly, which is then used by the kallisto tool within the crossMapParallel rule. Finally, the kallisto2concoct rule aggregates individual mapping files into tables showing contig coverage across samples. For more details I can point you towards the methods section of the metaGEM paper detailing the TARA Oceans data analysis, as we used the parallel mapping subworkflow for this dataset.
For the TARA oceans dataset, the kallisto v0.46.1 tool was used for mapping quality controlled paired end reads to assemblies. First, each assembly was cut up into 10 kb chunks using the CONCOCT ‘cut_up_fasta.py’ script with parameters ‘-c 10000 -o 0 -m’. The cut up assembly was then used to generate a kallisto index using the ‘kallisto index’ command with default settings. Next, the ‘kallisto quant’ command was used with the ‘–plaintext’ setting to cross map each sample set of quality controlled paired end reads to each cut up assembly. Finally, the ‘kallisto2concoct.py’ script was used to summarize the mapping results across samples for each set of assembled contigs. For the TARA oceans dataset the binners were used as described above, but only CONCOCT used contig coverage across samples as input, while MetaBAT2 and MaxBin2 only used contig coverage from the assembled sample as input.
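For orientation, here is a minimal shell sketch of those four steps for one assembly and one sample (a sketch only: file names are placeholders, and the kallisto2concoct.py arguments are an assumption, so check the Snakefile rules for the exact paths and flags):
# 1. Cut the assembly into 10 kb chunks (CONCOCT helper script, parameters as in the paper)
cut_up_fasta.py contigs.fa -c 10000 -o 0 -m > contigs_10K.fa
# 2. Build a kallisto index from the cut-up assembly (kallistoIndex rule)
kallisto index -i contigs_10K.idx contigs_10K.fa
# 3. Cross-map one sample's quality-filtered paired-end reads (crossMapParallel rule)
kallisto quant --plaintext -i contigs_10K.idx -o sample1_mapping sample1_R1.fastq.gz sample1_R2.fastq.gz
# 4. Aggregate the per-sample abundance files into a coverage table (kallisto2concoct rule)
python kallisto2concoct.py --samplenames samplenames.txt sample*/abundance.tsv > coverage_table.tsv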
Please let me know if you have further questions and thank you for trying out our pipeline! 💎
Best wishes, Francisco
Francisco, thank you for your answer.
I've tried crossMapParallel with the following task flow:
fastp > megahit > kallistoIndex > crossMapParallel > kallisto2concoct
There was no error until crossMapParallel, but kallisto2concoct did not work properly, with the following error messages:
Building DAG of jobs...
MissingInputException in line 20 of /home2/channelbiome/metaGEM/Snakefile:
Missing input files for rule all:
affected files:
/home2/channelbiome/metaGEM/concoct/sample3/cov/coverage_table.tsv
/home2/channelbiome/metaGEM/concoct/sample2/cov/coverage_table.tsv
/home2/channelbiome/metaGEM/concoct/sample1/cov/coverage_table.tsv
I think there is some error in line 20 of my Snakefile about rule all.
20 rule all:
21 input:
22 expand(config["path"]["root"]+"/"+config["folder"]["concoct"]+"/{focal}/cov/coverage_table.tsv", focal = focal)
23 message:
24 """
25 WARNING: Be very careful when adding/removing any lines above this message
Could you give me some advice on how to solve this problem?
Best regards, Young-Ho
Hi Young-Ho,
Thank you for reporting back. Here are some of my ideas:
Perhaps you tried to create the table before each cross-mapping finished; can you check that each of the individual mapping operations completed successfully?
Can you check the outputs of the kallisto2concoct, crossMapSeries, and crossMapParallel rules in your Snakefile? You will have to unsilence the silenced line in the output field.
The line in the output field is silenced by default to avoid conflicting/identical output rule definitions, e.g. to tell the concoct rule whether it should get the coverage table from the crossMapSeries or crossMapParallel-based rules. Therefore you will likely also need to silence the concoct output in the following rule:
e.g.
rule crossMapSeries:
input:
contigs = rules.megahit.output,
reads = f'{config["path"]["root"]}/{config["folder"]["qfiltered"]}'
output:
#concoct = directory(f'{config["path"]["root"]}/{config["folder"]["concoct"]}/{{IDs}}/cov'),
metabat = directory(f'{config["path"]["root"]}/{config["folder"]["metabat"]}/{{IDs}}/cov'),
maxbin = directory(f'{config["path"]["root"]}/{config["folder"]["maxbin"]}/{{IDs}}/cov')
Please let me know if this helps with your problems. Best, Francisco
Hi, Francisco.
Thanks to your advice, I was able to solve the problem.
I modified the Snakefile as you said:
#f'{config["path"]["root"]}/{config["folder"]["concoct"]}/{{focal}}/cov/coverage_table.tsv'
concoct = directory(f'{config["path"]["root"]}/{config["folder"]["concoct"]}/{{IDs}}/cov'),
After that, I could run kallistoIndex > crossMapParallel > kallisto2concoct > concoct > metabat > maxbin successfully.
But when I tried to run binRefine, I faced another problem, with the following error messages:
Building DAG of jobs...
ChildIOException:
File/directory is a child to another output:
('/home2/channelbiome/metaGEM/concoct/sample1/cov', crossMapSeries)
('/home2/channelbiome/metaGEM/concoct/sample1/cov/coverage_table.tsv', kallisto2concoctTable)
The required files shown in the error messages appear to be in the concoct folder as follows, but binRefine did not work.
channelbiome:~/metaGEM/concoct/sample1/cov$ ls -CsF
total 556
556 coverage_table.tsv
Any comments on how to resolve this problem would be appreciated.
Best regards, Young-Ho
Hi Young-Ho,
The fact that the error is pointing towards crossMapSeries makes me think that this is still due to conflicting inputs/outputs between rules of the parallel-mapping workflow and the series-mapping workflow. I would expect that if you properly silenced the concoct output field in the crossMapSeries rule, then Snakemake shouldn't be able to "see" it, and would instead use the crossMapParallel rules for determining dependencies.
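To make the conflict concrete, this is roughly what Snakemake is seeing, based on the paths in your error message (a sketch only; rule bodies elided):
rule crossMapSeries:
    output:
        # parent: a directory output covering concoct/{IDs}/cov
        concoct = directory(f'{config["path"]["root"]}/{config["folder"]["concoct"]}/{{IDs}}/cov'),
rule kallisto2concoctTable:
    output:
        # child: a file inside that same directory
        f'{config["path"]["root"]}/{config["folder"]["concoct"]}/{{focal}}/cov/coverage_table.tsv'
Snakemake will not let one rule's output file live inside another rule's directory output, hence the ChildIOException.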
Just to make sure, could you also double check the following?
1. If you are not running crossMapSeries, remove/delete it from the Snakefile; you can back up/re-clone the Snakefile later to run crossMapSeries in the future.
2. Inspect your coverage_table.tsv file; it should be a table with contigs for rows and samples for columns. Does the file have the correct dimensions?
3. Are you using Snakemake v5.10.0?
Please let me know if this helps! Best, Francisco
Dear Francisco,
Sorry for the late reply. I was busy with other 16S data processing using QIIME2 and Bioconductor, so I could only try your suggestions 1 and 2.
For suggestion 1, I commented out rule crossMapSeries in the Snakefile (lines 390~490), removed all intermediate outputs, and re-executed the tasks from fastp to concoct. I encountered a new error as follows when I tried the metabat task:
Building DAG of jobs...
MissingInputException in line 753 of /home2/channelbiome/metaGEM/Snakefile:
Missing input files for rule metabatCross:
output: /home2/channelbiome/metaGEM/metabat/sample1/sample1.metabat-bins
wildcards: IDs=sample1
affected files:
/home2/channelbiome/metaGEM/metabat/sample1/cov
I could solve the above problem by uncommenting line 683 of the Snakefile (#directory(f'{config["path"]["root"]}/{config["folder"]["metabat"]}/{{IDs}}/{{IDs}}.metabat-bins')), but I met another error as follows:
Running metabat2 ...
terminate called after throwing an instance of 'boost::wrapexcept<boost::program_options::too_many_positional_options_error>'
what(): too many positional options have been specified on the command line
/bin/bash: line 58: 11793 Aborted (core dumped) metabat2 -i $fsampleID.fa $id.sort -s 50000 -v --seed 420 -t 0 -m 1500 -o $(basename $(dirname /home2/channelbiome/metaGEM/metabat/sample3/sample3.metabat-bins))
For suggestion 2, I believe my coverage_table.tsv files have the correct dimensions, as follows:
channelbiome:~/metaGEM/concoct$ wc -l `find . -name coverage_table.tsv`
7718 ./sample1/cov/coverage_table.tsv
10051 ./sample2/cov/coverage_table.tsv
7843 ./sample3/cov/coverage_table.tsv
25612 total
channelbiome:~/metaGEM/concoct$ head -3 ./sample1/cov/coverage_table.tsv
target_id kallisto_coverage_sample1 kallisto_coverage_sample2 kallisto_coverage_sample3
k119_5532-flag=1-multi=5.0000-len=1352.concoct_part_0 6.508876 0.000000 0.000000
k119_22129-flag=1-multi=4.0000-len=1356.concoct_part_0 6.047198 0.000000 0.000000
channelbiome:~/metaGEM/concoct$ head -3 ./sample2/cov/coverage_table.tsv
target_id kallisto_coverage_sample1 kallisto_coverage_sample2 kallisto_coverage_sample3
k119_59674-flag=1-multi=2.0000-len=1040.concoct_part_0 0.769231 1.153846 4.038462
k119_44204-flag=1-multi=3.0000-len=1069.concoct_part_0 2.245089 14.593078 4.677268
channelbiome:~/metaGEM/concoct$ head -3 ./sample3/cov/coverage_table.tsv
target_id kallisto_coverage_sample1 kallisto_coverage_sample2 kallisto_coverage_sample3
k119_39765-flag=1-multi=5.0000-len=1125.concoct_part_0 7.111111 0.000000 0.711111
k119_20930-flag=1-multi=2.0000-len=1196.concoct_part_0 3.344482 0.000000 1.337793
I always appreciate your detailed and kind response.
Best regards, Young-Ho
Dear Young-Ho,
Apologies for my late response.
For suggestion 1, I commented out rule crossMapSeries in the Snakefile (lines 390~490), removed all intermediate outputs, and re-executed the tasks from fastp to concoct. I encountered a new error as follows when I tried the metabat task.
There is no need to re-execute the rules from fastp to concoct, since you have already generated the individual binner results. The next step is to refine them, as you have attempted above. I believe that Snakemake is getting mixed up with rule dependencies between the two mapping workflows, and that is the reasoning behind the suggestion to remove potentially problematic rules that have already finished running or are not needed for the desired mapping mode (i.e. series or parallel). Sorry for these issues; the Snakefile is configured by default for the mapping-in-series workflow. I will consider how to improve this aspect; perhaps it makes sense to have separate Snakefiles for the different mapping modes.
I have found that Snakemake sometimes gets confused and complains about incomplete or missing files when it shouldn't, or tries to re-generate already present files. To avoid this, it can sometimes help to delete the hidden snakemake folder that is automatically generated in your metaGEM directory, e.g. rm -r .snakemake.
Also, I am curious whether these problems are specific to a version of Snakemake. Could you tell me which version you are using? e.g. snakemake --version
Thanks and best wishes, Francisco
Dear Francisco,
My Snakemake version was 7.2.1. I downgraded Snakemake to v5.10.0, and I removed all intermediate outputs and all files in the .snakemake folder as you said, as follows:
rm -rf assemblies/ benchmarks/ concoct/ kallisto* qfiltered/ stats/* tmp/* .snakemake/*
Then I executed all metaGEM tasks starting from fastp. But there was another core dump when metabat2 ran, as follows:
Running metabat2 ...
terminate called after throwing an instance of 'boost::wrapexcept<boost::program_options::too_many_positional_options_error>'
what(): too many positional options have been specified on the command line
/bin/bash: line 57: 2648 Aborted (core dumped) metabat2 -i $fsampleID.fa $id.sort -s 50000 -v --seed 420 -t 0 -m 1500 -o $(basename $(dirname /home2/channelbiome/metaGEM/metabat/sample3/sample3.metabat-bins))
I think I've found a clue to solving this problem.
The version of metabat2 in the metagem conda environment is 2.15 (2020-01-04), while metabat2 in the metawrap conda environment is 2.12.1 (2017-08-31).
When I run metabat2 on the bash command line as follows,
metabat2 -i sample2.fa sample2.metabat-bins.sort -s 50000 -v --seed 420 -t 0 -m 1500 -o sample2
metabat2 in the metagem environment crashes with a core dump, but metabat2 in the metawrap environment works without error.
So, I changed line 697 of the Snakefile from
set +u;source activate {config[envs][metagem]};set -u;
to
set +u;source activate {config[envs][metawrap]};set -u;
After that, bash metaGEM.sh -t metabat -l works without error.
But maxbin does not work properly after metabat, with this error:
Building DAG of jobs...
MissingInputException in line 839 of /home2/channelbiome/metaGEM/Snakefile:
Missing input files for rule maxbinCross:
/home2/channelbiome/metaGEM/maxbin/sample1/cov
Regards, Young-Ho
Dear Young-Ho,
Thank you very much for reporting back with this versioning issue. Indeed, I just checked the tool versions specified in the metaGEM paper (see Table 1 in the methods section), and it shows MetaBAT2 v2.12.1. I also checked the MetaBAT2 Bitbucket repo and found that the issue has been reported, but no solution was provided. However, just so you know, I checked the latest installation of metaGEM on my cluster, and I see that I have been running samples using MetaBAT2 v2.15 without any issues.
and I removed all intermediate outputs and all files in the .snakemake folder as you said
Just to clarify, when I mention intermediate files I am referring to the files in your main metaGEM folder under subfolder names like concoct, assemblies, qfiltered, etc. When I talk about temporary files, those are the ones under subfolders in your tmp/ subdirectory. This temporary directory holds files that were generated by the jobs but are not usually needed downstream. These subfolders can also balloon up to 100s of terabytes, especially for cross-mapping operations, and that is why it is recommended to delete tmp/ files after you confirm that jobs ran successfully.
But maxbin does not work properly after metabat, with this error
It should be noted that the error here is not thrown by maxbin, but rather it is Snakemake that cannot resolve the file dependencies between the different tasks.
I think I see what the problem is here, and apologies for not catching it earlier. As I quoted in my first comment:
For the TARA oceans dataset the binners were used as described above, but only CONCOCT used contig coverage across samples as input, while MetaBAT2 and MaxBin2 only used contig coverage from the assembled sample as input.
Essentially, for the parallel cross-map workflow, you should be using the maxbin and metabat rules, not the maxbinCross and metabatCross rules. To "switch" between these modes you will need to silence the output of the maxbinCross rule and un-silence the output of the maxbin rule. Same for metabat: silence the output of metabatCross and unsilence the output of metabat, as sketched below. See also this PR: https://github.com/franciscozorrilla/metaGEM/pull/98
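For illustration, the metabat side of the switch would look roughly like this (a sketch only; the output path is taken from the error messages above, and the rest of the rule bodies are elided):
rule metabat:
    output:
        # un-silenced: in parallel mode, each sample is binned using only its own coverage
        directory(f'{config["path"]["root"]}/{config["folder"]["metabat"]}/{{IDs}}/{{IDs}}.metabat-bins')
rule metabatCross:
    output:
        # silenced: this output is only used by the series (cross-mapping) workflow
        #directory(f'{config["path"]["root"]}/{config["folder"]["metabat"]}/{{IDs}}/{{IDs}}.metabat-bins')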
It was quite tricky dealing with this issue when we were developing/implementing. You can read a bit more about it in this issue.
How many metagenomic samples do you have, and what type of microbiomes are you looking at? Perhaps you may want to try the series mapping mode?
This could be another potential solution: split your dataset into chunks for cross mapping and binning. For example, if you had 200 samples you could split them into four 50 sample sub-datasets, cross map within those, and then generate bins.
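For example, a quick way to generate such chunks from a list of sample IDs (a sketch assuming one ID per line in a samples.txt file; uses the standard GNU split utility):
# split samples.txt into numbered 50-sample chunks: chunk_00, chunk_01, ...
split -d -l 50 samples.txt chunk_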
I hope this helps! Best wishes, Francisco
Dear Francisco,
Sorry for the late reply. I have to write a research proposal by the 4th of May. I will try the maxbin rule and splitting the dataset into chunks after finishing the proposal.
Our team is trying to collect around 400 samples, so I think it is hard to use the series mapping mode.
Regards, Young-Ho
Dear Mr. Zorrilla,
Thank you for inventing this wonderful tool.
I could go through all the tasks described in https://github.com/franciscozorrilla/unseenbio_metaGEM/README.md using crossMapSeries. But after using crossMapParallel, the concoct task fails with the following error messages:
Are there some other tasks that should be done to use crossMapParallel, such as kallistoIndex or kallisto2concoct? I've heard that crossMapSeries becomes impractical for large datasets (https://github.com/franciscozorrilla/metaGEM/issues/60#issuecomment-870469471). So, I hope to go through all metaGEM tasks using crossMapParallel.
Best regards, Young-Ho