Question about using crossMapParallel

franciscozorrilla / metaGEM

:gem: An easy-to-use workflow for generating context specific genome-scale metabolic models and predicting metabolic interactions within microbial communities directly from metagenomic data

https://franciscozorrilla.github.io/metaGEM/

MIT License

189 stars 41 forks source link

Question about using crossMapParallel #102

Closed yhbae6022 closed 1 year ago

yhbae6022 commented 2 years ago

Dear Mr. Zorrilla,

Thank you for inventing wonderful tool.

I could go through all tasks described in https://github.com/franciscozorrilla/unseenbio_metaGEM/README.md, using crossMapSerial.

But, after using crossMapParallel, concoct task fails with following error messages:

Building DAG of jobs...
MissingInputException in line 625 of /home2/channelbiome/metaGEM/Snakefile:
Missing input files for rule concoct:
    output: /home2/channelbiome/metaGEM/concoct/sample1/sample1.concoct-bins
    wildcards: IDs=sample1
    affected files:
        /home2/channelbiome/metaGEM/concoct/sample1/cov/coverage_table.tsv

Are there some other tasks should be done to use crossMapParallel such as kalistoIndex or kalisto2concoct ?

I've heard that crossMapSerial becomes impractical for large datasets. (https://github.com/franciscozorrilla/metaGEM/issues/60#issuecomment-870469471) So, I hope to go through all metaGEM tasks using crossMapParallel.

Best regards, Young-Ho

franciscozorrilla commented 2 years ago

Dear Young-Ho,

Indeed, in that tutorial I only showcase the simpler crossMapSeries subworkflow. For the crossMapParallel subworkflow you will indeed need to run kallistoIndex and kallisto2concoct. In short, the kallistoIndex rule generates an index file for each assembly, which is then used by the kallisto tool within the crossMapParallel rule. Finally, the kallisto2concoct rule aggregates individual mapping files into tables showing contig coverage across samples. For more details I can point you towards the methods section of the metaGEM paper detailing TARA ocean data analysis, as we used the paralell mapping subworkflow for this dataset.

For the TARA oceans dataset, the kallisto v0.46.1 tool was used for mapping quality controlled paired end reads to assemblies. First, each assembly was cut up into 10 kb chunks using the CONCOCT ‘cut_up_fasta.py’ script with parameters ‘-c 10000 -o 0 -m’. The cut up assembly was then used to generate a kallisto index using the ‘kallisto index’ command with default settings. Next, the ‘kallisto quant’ command was used with the ‘–plaintext’ setting to cross map each sample set of quality controlled paired end reads to each cut up assembly. Finally, the ‘kallisto2concoct.py’ script was used to summarize the mapping results across samples for each set of assembled contigs. For the TARA oceans dataset the binners were used as described above, but only CONCOCT used contig coverage across samples as input, while MetaBAT2 and MaxBin2 only used contig coverage from the assembled sample as input.

Please let me know if you have further questions and thank you for trying out our pipeline! 💎

Best wishes, Francisco

yhbae6022 commented 2 years ago

Francisico, thank you for answer.

I've tried crossMapParallel like following task flows: fastp > megahit > kallistoIndex > crossMapParallel > kallisto2concoct

There was no error until crossMapParallel, but kallisto2concoct did not work properly with following error messages:

Building DAG of jobs...
MissingInputException in line 20 of /home2/channelbiome/metaGEM/Snakefile:
Missing input files for rule all:
    affected files:
        /home2/channelbiome/metaGEM/concoct/sample3/cov/coverage_table.tsv
        /home2/channelbiome/metaGEM/concoct/sample2/cov/coverage_table.tsv
        /home2/channelbiome/metaGEM/concoct/sample1/cov/coverage_table.tsv

I think there is some error in line 20 of my Snakefile about rule all.

     20 rule all:
     21     input:
     22         expand(config["path"]["root"]+"/"+config["folder"]["concoct"]+"/{focal}/cov/coverage_table.tsv", focal = focal)
     23     message:
     24         """
     25         WARNING: Be very careful when adding/removing any lines above this message

Could you give me an advise to solve this problem?

Best regrards, Young-Ho

franciscozorrilla commented 2 years ago

Hi Young-Ho,

Thank you for reporting back. Here are some of my ideas:

Perhaps you tried to create the table before each crossMapping finished, can you check that each of the individual mapping operations was completed successfully?
Can you check the outputs of the kallisto2concoct, crossMapSeries, and crossMapParallel rules in your Snakefile? You will have to unsilence the silenced line in the output field:

https://github.com/franciscozorrilla/metaGEM/blob/b11dd7c2d21267a2d5aea198901e26bfbaa86feb/Snakefile#L598-L623

The line is in the output field is silenced by default to avoid conflicting/identical output rule definitions. E.g. to tell the concoct rule if it should get the coverage table from crossMapSeries or crossMapParallel -based rules. Therefore you will likely also need to silence the concoct output in the following rule:

https://github.com/franciscozorrilla/metaGEM/blob/b11dd7c2d21267a2d5aea198901e26bfbaa86feb/Snakefile#L390-L406

e.g.

 rule crossMapSeries: 
     input: 
         contigs = rules.megahit.output, 
         reads = f'{config["path"]["root"]}/{config["folder"]["qfiltered"]}' 
     output: 
         #concoct = directory(f'{config["path"]["root"]}/{config["folder"]["concoct"]}/{{IDs}}/cov'), 
         metabat = directory(f'{config["path"]["root"]}/{config["folder"]["metabat"]}/{{IDs}}/cov'), 
         maxbin = directory(f'{config["path"]["root"]}/{config["folder"]["maxbin"]}/{{IDs}}/cov')

Just in case you have not seen them, here are some other user-raised issues related to crossMapParallel:

Please let me know if this helps with your problems. Best, Francisco

yhbae6022 commented 2 years ago

Hi, Francisco.

Thanks to your advice, the problem could be solved.

I modified Snakefile as you said:

uncomment line 603 (#f'{config["path"]["root"]}/{config["folder"]["concoct"]}/{{focal}}/cov/coverage_table.tsv')
comment out line 395 (concoct = directory(f'{config["path"]["root"]}/{config["folder"]["concoct"]}/{{IDs}}/cov'),

After that, I could run kallistoIndex > crossMapParallel > kallisto2concoct > concoct > metabat > maxbin successfully.

But, when I tried to run binRefine, I faced another problem with following error messages:

Building DAG of jobs...
ChildIOException:
File/directory is a child to another output:
('/home2/channelbiome/metaGEM/concoct/sample1/cov', crossMapSeries)
('/home2/channelbiome/metaGEM/concoct/sample1/cov/coverage_table.tsv', kallisto2concoctTable)

The required files that were shown in error messages appear to be in the concoct folder as follows, but binRefine did not work.

channelbiome:~/metaGEM/concoct/sample1/cov$ ls -CsF
total 556
556 coverage_table.tsv

Any comment to resolve this problem would be appreciated.

Best regards, Young-Ho

franciscozorrilla commented 2 years ago

Hi Young-Ho,

The fact that the error is pointing towards crossMapSeries makes me think that this is still due to conflicting inputs/outputs between rules of the parallel-mapping-workflow and series-mapping-workflow. I would expect that if you properly silenced the concoct output field in the crossMapSeries rule then Snakemake shouldn't be able to "see" it, and instead use the crossMapParallel rules for determining dependencies.

Just to make sure, could you also double check the following?

Instead of silencing the crossMapSeries remove/delete it from the Snakefile, you can back up/re-clone the Snakefile later to run crossMapSeries in the future.
Check the contents of the coverage_table.tsv file, it should be a table with contigs for rows and samples for columns. Does the file have the correct dimensions?
Check if the individual binner ran succesfully, i.e. there should be sample-specific-MAGs in your concoct, maxbin, and metabat folders. Are there bins in each of the binner folders?
Similarly to issue #103, it is possible that Snakemake behaviour has changed since the development of metaGEM, could you try downgrading to v5.10.0?

Please let me know if this helps! Best, Francisco

yhbae6022 commented 2 years ago

Dear Francisco,

Sorry for late reply.

I was busy with other 16s data processing using QIIME2 and bioconductor. So, I could only try your suggestion 1 and 2.

For suggestion 1, I commented out rule crossMapSeries of Snakefile (line 390~490). And removed all intermediate outputs, re-executed tasks from fastp to concoct. I've encountered a new error like follows when I tried metabat task

Building DAG of jobs...
MissingInputException in line 753 of /home2/channelbiome/metaGEM/Snakefile:
Missing input files for rule metabatCross:
    output: /home2/channelbiome/metaGEM/metabat/sample1/sample1.metabat-bins
    wildcards: IDs=sample1
    affected files:
        /home2/channelbiome/metaGEM/metabat/sample1/cov

I could solve above problem by uncomment line 683 of Snakefile (#directory(f'{config["path"]["root"]}/{config["folder"]["metabat"]}/{{IDs}}/{{IDs}}.metabat-bins'), but I met another error like follows:

Running metabat2 ...
terminate called after throwing an instance of 'boost::wrapexcept<boost::program_options::too_many_positional_options_error>'
  what():  too many positional options have been specified on the command line
/bin/bash: line 58: 11793 Aborted                 (core dumped) metabat2 -i $fsampleID.fa $id.sort -s 50000 -v --seed 420 -t 0 -m 1500 -o $(basename $(dirname /home2/channelbiome/metaGEM/metabat/sample3/sample3.metabat-bins))

For suggestion 2, I guess my coverage_table.tsv has correct dimension like follows:

channelbiome:~/metaGEM/concoct$ wc -l `find . -name coverage_table.tsv`
   7718 ./sample1/cov/coverage_table.tsv
  10051 ./sample2/cov/coverage_table.tsv
   7843 ./sample3/cov/coverage_table.tsv
  25612 total
channelbiome:~/metaGEM/concoct$ head -3 ./sample1/cov/coverage_table.tsv
target_id       kallisto_coverage_sample1       kallisto_coverage_sample2       kallisto_coverage_sample3
k119_5532-flag=1-multi=5.0000-len=1352.concoct_part_0   6.508876        0.000000        0.000000
k119_22129-flag=1-multi=4.0000-len=1356.concoct_part_0  6.047198        0.000000        0.000000
channelbiome:~/metaGEM/concoct$ head -3 ./sample2/cov/coverage_table.tsv
target_id       kallisto_coverage_sample1       kallisto_coverage_sample2       kallisto_coverage_sample3
k119_59674-flag=1-multi=2.0000-len=1040.concoct_part_0  0.769231        1.153846        4.038462
k119_44204-flag=1-multi=3.0000-len=1069.concoct_part_0  2.245089        14.593078       4.677268
channelbiome:~/metaGEM/concoct$ head -3 ./sample3/cov/coverage_table.tsv
target_id       kallisto_coverage_sample1       kallisto_coverage_sample2       kallisto_coverage_sample3
k119_39765-flag=1-multi=5.0000-len=1125.concoct_part_0  7.111111        0.000000        0.711111
k119_20930-flag=1-multi=2.0000-len=1196.concoct_part_0  3.344482        0.000000        1.337793

I always appreciate your detailed and kind response.

Best regards, Young-Ho

franciscozorrilla commented 2 years ago

Dear Young-Ho,

Aplogies for my late response.

For suggestion 1, I commented out rule crossMapSeries of Snakefile (line 390~490). And removed all intermediate outputs, re-executed tasks from fastp to concoct. I've encountered a new error like follows when I tried metabat task

There is no need to re-exectute the rules from fastp to concoct, since you have already generated the individual binner results. The next step is to refine them, as you have attempted above. I believe that Snakemake is getting mixed up with rule dependencies between the two mapping workflows, and that is the reasoning behing the suggestion to remove potentially problematic rules that have already finished running or are not needed for the desired mapping mode (i.e. series or parallel). Sorry for these issues, the Snakefile is configured by default for the mapping in-series workflow. I will consider how to improve this aspect, perhaps it makes sense to have separate Snakefiles for the different mapping modes.

I have found that Snakemake sometimes gets confused and complains about incomplete files or missing files when it shouldn't, or it tries to re-generate already present files. To avoid this, it can sometimes help to delete the hidden snakemake folder that is automatically generated in your metaGEM directory e.g. rm -r .snakemake.

Also I curious if these problems are specific to a version of Snakemake. Could you tell me what version you are using? e.g. snakemake --version

Thanks and best wishes, Francisco

yhbae6022 commented 2 years ago

Dear Francisco,

My snakemake version was 7.2.1 I downgraded snakemake to v5.10.0, and I removed all intermediate outputs and all files in snakemake folder as you said like follows:

rm -rf assemblies/ benchmarks/ concoct/ kallisto* qfiltered/ stats/* tmp/* .snakemake/*

And executed all metaGEM tasks from fastp. But, there was another core dump when metabat2 ran like follows:

Running metabat2 ... 
terminate called after throwing an instance of 'boost::wrapexcept<boost::program_options::too_many_positional_options_error>'
  what():  too many positional options have been specified on the command line
/bin/bash: line 57:  2648 Aborted                 (core dumped) metabat2 -i $fsampleID.fa $id.sort -s 50000 -v --seed 420 -t 0 -m 1500 -o $(basename $(dirname /home2/channelbiome/metaGEM/metabat/sample3/sample3.metabat-bins))

I think I've found some clue to solve this problem.

The version of metabat2 in metagem conda environment is 2.15 (2020-01-04), and metabat2 in metawrap conda environment is 2.12.1 (2017-08-31).

When I try to run metabat in bash command line like follows,

metabat2 -i sample2.fa sample2.metabat-bins.sort -s 50000 -v --seed 420 -t 0 -m 1500 -o sample2

metabat2 in metagem environment crashes with a core dump, but metabat2 in metawrap environment works without error.

So, I changed the line 697 of SnakeFile from set +u;source activate {config[envs][metagem]};set -u; to set +u;source activate {config[envs][metawrap]};set -u;

After that, bash metaGEM.sh -t metabat -l works without error.

But, maxbin does not work properly after metabat with this error:

Building DAG of jobs...
MissingInputException in line 839 of /home2/channelbiome/metaGEM/Snakefile:
Missing input files for rule maxbinCross:
/home2/channelbiome/metaGEM/maxbin/sample1/cov

Regards, Young-Ho

franciscozorrilla commented 2 years ago

Dear Young-Ho,

Thank you very much for reporting back with this versioning issue. Indeed, I just checked the tool versions specified in the metaGEM paper (see Table 1 in methods section), and it shows MetaBAT2 v2.12.1. I also checked the metabat2 bitbucket repo and found that the issue has been reported, but no solution was provided. However, just so you know, I just checked the latest installation of metaGEM on my cluster, and I see that I have been running samples using MetaBAT2 v.2.15 without any issues.

and I removed all intermediate outputs and all files in snakemake folder as you said like follows:

Just to clarify, when I mention intermediate files I am refering to the files in your main metaGEM folder under subfolder names like concoct, assemblies, qfiltered, etc. When I talk about temporary files, those are the ones under subfolders in your tmp/ subdirectory. This temporary directory holds files that were generated by the jobs, but are not usually needed downstream. These subfolders can also balloon up to 100s of terabytes especially for cross-mapping operations, and that is why it is recommended to delete tmp/ files after you confirm that jobs ran successfully.

But, maxbin does not work properly after metabat with this error

It should be noted that the error here is not thrown by maxbin, but rather it is Snakemake that cannot resolve the file dependencies between the different tasks.

I think I see what the problem is here, and apologies for not catching it earlier. As I quoted in my first comment:

For the TARA oceans dataset the binners were used as described above, but only CONCOCT used contig coverage across samples as input, while MetaBAT2 and MaxBin2 only used contig coverage from the assembled sample as input.

Essentially, for the parallel cross map workflow, you should be using the maxbin and metabat rules, not the maxbinCross and metabatCross rules. To "switch" between these modes you will need to silence the ouput of the maxbinCross rule and un-silence the output of the maxbin rule. Same for metabat: silence the output of metabatCross and unsilence the output of metabat. See also this PR https://github.com/franciscozorrilla/metaGEM/pull/98

It was quite tricky dealing with this issue when we were developing/implementing. You can read a bit more about it in this issue.

How many metagenomic samples do you have? and what type of microbiomes are you looking at? Perhaps you may want to try the series mapping mode?

This could be another potential solution: split your dataset into chunks for cross mapping and binning. For example, if you had 200 samples you could split them into four 50 sample sub-datasets, cross map within those, and then generate bins.

I hope this helps! Best wishes, Fracisco

yhbae6022 commented 2 years ago

Dear Francisco,

Sorry for late reply. I should write a research proposal until 4th of May. I will try maxbin rule and split dataset into chunks after finishing the proposal.

Our team are trying to collect around 400 samples. So, I think it is hard to use series mapping mode.

Regards, Young-Ho