cdanielmachado / carveme

CarveMe: genome-scale metabolic model reconstruction

Mysterious warnings/error when carving some genomes #49

Closed: franciscozorrilla closed this issue 5 years ago

franciscozorrilla commented 5 years ago

Hi Daniel,

I am trying to carve around 3000 genomes I obtained from a complete metagenomics dataset. I was able to carve most of them, but the last ~150 are giving me a hard time. Below is the output log of one such failure. The warnings were also present in the logs of genomes that were carved successfully, and obtaining a new matplotlibrc file as suggested by the warnings did not help. It looks like the .tsv file is created successfully, but no .xml file is generated. I also tried playing around with the number of cores allocated, but it looks like 2 cores is sufficient.

Any idea what the problem could be?

Best wishes, Francisco

[Mon Feb 11 13:26:54 2019] Building DAG of jobs...
[Mon Feb 11 13:26:54 2019] Using shell: /usr/bin/bash
[Mon Feb 11 13:26:54 2019] Provided cores: 2
[Mon Feb 11 13:26:54 2019] Rules claiming more threads will be scaled down.
[Mon Feb 11 13:26:54 2019] Job counts:
[Mon Feb 11 13:26:54 2019]  count   jobs
[Mon Feb 11 13:26:54 2019]  1   carveme
[Mon Feb 11 13:26:54 2019]  1

[Mon Feb 11 13:26:54 2019] rule carveme:
[Mon Feb 11 13:26:54 2019]     input: bins_organized/ERR260149_228.fa
[Mon Feb 11 13:26:54 2019]     output: carvemeOut/ERR260149_228.xml
[Mon Feb 11 13:26:54 2019]     jobid: 0
[Mon Feb 11 13:26:54 2019]     wildcards: fasta=ERR260149_228

/c3se/users/zorrilla/Hebbe/.conda/envs/concoct_env/lib/python2.7/site-packages/matplotlib/colors.py:298: MatplotlibDeprecationWarning: The is_string_like function was deprecated in version 2.1.
  if cbook.is_string_like(arg):
/c3se/users/zorrilla/Hebbe/.conda/envs/concoct_env/lib/python2.7/site-packages/matplotlib/colors.py:351: MatplotlibDeprecationWarning: The is_string_like function was deprecated in version 2.1.
  if not cbook.is_string_like(arg) and cbook.iterable(arg):
/c3se/users/zorrilla/Hebbe/.conda/envs/concoct_env/lib/python2.7/site-packages/matplotlib/colors.py:765: MatplotlibDeprecationWarning: The is_string_like function was deprecated in version 2.1.
  not cbook.is_string_like(colors[0]):

Bad key "hatch.linewidth" on line 38 in
/c3se/users/zorrilla/Hebbe/.conda/envs/concoct_env/lib/python2.7/site-packages/matplotlib/mpl-data/stylelib/_classic_test.mplstyle.
You probably need to get an updated matplotlibrc file from
http://github.com/matplotlib/matplotlib/blob/master/matplotlibrc.template
or from the matplotlib source distribution

Bad key "xtick.top" on line 234 in
/c3se/users/zorrilla/Hebbe/.conda/envs/concoct_env/lib/python2.7/site-packages/matplotlib/mpl-data/stylelib/_classic_test.mplstyle.
You probably need to get an updated matplotlibrc file from
http://github.com/matplotlib/matplotlib/blob/master/matplotlibrc.template
or from the matplotlib source distribution
/c3se/users/zorrilla/Hebbe/.conda/envs/concoct_env/lib/python2.7/site-packages/pandas/core/groupby/groupby.py:4315: FutureWarning: arrays to stack must be passed as a "sequence" type such as list or tuple. Support for non-sequence iterables such as generators is deprecated as of NumPy 1.16 and will raise an error in the future.
  stacked_values = np.vstack(map(np.asarray, values))
[Mon Feb 11 13:34:56 2019] Waiting at most 5 seconds for missing files.
[Mon Feb 11 13:35:01 2019] MissingOutputException in line 23 of /c3se/NOBACKUP/groups/c3-c3se605-17-8/projects_francisco/binning/metabolismPipe/Snakefile:
[Mon Feb 11 13:35:01 2019] Missing files after 5 seconds:
[Mon Feb 11 13:35:01 2019] carvemeOut/ERR260149_228.xml
[Mon Feb 11 13:35:01 2019] This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
[Mon Feb 11 13:35:01 2019] Shutting down, this might take some time.
[Mon Feb 11 13:35:01 2019] Exiting because a job execution failed. Look above for error message
[Mon Feb 11 13:35:01 2019] Complete log: /c3se/NOBACKUP/groups/c3-c3se605-17-8/projects_francisco/binning/metabolismPipe/.snakemake/log/2019-02-11T132654.523552.snakemake.log
cdanielmachado commented 5 years ago

Hi Francisco,

Can you run CarveMe in verbose mode (-v option) to get a better idea of what is happening?
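
For reference, a minimal verbose invocation matching this pipeline's paths (a sketch; substitute your own input and output):

# Reconstruct from a nucleotide FASTA, printing progress to stdout
carve -v --dna bins_organized/ERR260149_228.fa -o carvemeOut/ERR260149_228.xml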

franciscozorrilla commented 5 years ago

Thanks for the quick response! Sure, I'm running it in verbose mode right now. Also, I just updated the entire matplotlib package, and some of the warnings went away. Will post an update soon.

franciscozorrilla commented 5 years ago

Ok I see the problem now:

Running diamond...
diamond blastx -d /c3se/users/zorrilla/Hebbe/.conda/envs/concoct_env/lib/python2.7/site-packages/carveme/data/input/bigg_proteins.dmnd -q bins_organized/ERR260174_6.fa -o bins_organized/ERR260174_6.tsv --more-sensitive --top 10
Loading universe model...
Scoring reactions...
The input genome did not match sufficient genes/reactions in the database.

Would you recommend I just trash these genomes? Or is there some way I can still get models from them?

Thanks!

cdanielmachado commented 5 years ago

You said you built 3000 genomes from a metagenomics dataset. Are these metagenome-assembled genomes (MAGs)? Do you have a way to evaluate their quality?

franciscozorrilla commented 5 years ago

Yeah, basically de novo assembled genomes from metagenomics data. I used CONCOCT for binning and then CheckM to evaluate completeness and contamination. That evaluation actually filtered out something like 90% of the bins, so I was expecting the remaining ones to be good quality. I used cutoffs of 80% completeness and 10% contamination, but maybe those are too lenient?
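
For reference, that filtering step can be a one-liner over a tab-separated CheckM summary. A sketch, assuming completeness and contamination sit in columns 12 and 13 (the column positions and file name are hypothetical; check your own CheckM output):

# Keep bins with >= 80% completeness and <= 10% contamination
awk -F'\t' '$12 >= 80 && $13 <= 10' checkm_summary.tsv > passing_bins.tsv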

cdanielmachado commented 5 years ago

I have never built MAGs myself, so I don't have any experience with this. I would say that with 80% completeness you should get a good enough genome for reconstruction.

Either these genomes have too few metabolic genes, or the organisms are too distantly related to those in the BiGG database to find any homology matching.

Can you check the number of genes in these genomes?
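
As a quick proxy, you could also count the diamond hits in the .tsv file the pipeline already produced (path taken from the log above; one tabular alignment per line):

# Number of BiGG protein hits found for this genome
wc -l bins_organized/ERR260174_6.tsv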

franciscozorrilla commented 5 years ago

I see. Do you have a rough estimate of what the minimum number of genes should be? I checked several of the genomes that failed to be carved, and it looks like they have between 30 and 40 genes... probably too low? It's kind of suspicious that these genomes managed to get past the 80% completeness cutoff.
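
For reference, a rough gene count can be obtained by calling genes and counting FASTA headers. A sketch, assuming prodigal for gene prediction (file names are placeholders):

# Predict protein-coding genes, then count them (one FASTA header per gene)
prodigal -i bins_organized/ERR260149_228.fa -a ERR260149_228.faa -o /dev/null -q
grep -c '^>' ERR260149_228.faa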

cdanielmachado commented 5 years ago

Mycoplasma genitalium supposedly has one of the smallest genomes currently known, and even that has 470 genes...

I guess this is not a problem in CarveMe, so I will close this issue. My suggestion is that you ignore these genomes :)

franciscozorrilla commented 5 years ago

Ok, thanks for the help!

One last question: I just noticed that a genome with 201 genes also failed to be carved. The log shows:

[Mon Feb 11 15:21:18 2019] Building DAG of jobs...
[Mon Feb 11 15:21:18 2019] Using shell: /usr/bin/bash
[Mon Feb 11 15:21:18 2019] Provided cores: 16
[Mon Feb 11 15:21:18 2019] Rules claiming more threads will be scaled down.
[Mon Feb 11 15:21:18 2019] Job counts:
[Mon Feb 11 15:21:18 2019]  count   jobs
[Mon Feb 11 15:21:18 2019]  1   carveme
[Mon Feb 11 15:21:18 2019]  1

[Mon Feb 11 15:21:18 2019] rule carveme:
[Mon Feb 11 15:21:18 2019]     input: bins_organized/ERR260185_45.fa
[Mon Feb 11 15:21:18 2019]     output: carvemeOut/ERR260185_45.xml
[Mon Feb 11 15:21:18 2019]     jobid: 0
[Mon Feb 11 15:21:18 2019]     wildcards: fasta=ERR260185_45

/c3se/users/zorrilla/Hebbe/.conda/envs/concoct_env/lib/python2.7/site-packages/pandas/core/groupby/groupby.py:4315: FutureWarning: arrays to stack must be passed as a "sequence" type such as list or tuple. Support for non-sequence iterables such as generators is deprecated as of NumPy 1.16 and will raise an error in the future.
  stacked_values = np.vstack(map(np.asarray, values))
Running diamond...
diamond blastx -d /c3se/users/zorrilla/Hebbe/.conda/envs/concoct_env/lib/python2.7/site-packages/carveme/data/input/bigg_proteins.dmnd -q bins_organized/ERR260185_45.fa -o bins_organized/ERR260185_45.tsv --more-sensitive --top 10
Loading universe model...
Scoring reactions...
Reconstructing a single model
[Mon Feb 11 16:11:54 2019] Error in rule carveme:
[Mon Feb 11 16:11:54 2019]     jobid: 0
[Mon Feb 11 16:11:54 2019]     output: carvemeOut/ERR260185_45.xml

[Mon Feb 11 16:11:54 2019] RuleException:
[Mon Feb 11 16:11:54 2019] CalledProcessError in line 29 of /c3se/NOBACKUP/groups/c3-c3se605-17-8/projects_francisco/binning/metabolismPipe/Snakefile:
[Mon Feb 11 16:11:54 2019] Command ' set -euo pipefail;  
[Mon Feb 11 16:11:54 2019]         set +u;source activate concoct_env;set -u
[Mon Feb 11 16:11:54 2019]         carve -v --dna bins_organized/ERR260185_45.fa -o carvemeOut/ERR260185_45.xml ' died with <Signals.SIGKILL: 9>.
[Mon Feb 11 16:11:54 2019]   File "/c3se/NOBACKUP/groups/c3-c3se605-17-8/projects_francisco/binning/metabolismPipe/Snakefile", line 29, in __rule_carveme
[Mon Feb 11 16:11:54 2019]   File "/c3se/users/zorrilla/Hebbe/.conda/envs/snakemake_env/lib/python3.6/concurrent/futures/thread.py", line 56, in run
[Mon Feb 11 16:11:54 2019] Shutting down, this might take some time.
[Mon Feb 11 16:11:54 2019] Exiting because a job execution failed. Look above for error message
[Mon Feb 11 16:11:54 2019] Complete log: /c3se/NOBACKUP/groups/c3-c3se605-17-8/projects_francisco/binning/metabolismPipe/.snakemake/log/2019-02-11T152118.628448.snakemake.log
slurmstepd: error: Detected 1 oom-kill event(s) in step 76760.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

This time I assigned 16 cores with a total of around 45 GB RAM, but the job is still being killed by an out-of-memory event.

Any idea how much memory I should be allocating?

Here is a plot showing resource utilization:

[plot: resource utilization over the job's lifetime]

It looks like something is using up a lot of swap memory right before the job gets killed...

cdanielmachado commented 5 years ago

How many reconstructions do you do in a single job?

I created this database with 5587 models using 4 GB per job and only 1 reconstruction per job. They all completed successfully:

https://github.com/cdanielmachado/embl_gems

franciscozorrilla commented 5 years ago

Hmmm, interesting. I was doing 1 reconstruction per job as well, not using the recursive options, just because it's easier to spread the jobs across the cluster with Snakemake this way.

rule carveme:
    input:
        "bins_organized/{fasta}.fa"
    output:
        "carvemeOut/{fasta}.xml"
    shell:
        """
        set +u;source activate concoct_env;set -u
        carve -v --dna {input} -o {output}
        """

Not sure why my jobs are demanding so much more memory...
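
For anyone hitting the same OOM kill, declaring an explicit memory request on the rule may help the scheduler place the job sensibly. A sketch, assuming Snakemake's resources directive is mapped to the scheduler's memory flag (e.g. sbatch --mem) in the cluster configuration:

rule carveme:
    input:
        "bins_organized/{fasta}.fa"
    output:
        "carvemeOut/{fasta}.xml"
    resources:
        mem_mb=8000  # headroom over the ~4 GB per reconstruction Daniel reports above
    shell:
        """
        set +u; source activate concoct_env; set -u
        carve -v --dna {input} -o {output}
        """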

In any case, I will probably just set aside the last 150 genomes that are giving me trouble and just work with the 2800 that I have already generated.

This may be outside the scope of your experience, but have you noticed that carved models seem to fail or skip more than half of the memote tests? I'm not sure if this is generally the case in your experience as well, or if I just had bad MAGs?

Typical test results:


======== 81 failed, 43 passed, 21 skipped, 3 warnings in 18.38 seconds =========
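
For reference, the counts above come from a memote test run. A sketch of a typical invocation, assuming memote's command-line interface (model path is a placeholder):

# Run the full memote test suite against a model
memote run carvemeOut/ERR260149_228.xml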
cdanielmachado commented 5 years ago

I tried a very early version of memote; I don't know how it is performing now.

But if you want to use memote, I would suggest using the --flavor fbc2 option when building models. CarveMe outputs the legacy COBRA format by default (which memote does not like).
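
For example (paths are placeholders, reusing the pipeline's layout):

# Emit SBML with the fbc2 extension instead of the legacy format
carve --dna bins_organized/ERR260149_228.fa -o carvemeOut/ERR260149_228.xml --flavor fbc2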

franciscozorrilla commented 5 years ago

Ah that is good to know! Thanks for the tips and discussion.

EmmettPeng commented 3 weeks ago

@franciscozorrilla Hi Francisco, may I ask what type of sample your MAGs are from, like soil or human gut? When you reconstructed the metabolic models, did you use gap filling? If you did, what media did you use?