AuReMe / metage2metabo

From annotated genomes to metabolic screening in large scale microbiotas
https://metage2metabo.readthedocs.io
GNU Lesser General Public License v3.0

A few process questions #32

Closed KDeaton closed 2 years ago

KDeaton commented 2 years ago

Hi there! I have a few process questions:

  1. In your paper you have several sets of metabolic targets. Did you run the workflow separately for each list of targets?
  2. Some of the genome annotations pass the build step and an sbml file is created, but they produce several warnings. Is there a way to evaluate annotation quality?
  3. In one of my metagenome sets, a few of the builds fail and then the recon process stops, without continuing to create the sbml files. Is there a command to ignore the failed builds and continue?
cfrioux commented 2 years ago

Hi @KDeaton

  1. If you are referring to the genome collection of 1,520 culturable species from the human gut microbiota, then yes, we ran the workflow separately for each set of targets. But you could run the pipeline on one wide list of targets without separating them; there are no technical restrictions.
  2. Do you have an example of such warnings?
  3. I'll let @ArnaudBelcour answer that third question. A brief and incomplete answer would be to separate the recon step from the rest of the pipeline and directly use the underlying mpwt package.
ArnaudBelcour commented 2 years ago

Hi @KDeaton,

  1. To be more precise about the targets used in the article for the 1,520 culturable species of the human gut: we used the addedvalue (i.e. the metabolites producible by the community but not by individual organisms alone) as the targets for the m2m pipeline, so for this pipeline we had only one list of targets. This allowed us to find the key species associated with these targets. We then classified these metabolites into 6 categories (such as lipids or sugars) and used each of these 6 groups as targets for the m2m_analysis pipeline to visualize the minimal communities with powergraphs. We only split them this way because the addedvalue contained a lot of targets (156 metabolites).

  2. By annotation quality, do you mean the genome annotation or the SBML quality? For the genome annotation, it is quite difficult to assess quality, especially when dealing with metagenomic data. It will depend on the tool used for the annotation, such as Prokka or eggnog-mapper (Prokka being faster but less accurate than eggnog-mapper). There can also be big variations: for example, in our article the genomes range from 500 associated reactions to 2,500 associated reactions, as you can see in subfigure b of this supplementary figure. As for SBML quality, the SBML files created by metage2metabo contain few annotations, so SBML quality-check tools (such as memote) might give them a very low score. Nonetheless, the information they contain is sufficient for m2m.

  3. There is no command to ignore the failed builds with m2m. An easy option, if you want to continue the analysis without the failed builds, is to remove them from the input folder. You can find the failed builds using the resume_inference.tsv file inside the folder m2m_output_folder/pgdb_log: the failed builds have an ERROR in their gene_number column. When you relaunch m2m, it will use the successful builds stored in the ptools-local folder and create the corresponding SBML files.

     Another way is to keep the failed builds and try to create the PGDB files for the successful builds. There is a possible workaround with mpwt. I recently released a new version of mpwt (0.7.0) that refactors how mpwt works. With this version each run is independent, so if one fails the others will still be processed to the end, and mpwt will produce the PGDB files for m2m. If you can't update to this version, older versions of mpwt have an option, --ignore-error, that allows the draft reconstruction to continue even if some builds have failed. In both cases, you have to use the mpwt command

     mpwt -f m2m_input_folder -o m2m_output_folder/pgdb --patho --flat --md -v --cpu X

     adding --ignore-error if you use the second option. But this will only produce PGDB files for the successful builds; it will not create the SBML files. To go further you need to fix the issue with the failed builds or remove them. To find out why some builds failed, take a look at the pathologic.log files located in the input folder. They should contain the errors encountered by Pathway Tools during the inference.
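
The lookup of failed builds in resume_inference.tsv could be sketched in Python as follows. This is only a minimal sketch, not part of m2m or mpwt: it assumes the file is tab-separated with a header row, that the first column identifies the genome, and that a failed build has the literal text ERROR in its gene_number column. The function name find_failed_builds and the sample rows are hypothetical.

```python
import csv
import io


def find_failed_builds(tsv_text, status_column="gene_number"):
    """Return the first-column value of each row whose status column contains ERROR."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader, None)
    if header is None or status_column not in header:
        raise ValueError(f"column {status_column!r} not found in header")
    status_index = header.index(status_column)
    failed = []
    for row in reader:
        # Skip blank or truncated rows rather than failing on them.
        if len(row) <= status_index:
            continue
        if "ERROR" in row[status_index]:
            failed.append(row[0])  # assumption: first column names the genome
    return failed


# Hypothetical sample content; real files may have more or different columns.
sample = (
    "species\tgene_number\tpwt_warning\n"
    "genome_A\t2100\t3\n"
    "genome_B\tERROR\t10\n"
)
print(find_failed_builds(sample))
```

On a real run, the text would come from reading m2m_output_folder/pgdb_log/resume_inference.tsv, and the returned names are the genomes to remove from the input folder before relaunching m2m.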

KDeaton commented 2 years ago

Thanks for both of your responses! I'm all set with questions 1 & 3. For more information on my question 2: when I ran a large metagenome that had a few builds fail, the resume_inference.tsv listed at least 10 in the pwt_warning column. When I ran recon again on a subset of genomes that had successful builds, the process finished successfully and created the sbml files, though I didn't get a resume_inference.tsv file. When I check the pathologic.log, there are several warnings. Here are some examples:

    Warning: The Location "join(1450558..1451127,1..24)" shows a first basepair number that is bigger than the second. This should only happen when crossing the origin.
    Warning: tRNA IPF37_06710 (NIL) may not have had parsable anticodon information. None assigned.
    No reaction or class having EC number 5.6.2.c can be found in the MetaCyc DB.
    Warning: enter-into-lookup-table-internal: Why does acylactivating have 53 associated reactions??

ArnaudBelcour commented 2 years ago

Thanks for the examples, I understand your question better now.

These warnings come from Pathway Tools and they can have multiple meanings:

I print these warnings, but they are mostly there as information for the user. Some warnings may call for manual curation (to decide whether or not to keep the reaction proposed for, or associated with, the gene). For example, in your last example it could be interesting to look at the 53 reactions associated with that enzyme. The issue is that when dealing with hundreds or thousands of reconstructions, we do not have the time to check all of them.
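
When there are too many reconstructions to read each log by hand, one option is to tally the warnings so that the most frequent kinds stand out. The following is a rough sketch under stated assumptions, not something mpwt provides: it assumes each warning sits on one line starting with "Warning:" (as in the examples quoted in this thread), and it groups messages with a simple heuristic that masks quoted strings and digits. The function name tally_warnings is hypothetical.

```python
import re
from collections import Counter


def tally_warnings(log_text):
    """Count lines starting with 'Warning:', grouped by a masked message template."""
    counts = Counter()
    for line in log_text.splitlines():
        line = line.strip()
        if not line.startswith("Warning:"):
            continue
        # Heuristic grouping: hide quoted values and numbers so that
        # warnings differing only by gene ID or position share one key.
        key = re.sub(r'"[^"]*"', '"<...>"', line)
        key = re.sub(r"\d+", "<N>", key)
        counts[key] += 1
    return counts


# First line is quoted from this thread; the second is a hypothetical variant.
sample_log = (
    "Warning: tRNA IPF37_06710 (NIL) may not have had parsable anticodon information. None assigned.\n"
    "Warning: tRNA IPF37_06720 (NIL) may not have had parsable anticodon information. None assigned.\n"
    "No reaction or class having EC number 5.6.2.c can be found in the MetaCyc DB.\n"
)
for message, count in tally_warnings(sample_log).most_common():
    print(count, message)
```

Messages that do not start with "Warning:" (such as the EC number line) are ignored here; the prefix check would need widening to include them.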

As for mpwt not producing a log on your second run, I will look into it to try to find out why it failed.