A few process questions

KDeaton commented 2 years ago

Hi there! I have a few process questions:

1. In your paper you have several sets of metabolic targets. Did you run the workflow separately for each list of targets?
2. Some of the genome annotations pass the build step and an SBML file is created, but with several warnings. Is there a way to evaluate annotation quality?
3. In one of my metagenome sets, a few of the builds fail and then the recon process stops without going on to create the SBML files. Is there a command to ignore the failed builds and continue?

cfrioux commented 2 years ago

Hi @KDeaton

If you are referring to the collection of genomes from 1,520 culturable species of the human gut microbiota, then yes, we ran the workflow separately for each set of targets. But you could also run the pipeline on one large list of targets without separating them; there are no technical restrictions.
Do you have an example of such warnings?
I'll let @ArnaudBelcour answer that third question. A brief and incomplete answer would be to separate the recon step from the rest of the pipeline and directly use the underlying mpwt package.

ArnaudBelcour commented 2 years ago

Hi @KDeaton,

To be more precise about the targets used in the article for the 1,520 culturable species in the human gut: we used the addedvalue (i.e. the metabolites producible by the community but not by individual organisms alone) as the targets for the m2m pipeline, so for this pipeline we had only one list of targets. This allowed us to find the key species associated with these targets. We then classified these metabolites into 6 different categories (such as lipids or sugars) and used each of these 6 groups as targets for the m2m_analysis pipeline to visualize the minimal communities with powergraphs. We only did this because we had a lot of targets in the addedvalue (156 metabolites).
By annotation quality, do you mean the genome annotation or the SBML quality? For the genome annotation, it is quite difficult to estimate the quality of an annotation, especially when dealing with metagenomic data. It will depend on the tool used for the annotation, such as Prokka or eggNOG-mapper (Prokka being fast but less accurate than eggNOG-mapper). There can also be big variations between genomes (for example, in our article the genomes range from 500 associated reactions to 2,500 associated reactions, as you can see in subfigure b of this supplementary figure). For the SBML quality, the SBML files created by metage2metabo contain few annotations, so SBML quality-check tools (such as memote) might give these files a very low score. Nonetheless, the information they contain is sufficient for m2m.
There is no command to ignore the failed builds with m2m. An easy option, if you want to continue the analysis without the failed builds, is to remove them from the input folder. You can find the failed builds using the resume_inference.tsv file inside the folder m2m_output_folder/pgdb_log: the failed builds have ERROR in their gene_number column. When you relaunch m2m, it will use the successful builds stored in the ptools-local folder and create the corresponding SBML files. Another way is to keep the failed builds and try to create the PGDB files for the successful builds, using a possible workaround with mpwt. I recently released a new version of mpwt (0.7.0) that refactors how mpwt works. With this version each run is independent, so if one fails the others will still be processed to the end, and mpwt will produce the PGDB files for m2m. If you can't update to this version, older versions of mpwt have an option, --ignore-error, that allows the draft reconstruction to continue even if some builds have failed. In both cases, you have to use the mpwt command mpwt -f m2m_input_folder -o m2m_output_folder/pgdb --patho --flat --md -v --cpu X, adding --ignore-error if you use the second option. But this will only produce PGDB files for the successful builds and it will not create the SBML files. To go further you need to fix the issue with the failed builds or remove them. To find out why some builds failed, you can take a look at the pathologic.log files located in the input folder. They should contain the errors encountered by Pathway Tools during the inference.
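
The lookup of failed builds in resume_inference.tsv can be sketched in Python. This is only a minimal sketch, not a feature of m2m or mpwt. It assumes the file is tab-separated with a header row and that the first column holds the build name. The function name find_failed_builds, the "species" column name and the example rows are hypothetical; only the gene_number and pwt_warning column names come from this thread.

```python
import csv
import io


def find_failed_builds(tsv_text):
    """Return the names of builds whose gene_number column contains ERROR.

    Assumptions (not confirmed by the m2m documentation): the content is
    tab-separated with a header row, and the first column names the build.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    gene_col = header.index("gene_number")
    failed = []
    for row in reader:
        # Skip short or empty rows instead of raising an IndexError.
        if len(row) > gene_col and "ERROR" in row[gene_col]:
            failed.append(row[0])
    return failed


# Made-up content, only to illustrate the assumed layout of the file.
example = (
    "species\tgene_number\tpwt_warning\n"
    "genome_A\t2154\t10\n"
    "genome_B\tERROR\t0\n"
)
print(find_failed_builds(example))
```

On a real run, the text would come from reading m2m_output_folder/pgdb_log/resume_inference.tsv, and the listed builds could then be moved out of the input folder before relaunching m2m.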

KDeaton commented 2 years ago

Thanks for both of your responses! I'm all set with questions 1 & 3. For more information on my question 2: when I ran a large metagenome that had a few builds fail, the resume_inference.tsv listed at least 10 in the pwt_warning column. When I ran recon again on a subset of genomes that had successful builds, the process finished successfully and created the SBML files, though I didn't get a resume_inference.tsv file. When I check the pathologic.log, there are several warnings. Here are some examples:

Warning: The Location "join(1450558..1451127,1..24)" shows a first basepair number that is bigger than the second. This should only happen when crossing the origin.
Warning: tRNA IPF37_06710 (NIL) may not have had parsable anticodon information. None assigned.
No reaction or class having EC number 5.6.2.c can be found in the MetaCyc DB.
Warning: enter-into-lookup-table-internal: Why does acylactivating have 53 associated reactions??

ArnaudBelcour commented 2 years ago

Thanks for the examples, I understand your question better now.

These warnings come from Pathway Tools and they can have multiple meanings:

Some of the warnings are only informational. They will not have an impact on the metabolic network. For example, they can be about the consistency between gene location and nucleic sequence (but this can vary according to the codon table used).
Other warnings indicate that some specific information was not found (such as the EC example, which is a strange EC number because a 'correct' EC number contains only digits and no letters). This can come from the format of the information (for example, a wrong format for GO terms or EC numbers). It can be the result of old versions of the tools used to annotate the genomes (which can rely on old versions of the ontologies associated with this information). But it can also happen with current tools, as standardizing this information is quite difficult.
Some warnings inform the user that some data in their input files are incompatible with a step of Pathway Tools. For example, I often have this issue when working with datasets containing only proteins: as I do not have the nucleic sequences, the Hole Filler option will not work on these data.
Some warnings are associated with checks on the metabolic network. The purpose of PathoLogic is to create draft metabolic networks, and because they are drafts they can require manual curation by the user to be sure that the associations are correct. I think your last warning falls into this category (I think the issue is that one enzyme is associated with 53 reactions).

I made mpwt report these warnings, but this is mainly for the user's information. Some warnings may need manual curation (to decide whether or not to keep the reaction proposed for, or associated with, the gene). For example, in your last example it could be interesting to look at the 53 reactions associated with the enzyme. The issue is that when dealing with hundreds or thousands of reconstructions, we do not have the time to check all of them.
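
When there are many reconstructions, a first triage of the log lines can be automated. The sketch below is only an illustration, not a feature of mpwt or Pathway Tools. It groups log lines into rough categories using patterns taken from the example warnings quoted in this thread. The patterns, the category names and the triage_warnings function are all assumptions, and real logs will contain messages that these patterns do not cover.

```python
import re
from collections import Counter

# Patterns derived only from the example warnings quoted in this thread.
# They are illustrative assumptions, not an exhaustive list of messages.
CATEGORIES = [
    ("informational", re.compile(r"first basepair number|anticodon")),
    ("missing_or_malformed_annotation",
     re.compile(r"No reaction or class having EC number")),
    ("needs_manual_curation",
     re.compile(r"Why does .* have \d+ associated reactions")),
]


def triage_warnings(log_lines):
    """Count log lines per category; unmatched warnings go to 'other'."""
    counts = Counter()
    for line in log_lines:
        for name, pattern in CATEGORIES:
            if pattern.search(line):
                counts[name] += 1
                break
        else:
            # Only count unmatched lines that are flagged as warnings.
            if line.lstrip().startswith("Warning:"):
                counts["other"] += 1
    return counts


example_log = [
    'Warning: The Location "join(1450558..1451127,1..24)" shows a first '
    "basepair number that is bigger than the second.",
    "Warning: tRNA IPF37_06710 (NIL) may not have had parsable anticodon "
    "information. None assigned.",
    "No reaction or class having EC number 5.6.2.c can be found in the "
    "MetaCyc DB.",
    "Warning: enter-into-lookup-table-internal: Why does acylactivating "
    "have 53 associated reactions??",
]
print(triage_warnings(example_log))
```

Only the lines counted under needs_manual_curation would then be reviewed by hand, which keeps the manual work limited to the warnings that can actually change the network.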

As for mpwt not producing a log on your second run, I will look into it to try to find out why it failed.

AuReMe / metage2metabo

A few process questions #32