Issue with pathway tools API limitation and PGDB entries verification

Hello,

I am trying to run M2M on a set of a few hundred bacterial genomes. I am using m2m 1.5.0 with Pathway Tools 25.5 in a conda environment with Python 3.6.

My issue is linked to the recon step and to populating the local PGDB instance with my own genomes. During the process, I start getting error messages about having reached an API limit, usually after 256 genome insertions. Rerunning the same command lets me move forward, but I then get error messages about the PathoLogic files (already present in the PGDB) or about erroneous flat files from mpwt. If I restart with the same outdir and the --clean option, the accessory .log and .lisp files are removed from my input directory, as is the content of my local PGDB instance (ptools-local). The entries inside my outdir are kept, however, and the run then fails with a File Error 17 because the genome is already present.

Basically, I think something goes wrong in the verification step, during either the creation of the flat files or the insertion into the PGDB. It is difficult for me to figure out: the log states that some genomes will be skipped because they are already present, yet I still get an error message about them at the end, probably because three sets of genome entries have to be handled (input dir, output dir, local PGDB).

Could you provide guidelines on how to run the workflow properly while avoiding issues linked to this Pathway Tools API limitation?

Thank you

Hi Adifo,

Errors with Pathway Tools occur quite often. We do our best to prevent them or to help troubleshoot them, but it is not always easy. It could also be an error on our side, through the mpwt dependency.

Could you please share with us some logs that you might have? Cheers

Hi Clémence,

I've overwritten the logs leading to the File Error 17, but I will try to recreate the circumstances. Now that I've figured out most of the issues regarding Prokka output and the new gbff format, I get fewer types of errors, and I managed to obtain all my SBML files in two successive runs.

First, here is the log of the first run, which led to the API limitation error. The input dir gbk contains 404 directories, each with its corresponding .gbk/.gbff file.

Command: m2m workflow -g gbk -s seeds_workflow.sbml -o reconTestOutput -c 10 --clean

m2m_workflow.log: start_workflow.log

Directories count:

echo -ne "Input\t"; find ./gbk/ -name 'pathologic.log' | wc -l; echo -ne "Local\t"; find ~/ptools-local/pgdbs/ -name 'gc*' -type d | wc -l; echo -ne "Output\t"; find reconTestOutput/pgdb/ -name 'GC*' -type d | wc -l
Input   404
Local   404
Output  397
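
The counts above show that 7 genomes are missing from the output, but not which ones. A minimal Python sketch for listing them is given here. It assumes that folder names match across the three locations once lowercased, and that local PGDB folders carry a "cyc" suffix; both points are assumptions to check against your own ptools-local. The function names are hypothetical and not part of m2m or mpwt.

```python
from pathlib import Path


def dir_names(root, prefix="gc", strip_suffix=""):
    """Collect lowercased names of directories under root that start with prefix."""
    names = set()
    for path in Path(root).expanduser().rglob("*"):
        name = path.name.lower()
        if path.is_dir() and name.startswith(prefix):
            # Assumption: local PGDB folders are named "<genome>cyc".
            if strip_suffix and name.endswith(strip_suffix):
                name = name[: -len(strip_suffix)]
            names.add(name)
    return names


def missing_genomes(input_dir, local_pgdbs, output_pgdbs):
    """Report which input genomes lack a log, a local PGDB, or an output PGDB."""
    input_path = Path(input_dir).expanduser()
    inputs = {p.name.lower() for p in input_path.iterdir() if p.is_dir()}
    logged = {p.parent.name.lower() for p in input_path.rglob("pathologic.log")}
    local = dir_names(local_pgdbs, strip_suffix="cyc")
    output = dir_names(output_pgdbs)
    return {
        "no_pathologic_log": sorted(inputs - logged),
        "not_in_local": sorted(inputs - local),
        "not_in_output": sorted(inputs - output),
    }


if __name__ == "__main__":
    report = missing_genomes("gbk", "~/ptools-local/pgdbs", "reconTestOutput/pgdb")
    for label, names in report.items():
        print(label, len(names), *names, sep="\t")
```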

Second, rerun to complete the failing ones

Command: m2m workflow -g gbk -s seeds_workflow.sbml -o reconTestOutput -c 10

m2m_workflow.log: second.log

Directories count:

echo -ne "Input\t"; find ./gbk/ -name 'pathologic.log' | wc -l; echo -ne "Local\t"; find ~/ptools-local/pgdbs/ -name 'gc*' -type d | wc -l; echo -ne "Output\t"; find reconTestOutput/pgdb/ -name 'GC*' -type d | wc -l
Input   379
Local   404
Output  404

In this case, I managed to finish the workflow, so I'm fine with the result. Having fixed the content of the gbk files beforehand probably helped (e.g. Locus line format errors, missing db_xref entries, or erroneous taxon_id values). However, I've lost some PathoLogic logs in my input dir, and mpwt still displayed several errors when trying to rerun on genomes that were tagged as already present.

From what I remember of the m2m code (reconstruction.py), the full list of input genomes is compared to the PGDB content. The mpwt process is skipped only if the two sets match fully. If the PGDB set is incomplete, the whole input dir goes to mpwt without removing the genomes already processed. Maybe adding a skip list to the multiprocess_pwt function could help.
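
Until something like a skip list exists, the same effect could be approximated from the outside by handing mpwt only the genomes that are not yet in the output folder. The sketch below is a workaround under stated assumptions, not how m2m or mpwt work internally: `pending_genomes` and `stage_pending` are hypothetical helper names, output folders are assumed to carry exactly the same names as the input folders, and whether mpwt accepts symlinked genome folders is untested (copying the folders would be the fallback).

```python
from pathlib import Path


def pending_genomes(input_dir, output_pgdb_dir):
    """Names of input genome folders with no matching folder in the output PGDB dir."""
    output = Path(output_pgdb_dir)
    done = {p.name for p in output.iterdir() if p.is_dir()} if output.is_dir() else set()
    return sorted(
        p.name for p in Path(input_dir).iterdir() if p.is_dir() and p.name not in done
    )


def stage_pending(input_dir, output_pgdb_dir, staging_dir):
    """Symlink only the unprocessed genome folders into staging_dir; return their names."""
    staging = Path(staging_dir)
    staging.mkdir(parents=True, exist_ok=True)
    names = pending_genomes(input_dir, output_pgdb_dir)
    for name in names:
        link = staging / name
        # Skip links left over from an earlier call, including broken ones.
        if not (link.is_symlink() or link.exists()):
            link.symlink_to((Path(input_dir) / name).resolve(), target_is_directory=True)
    return names


if __name__ == "__main__":
    todo = stage_pending("gbk", "reconTestOutput/pgdb", "gbk_pending")
    print(f"{len(todo)} genome(s) staged in gbk_pending")
```

The staging folder would then be the one passed to the reconstruction step, which keeps the original input directory untouched by the selection itself.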

Hi @Adifo,

As you have already found a solution, I will just comment on what the error was.

During the inference by Pathway Tools, there is a step that loads NCBI citations. With multiprocessing, Pathway Tools sometimes sends too many queries to the NCBI server (we have X Pathway Tools processes that can query the NCBI, X being the number of CPUs given with the -c option).

And it can lead to the following error:

    Fatal error: XML not well-formed - unrecognized content '{"error":"API rate limit exceeded","api-'
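
To find out which genomes were hit by this error after a run, the logs can be searched for the message. The sketch below assumes the message ends up in each genome's pathologic.log inside the input folder, which is an assumption to verify on your own logs; `rate_limited_genomes` is a hypothetical helper, not part of mpwt.

```python
from pathlib import Path

# Substring taken from the error message quoted above.
RATE_LIMIT_MARKER = "API rate limit exceeded"


def rate_limited_genomes(input_dir, log_name="pathologic.log"):
    """Return the genome folder names whose log contains the rate-limit message."""
    hits = []
    for log in sorted(Path(input_dir).rglob(log_name)):
        # errors="replace" avoids a crash on logs with unexpected encodings.
        if RATE_LIMIT_MARKER in log.read_text(errors="replace"):
            hits.append(log.parent.name)
    return hits


if __name__ == "__main__":
    for name in rate_limited_genomes("gbk"):
        print(name)
```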

There is an option to avoid loading the NCBI citations (in the ptools-init.dat file located inside the ptools-local folder) named ###Batch-PathoLogic-Download-Pubmed-Entries?, but I never got it to work.

For your second issue, Metage2Metabo indeed only checks whether the output folder contains the PGDBs of all the associated organisms. All other cases are handled by mpwt. But as I modified the behavior of mpwt in version 0.7.0, it is possible that there are some issues.

I will try to investigate this when I have some time. Just to be sure, can you send me your version of mpwt so I can check whether it is greater than or equal to 0.7.0?

Thanks for the insight on the API error. I will keep that in mind when defining the number of threads next time.

Regarding mpwt, I've used mpwt version 0.7.1.

Here is the full list of software from my environment and from the subsequent pip.

From my perspective, you can close this issue but feel free to leave it open and reach out to me for feedback.

micromamba list
List of packages in environment: "/home/adf/micromamba/envs/m2m"

  Name              Version    Build               Channel    
────────────────────────────────────────────────────────────────
  _libgcc_mutex     0.1        conda_forge         conda-forge
  _openmp_mutex     4.5        1_gnu               conda-forge
  ca-certificates   2021.10.8  ha878542_0          conda-forge
  ld_impl_linux-64  2.36.1     hea4e1c9_2          conda-forge
  libffi            3.4.2      h7f98852_5          conda-forge
  libgcc-ng         11.2.0     h1d223b6_14         conda-forge
  libgomp           11.2.0     h1d223b6_14         conda-forge
  libnsl            2.0.0      h7f98852_0          conda-forge
  libstdcxx-ng      11.2.0     he4da1e4_14         conda-forge
  libzlib           1.2.11     h166bdaf_1014       conda-forge
  ncurses           6.3        h9c3ff4c_0          conda-forge
  openssl           1.1.1n     h166bdaf_0          conda-forge
  pip               21.3.1     pyhd8ed1ab_0        conda-forge
  python            3.6.15     hb7a2778_0_cpython  conda-forge
  python_abi        3.6        2_cp36m             conda-forge
  readline          8.1        h46c0cb4_0          conda-forge
  setuptools        58.0.4     py36h5fab9bb_2      conda-forge
  sqlite            3.37.1     h4ff8645_0          conda-forge
  tk                8.6.12     h27826a3_0          conda-forge
  wheel             0.37.1     pyhd8ed1ab_0        conda-forge
  xz                5.2.5      h516909a_1          conda-forge
  zlib              1.2.11     h166bdaf_1014       conda-forge

pip list
Package             Version
------------------- ---------
anyio               3.5.0
appdirs             1.4.4
argcomplete         2.0.0
argh                0.26.2
Arpeggio            2.0.0
async-generator     1.10
attrs               21.4.0
biopython           1.79
bubbletools         0.6.11
certifi             2021.10.8
cffi                1.15.0
chardet             4.0.0
charset-normalizer  2.0.12
clingo              5.5.1
clyngor             0.4.2
clyngor-with-clingo 5.3.post1
cobra               0.24.0
commonmark          0.9.1
contextvars         2.4
dataclasses         0.8
decorator           4.4.2
depinfo             1.7.0
diskcache           5.4.0
docopt              0.6.2
ete3                3.1.1
future              0.18.2
gffutils            0.10.1
graphviz            0.19.1
h11                 0.12.0
httpcore            0.14.7
httpx               0.22.0
idna                3.3
immutables          0.17
importlib-metadata  4.8.3
importlib-resources 5.4.0
iniconfig           1.1.1
lxml                4.8.0
MeneTools           3.2.1
Metage2Metabo       1.5.0
Miscoto             3.1.2
mpmath              1.2.1
mpwt                0.7.1
networkx            2.5.1
numpy               1.19.5
optlang             1.5.2
packaging           21.3
padmet              5.0.1
pandas              1.1.5
phasme              0.0.16
pip                 21.3.1
pluggy              1.0.0
powergrasp          0.8.18
py                  1.11.0
pycparser           2.21
pydantic            1.9.0
pydot               1.4.2
pyfaidx             0.6.4
Pygments            2.11.2
pyparsing           3.0.7
pyPEG2              2.15.2
pytest              7.0.1
python-dateutil     2.8.2
python-libsbml      5.19.2
pytz                2022.1
rfc3986             1.5.0
rich                12.2.0
ruamel.yaml         0.17.21
ruamel.yaml.clib    0.2.6
setuptools          58.0.4
simplejson          3.17.6
six                 1.16.0
sniffio             1.2.0
swiglpk             5.0.5
sympy               1.9
tomli               1.2.3
typing_extensions   4.1.1
wheel               0.37.1
zipp                3.6.0
