Add a batch mode to the database import?

pimarin commented 8 months ago

Hello !

I'm trying to use the pyMLST tool, which seems to be exactly what I need, but I don't understand how to build one database containing several species for a specific analysis (from a public database such as cgmlst or pubmlst). I am wondering whether it is possible to download all the available species schemes for classical MLST or cgMLST. I'm working on a project with hundreds of different species: we start with the available database schemes, then we update them with local schemes, and your tool allows this by letting us update our own database.

Thank you.

bvalot commented 8 months ago

Hello,

I don't really understand what you need. All species covered by import are listed on the dedicated websites:

MLST : https://pubmlst.org/organisms/
cgMLST : https://cgmlst.org/ncs

You can load these databases with the import command.

pimarin commented 7 months ago

Hi @bvalot !

I want to run an analysis like in classical MLST. I have some assemblies from different species and I want to use your tool with a database built from cgmlst.org. But if I try to build a database without giving a species name, your tool asks me to choose from a list.

bvalot commented 7 months ago

Yes, it is a two-step analysis. For example, for Pseudomonas aeruginosa:

First, import the MLST database for your species:

claMLST import my_database.db Pseudomonas aeruginosa

Then search your genome assemblies against this database:

claMLST search my_database.db strain1.fasta strain2.fasta ...
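
For several species, the two steps above could be scripted along the following lines. This is only a sketch: it uses the claMLST import and claMLST search invocations exactly as shown in this thread, while the helper names, the one-database-per-species naming convention and the db_dir parameter are assumptions made for illustration. The import step may still stop and prompt when a species or scheme name is ambiguous.

```python
import subprocess
from pathlib import Path


def build_commands(species_genomes, db_dir="."):
    """Return the import and search argument lists for each species.

    species_genomes maps a species name (e.g. "Pseudomonas aeruginosa")
    to the list of assembly files to search for that species.
    """
    commands = []
    for species, genomes in species_genomes.items():
        # Hypothetical naming convention: one database file per species.
        db = str(Path(db_dir) / (species.replace(" ", "_") + ".db"))
        # Step 1: import the MLST database for this species.
        commands.append(["claMLST", "import", db, *species.split()])
        # Step 2: search the assemblies against that database.
        commands.append(["claMLST", "search", db, *genomes])
    return commands


def run_all(commands):
    """Run each command in order and stop at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

Calling build_commands({"Pseudomonas aeruginosa": ["strain1.fasta", "strain2.fasta"]}) gives the two commands from the example above, with Pseudomonas_aeruginosa.db as the database name; run_all then executes them one after the other.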

marchoeppner commented 4 months ago

Maybe as an extension to this thread - I am building a pipeline for bacterial isolate analysis. The pipeline automatically provisions software via the usual channels; and I do plan to have a full "unattended" installation routine for the various dependencies, including the pyMLST MLST databases.

I have gotten fairly far with mining the pubmlst REST API to get a list of all available schemas and build a list of commands to use with claMLST import. But it still doesn't run fully unattended since it is largely unable to deal with the rare ambiguity.

Say, if I do:

claMLST import -m mlst Mbovis mycoplasma bovis

Even with the -m option specified, it still stops, because it seems to use string matching to find the correct schema, and in this case there are two schemas with "mlst" in the name: 'mlst' and 'mlst (legacy)'. I haven't yet found a way to prevent this. Truth be told, it would be much easier if there were a pre-built list of indices somewhere that I could just download instead of going through the semi-interactive download procedure as currently implemented. Something like

claMLST import all

Worst case would be that I have to rebuild your import function with a little bit more logic to choose available schemas (or just download and build all of them....) and then build the schemas using claMLST create.
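
The "little bit more logic" could look roughly like the sketch below: prefer an exact name match and fall back to substring matching only when nothing matches exactly. This is not pyMLST's actual selection code (the thread only says it seems to use string matching); select_scheme and its arguments are hypothetical names.

```python
def select_scheme(available, wanted):
    """Pick one scheme name from `available` without prompting.

    An exact (case-insensitive) match wins; otherwise a unique
    substring match is accepted; anything else raises LookupError.
    """
    wanted_norm = wanted.strip().lower()

    # Exact match first, so "mlst" no longer collides with "mlst (legacy)".
    exact = [name for name in available if name.strip().lower() == wanted_norm]
    if len(exact) == 1:
        return exact[0]

    # Fall back to substring matching, but only if it is unambiguous.
    partial = [name for name in available if wanted_norm in name.lower()]
    if len(partial) == 1:
        return partial[0]

    raise LookupError(
        f"{wanted!r} matches {len(partial)} schemes: {partial}"
    )
```

With the two Mycoplasma bovis schemes mentioned above, select_scheme(["mlst", "mlst (legacy)"], "mlst") picks 'mlst', and an unattended run can catch LookupError to log and skip the species instead of waiting for input.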

Cheers, Marc

bvalot commented 4 months ago

Hi Marc,

In fact, the claMLST import function is geared more towards interactive import of a specific schema using the current pubmlst API. You can see all entries in the API by omitting the species, but not all of the species returned have an MLST schema:

claMLST import /path/to/database

If you want to prebuild all possible MLST schemas, I think the better way is to use the claMLST create function with downloaded allele and profile files. All are listed here: https://pubmlst.org/data/

I just pushed a new release (2.1.6) that works around some errors when using claMLST create.
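
As a rough sketch of that approach, the snippet below builds one claMLST create command per species from files that have already been downloaded. Several things here are assumptions, not something stated in this thread: the directory layout (one sub-directory per species containing a single profile .txt file and one .fasta file per locus), the helper name, and the argument order DATABASE PROFILE ALLELES..., which should be checked against claMLST create --help.

```python
from pathlib import Path


def build_create_commands(download_dir, db_dir):
    """Build one `claMLST create` argument list per species directory.

    Assumed layout (hypothetical): download_dir/<species>/ holds exactly
    one profile .txt file and one .fasta file per locus.
    """
    commands = []
    for species_dir in sorted(Path(download_dir).iterdir()):
        if not species_dir.is_dir():
            continue
        profiles = sorted(species_dir.glob("*.txt"))
        alleles = sorted(species_dir.glob("*.fasta"))
        # Skip species whose download is incomplete or ambiguous.
        if len(profiles) != 1 or not alleles:
            continue
        db = Path(db_dir) / (species_dir.name + ".db")
        # Assumed argument order: DATABASE PROFILE ALLELES...
        commands.append(
            ["claMLST", "create", str(db), str(profiles[0]), *map(str, alleles)]
        )
    return commands
```

Each returned list can be passed to subprocess.run(argv, check=True). Building the list first makes it easy to review which species were skipped before anything is created.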

bvalot commented 4 months ago

For your problem of MLST schema ambiguity, you can be more precise in the -m option by using quotes:

claMLST import -m "mlst (legacy)" /tmp/Mbovis.db mycoplasma bovis

But in fact, in this example, you cannot automatically import the base mlst schema.
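
When the import is driven from a script rather than typed in a shell, the quoting can be handled as in the sketch below. The import_command helper is a hypothetical name; it only uses the -m option and the argument order shown in the command above. Passing the arguments as a list to subprocess.run keeps "mlst (legacy)" as a single argument, and shlex.join produces the quoted form for writing into a shell script.

```python
import shlex


def import_command(database, species, scheme=None):
    """Build the argument list for `claMLST import`."""
    argv = ["claMLST", "import"]
    if scheme is not None:
        # The scheme name stays one argument even if it contains spaces.
        argv += ["-m", scheme]
    argv += [database, *species.split()]
    return argv


argv = import_command("/tmp/Mbovis.db", "mycoplasma bovis", scheme="mlst (legacy)")
# Shell-safe rendering of the same command, e.g. for a generated script.
print(shlex.join(argv))
```

The list form can go straight to subprocess.run(argv, check=True) with no extra quoting.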

bvalot / pyMLST #21