bvalot / pyMLST

whole genome MLST analysis

MLST import or create documentation #4

Closed lskatz closed 2 years ago

lskatz commented 2 years ago

Hi, I do not know what I'm doing wrong with this command. I have a database in ChewBBACA format with one locus per fasta file, and with many alleles in each locus.fasta file. How do I import it? Would appreciate some more extensive documentation and/or examples on this. Thank you!

(pymlst) [gzu2@monolith3 Salmonella_enterica.pyMLST]$ wgMLST import ../Salmonella_enterica.chewbbaca
Error: Database alreadly exists, use --force to override it
lskatz commented 2 years ago

More info, if this helps

(pymlst) [gzu2@monolith3 Salmonella_enterica.pyMLST]$ ls -lh ../Salmonella_enterica.chewbbaca | head
total 4.3G
-rwxrwx---. 1 gzu2 users       7.6K May 13 09:00 INNUENDO_cgMLST-00031717.fasta*
-rwxrwx---. 1 gzu2 users        26K May 13 09:57 INNUENDO_cgMLST-00031718.fasta*
-rwxrwx---. 1 gzu2 users       2.9K May 31  2021 INNUENDO_cgMLST-00031719.fasta*
-rwxrwx---. 1 gzu2 users        34K May 13 05:48 INNUENDO_cgMLST-00031720.fasta*
-rwxrwx---. 1 gzu2 users       1.9K May 13 07:02 INNUENDO_cgMLST-00031721.fasta*
-rwxrwx---. 1 gzu2 users        20K May 13 00:14 INNUENDO_cgMLST-00031722.fasta*
-rwxrwx---. 1 gzu2 users       5.7K May 13 12:06 INNUENDO_cgMLST-00031723.fasta*
-rwxrwx---. 1 gzu2 users       5.8K May 12 23:31 INNUENDO_cgMLST-00031724.fasta*
-rwxrwx---. 1 gzu2 users       7.9K May 12 19:36 INNUENDO_cgMLST-00031725.fasta*
(pymlst) [gzu2@monolith3 Salmonella_enterica.pyMLST]$ tree -d ../Salmonella_enterica.chewbbaca
../Salmonella_enterica.chewbbaca
└── short

1 directory
(pymlst) [gzu2@monolith3 Salmonella_enterica.pyMLST]$ grep -m 3 ">" ../Salmonella_enterica.chewbbaca/INNUENDO_cgMLST-00031717.fasta
>INNUENDO_cgMLST-00031717_1
>INNUENDO_cgMLST-00031717_2
>INNUENDO_cgMLST-00031717_3
bvalot commented 2 years ago

Hello,

pyMLST doesn't use chewBBACA databases; you need to create a new one. For example, you can create a new cgMLST database for Salmonella enterica with this command:

wgMLST import Salmonella_enterica.pymlstdb Salmonella enterica

That creates the cgMLST database from cgmlst.org. You can then add the strains you want to type with the add command.

lskatz commented 2 years ago

Thank you! I will try that! Could you also give an example command(s) on how to create a local database too?

lskatz commented 2 years ago

I still seem to get an error. I don't know if it's my firewall, so how can I troubleshoot it?

(pymlst) [gzu2@monolith3 Salmonella_enterica.pyMLST]$ wgMLST import Salmonella_enterica.pymlstdb Salmonella enterica
Error: Could not access to the server, please verify your internet connection
bvalot commented 2 years ago

Very strange. It seems to be a problem with your internet connection. Can you access this website using your browser: https://www.cgmlst.org/ncs

Otherwise, you can try to create a local database from an existing schema. For this, you need a single FASTA file containing the genes of the schema, with only one allele for each, unlike chewBBACA, which stores all alleles:

wgMLST create Salmonella_enterica.pymlstdb genes.fasta

lskatz commented 2 years ago

Yes I'm able to get to that site with lynx https://www.cgmlst.org/ncs. I think that sometimes our firewall is funny though and we cannot access ftp sites.

Can I create a local database if I have the full fasta files with all alleles?

bvalot commented 2 years ago

No, you need a single FASTA file with only one allele per gene. It is quite easy to write a Python script that builds it from your chewBBACA files.
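For readers looking for a starting point, here is a minimal sketch of such a script. It assumes the layout shown earlier in the thread (one `*.fasta` file per locus, alleles named `<locus>_<n>`) and simply keeps the first allele of each file. The function names, and the choice to name each output record after the locus file, are illustrative assumptions; the thread does not state what header format pyMLST expects.

```python
from pathlib import Path


def first_allele(fasta_path):
    """Return (header, sequence) of the first record in a FASTA file, or None if it has no record."""
    header = None
    seq_lines = []
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    break  # second record reached: stop reading
                header = line[1:]
            elif header is not None:
                seq_lines.append(line)
    if header is None:
        return None
    return header, "".join(seq_lines)


def write_one_allele_per_locus(schema_dir, out_fasta):
    """Write the first allele of every *.fasta file in schema_dir to out_fasta.

    Each record is named after its locus file (assumption, see lead-in).
    Subdirectories such as 'short' are ignored because the glob is not recursive.
    Returns the number of loci written.
    """
    count = 0
    with open(out_fasta, "w") as out:
        for path in sorted(Path(schema_dir).glob("*.fasta")):
            record = first_allele(path)
            if record is None:
                continue  # skip empty files
            _, sequence = record
            out.write(f">{path.stem}\n{sequence}\n")
            count += 1
    return count
```

Calling `write_one_allele_per_locus("../Salmonella_enterica.chewbbaca", "genes.fasta")` would produce the `genes.fasta` file used in the `wgMLST create` command above. Writing the output outside the schema directory avoids picking it up on a later run.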

lskatz commented 2 years ago

Ok I think I understand that. But if I import only one allele per locus, then how do I call other alleles with the new database? Wouldn't I need other alleles in the database?

bvalot commented 2 years ago

No, you don't need them, because the database is automatically extended with the alleles found in your strains.

lskatz commented 2 years ago

Got it, thanks! I'll try this out next chance I get and so I'll close out this ticket for now.

lskatz commented 2 years ago

Thanks! It works now! I was able to query genomes with

(set -e;
  for i in illumina/Salm/validation-dataset/shovill.out/*_1.shovillSpades.fasta; do
    b=$(basename "$i" _1.shovillSpades.fasta);
    wgMLST add --strain "$b" MLST.db/Salmonella_enterica.pyMLST/Salmonella_enterica.pymlstdb "$i";
  done;)
lskatz commented 2 years ago

In some instances I added a genome twice which broke my loop. So it might be useful to have a function to check whether a strain name has already been added to the database.

bvalot commented 2 years ago

There normally is one: it warns you that the strain is already in the database.

You can also remove strains with the "remove" command.
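For batch runs like the loop above, one way to avoid submitting the same strain twice is to keep track of names on the caller's side. The following is only a sketch under that assumption: the log file `added_strains.txt` and the helper names are hypothetical and not part of pyMLST, and the only command used is the `wgMLST add --strain` call already shown in this thread.

```python
import subprocess
from pathlib import Path

# File suffix taken from the loop earlier in the thread.
SUFFIX = "_1.shovillSpades.fasta"


def strain_name(fasta_path, suffix=SUFFIX):
    """Derive the strain name from an assembly path, like `basename $i SUFFIX`."""
    name = Path(fasta_path).name
    return name[: -len(suffix)] if name.endswith(suffix) else Path(fasta_path).stem


def pending_strains(fasta_paths, already_added, suffix=SUFFIX):
    """Return (strain, path) pairs whose strain name has not been seen yet."""
    seen = set(already_added)
    pending = []
    for path in fasta_paths:
        name = strain_name(path, suffix)
        if name in seen:
            continue  # duplicate within this batch or from a previous run
        seen.add(name)
        pending.append((name, str(path)))
    return pending


def add_strains(database, fasta_paths, log_path="added_strains.txt"):
    """Run `wgMLST add` for each new strain and record it in a local log (hypothetical file)."""
    log = Path(log_path)
    already = log.read_text().splitlines() if log.exists() else []
    for name, path in pending_strains(fasta_paths, already):
        subprocess.run(["wgMLST", "add", "--strain", name, database, path], check=True)
        with log.open("a") as handle:
            handle.write(name + "\n")
```

The log only reflects what this wrapper has added; strains added or removed by hand would not be tracked, so pyMLST's own warning remains the authoritative check.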

lskatz commented 2 years ago

Okay cool, thanks!