biologger / speciesprimer

The SpeciesPrimer pipeline is intended to help researchers finding specific primer pairs for the detection and quantification of bacterial species in complex ecosystems.
GNU General Public License v3.0
40 stars 19 forks source link

Pathovar level #5

Open Sirbius opened 4 years ago

Sirbius commented 4 years ago

Hi there, Thanks for this great container, very helpful. However, I have some troubles at species and pathovars level.

1) I need primers for a particular bacterial species. There are 14 genomes available on NCBI and I know that they are quite different between each other and probably represent two different clusters. Indeed, at the end of the pipeline I get primers that would fit for just one of the two cluster, leaving the second one without primers, as shown by the sequences_details.csv file.

Second issue, pathovar level. I guess the pipeline would also fit at the pathovar level but:

2) The first pathovar has only 2 genomes incomplete and It doesn't pass the quality control (even if I said to skip it) becuse there is no gff file (obviously). 28 Jul 2020 08:23:49: > Error 1: 28 Jul 2020 08:23:49: > Error: No .gff files found for QualityControl rRNA

3) The second pathovar has the similar name with the first species and thus, the program encounters this synonim and start the species-level primer design.

Thus, I was wondering if I can somehow define the input target sequence by specifing a path or an accession number (and not only the species) to avoid the 1) and 3) issue. (Got no solution for the issue 2). Thanks

biologger commented 4 years ago

Hi,

The lowest taxonomic level the pipeline recognizes currently is the subspecies level. This is triggered by the keyword "subsp" in the target name and is then also used during parsing of the BLAST results. This requires that also the entries in the BLAST database are named according to that convention, e.g. Lactococcus lactis subsp. lactis or Lactococcus lactis subsp. cremoris and so on.

To manually define the input sequences and synonyms you could use the offline option, add the selected input assemblies in the /primerdesign/species_name/genomic_fna directory, synonyms can be passed to the --exception option. However, this will still search specific primers on the species level, because the specificity check uses the species name to differentiate between target and off-target sequences.

For the issues in detail:

  1. Did all 14 genomes pass quality control according to the sequence_details.csv file? If yes, the pipeline should search for primers matching to both clusters or will not be able to find enough conserved core genes to design primers. I would try to rerun the pipeline with the skip_qc option. Rename or delete the old run directory first! To save time you could skip the download and annotation step by getting the excluded files from /primerdesign/excludedassemblies/species_name directory and copy them to the run directory. However, before you rerun the pipeline remove the Pangenome directory, otherwise the previously excluded assemblies will not be included in the pangenome analysis.

  2. Primer design with only 2 genomes may be a problem in general, increasing the number of genomes increases the chance to find an universal core gene for the cluster of genomes. However, it seems as if there is a problem during annotation using prokka if no gff files can be found. The gff files are also used downstream of the quality control. If you ran the pipeline with the --intermediate (keep intermediate files) option you could check if there are some fatal errors in the errorsummary.val file in the prokka output. Otherwise, in the log file there are the commands used for annotation as in this example: prokka --kingdom Bacteria --outdir GCF_008868375v1_20200409 --genus Brevibacterium --locustag GCF_008868375v1 --prefix GCF_008868375v1_20200409 --cpus 0 genomic_fna/GCF_008868375.1_ASM886837v1_genomic.fna You could try to run these two commands manually in the /primerdesign/species_name directory in the container and see if there is a problem with the input files.

  3. If there is a similar strict naming convention for pathovar level as for subsp. level, differentiation could be implemented in the pipeline. But, from what I've seen so far it is variably used at least in the NCBI "nt" database:

    >JF744990.1 Xanthomonas albilineans pathovar LCB233 alpha amino acid ester hydrolase (aeh) gene, complete cds
    >Z48759.1 X.campestris pathovar vesicatoria 3240 (NCPPB) fimA and fimB genes
    >DQ150584.1 Escherichia coli strain E22 pathovar EPEC Acf (acf) gene, complete cds
    >AB163302.1 Pseudomonas syringae pv. tagetis recA gene for recA protein, partial cds
    >AB163300.1 Pseudomonas syringae pv. syringae recA gene for recA protein, partial cds
    >AB163298.1 Pseudomonas syringae pv. spinaceae recA gene for recA protein, partial cds

If the database and the assembly names have the same pattern as Pseudomonas syringae above, I could probably implement this fast because it is the same pattern as for the subspecies level. Alternatively, manual download of sequences and a custom database and adding the pathovar (pv.) keyword could be a solution.

Sirbius commented 4 years ago

Hi there, Thank you for your detailed answer, it gave me a lot of hints to think about. To make everything more clear, I'm working with Pseudomonas avellanae (the one with two clusters, where one. let's call it cluster2, is actually another species but on NCBI still with the old taxonomy), Pseudomonas syringae pv. avellanae (and that's where the synonym problems starts, with only two genomes available) and Pseudomonas syringae pv. coryli (also with only two genomes).

Thanks to your suggestion (and to answer to your questions), I've been trying few things:

  1. Since not all the genomes passed the QC, I deleted the previous folder, add only the genomes belonging to cluster 1 to /primerdesign/Pseudomonas_avellanae/genomic_fna and run again the pipeline with the skip_qc option on. However, It took anyway the genomes belong to cluster 2 to finally end up with zero primers (which could make sense). I used the blastdb nt but I would try also ref-prok-rep. Thus, is there a way to exclude genomes? Maybe moving the cluster2-genomes.fna into /excludedassemblies ?

  2. I understand that 2 genomes are not enough, but that's what we got! I added manually the genomes-fna files into the /primerdesign/Pseudomonas_syringae_pv._coryli/genomic_fna with the skip_qc option on, prokka was fine and now is doing some blast, let's see what we get.

  3. Now you see the problem. Pseudomonas syringae pv. avellanae Vs Pseudomonas avellanae. Everytime I try to start the pipeline for the first one, the second one would get involved too. Maybe one possible solution could be related to the point 1., by adding somewhere the genomes to be excluded or by giving the specific accession numbers you want to use.

Keep you posted with further results, Silvia

biologger commented 4 years ago

Hi, This is quite a complicated case... Problems:

  1. For the NCBI taxonomy Pseudomonas syringae pv. avellanae is a heterotypic synonym of Pseudomonas avellanae. The automatic download will not work.
  2. The runs now will not yield specific primers because Pseudomonas syringae is seen as target.
  3. There could be naming problems in the BLAST database

Since the pipeline was outlined for species specific primers it is not intended to allow primers for subgroups, however this could be implemented at some point.

I have an idea to solve the specificity problem, it could work if I am able to implement the pathovar keyword. I will try this and report back.

biologger commented 4 years ago

Hi, I have implemented the pathovar (pv) keyword in a new beta release. Please pull the beta version and create a new container instance. sudo docker pull biologger/speciesprimer:v2.2-beta.4

sudo docker run -v $HOME/blastdb:/blastdb -v $HOME/primerdesign:/primerdesign \
-p 5000:5000 -p 9001:9001 --name betaprimer biologger/speciesprimer:v2.2-beta.4

example clean run:

# In a new terminal
sudo docker exec -it betaprimer bash
# Update the species_list or run with --nolist option
# Create the species & genome directory
mkdir Pseudomonas_syringae_pv_coryli
mkdir Pseudomonas_syringae_pv_coryli/genomic_fna
# Copy your target genomes into the genomic_fna directory.
# example run:
speciesprimer.py -t Pseudomonas_syringae_pv_coryli --nolist --offline --ignore_qc --email your@mail.com

There can still be problems with sequences in the BLAST database, however at least the target is now defined on the pathovar level. As possible solution then could be to create a custom database as outlined in this example. I hope this helps!

Sirbius commented 4 years ago

Hi there,

Thanks for the new release, I think now is working better than before. I run the container as you suggested but then I set the parameters on the web app.

I think I may have gotten good primers for pv coryli and pv. avellanae but I have to double check the canditate amplicons/templates on BLAST wgs and refseq, since I used the nt database and not all the sequences are present there. Moreover, I downloaded also the refseq database and set the path /blastdb/ref_prok_rep_genomes in the custom db but It gets stuck here: 03 Aug 2020 10:30:04: Please cite SpeciesPrimer if you use any of the results it produces: Dreier M, Berthoud H, Shani N, Wechsler D, Junier P. 2020. SpeciesPrimer: a bioinformatics pipeline dedicated to the design of qPCR primers for the quantification of bacterial species. PeerJ 8:e8544 https://doi.org/10.7717/peerj.854403 Aug 2020 10:30:04: Run: subspecies_handler(Pseudomonas_avellanae , space)03 Aug 2020 10:30:04: > Start log: Pseudomonas_avellanae 2020_08_03 I also tried to create my personal Pseudomonas database (and few other subsets) but got the same problem..It gets stuck at Start log. I see the proper folder inside blastdb/ both on my server and on the container, so it should be reachable. Any hint?

Lastly, going back to the two clusters of Pseudomonas avellanae, I put the fasta belonging to cluster2 inside excludedassemblies/Pseudomonas_avellanae and left the desired ones in /genomic_fna, set the offline mode and no download of new genomes, unfortunately still with the nt database. It should work this way, right?

Thanks a lot