CCB-SB / plsdb

PLSDB pipeline to collect bacterial plasmids from NCBI
https://ccb-microbe.cs.uni-saarland.de/plsdb/

Species-specific plasmids? #6

Closed chahatupreti closed 3 years ago

chahatupreti commented 3 years ago

Hello!!

Thank you for this excellent resource. I was wondering whether it is possible to fetch the plasmid sequences for a species of interest using PLSDB. I am interested in E. faecalis and would like to search only the plasmids found in this species. Is this possible?

Thanks in advance!

VGalata commented 3 years ago

Dear @chahatupreti,

Unfortunately, this functionality is not currently implemented in our web server. We might consider adding it in the future. In the meantime, I would download the PLSDB data, filter the accessions by their taxon, and use the resulting list of IDs to filter the FASTA file.

chahatupreti commented 3 years ago

Thanks a lot, Valentina. I guess I will do it the way you suggested. I have two follow-up questions, if you don't mind:

  1. I filtered the TSV file by my taxon ID of interest and now have a list of accessions. How do I get the FASTA sequences corresponding to these accessions? I tried to open the .fna.nsq file, but it is not human-readable.

  2. The latest data download is PLSDB_2020_06_29, so can I assume that it would probably have all complete plasmids of my organism of interest (E. faecalis) up to June 29?

Thank you so much! Chahat

VGalata commented 3 years ago

Dear Chahat,

  1. You have to use BLAST (blastdbcmd) to extract the sequences from the BLAST database files. I would then suggest using seqkit to filter the resulting FASTA. See the code below.
  2. Yes, that is correct. Keep in mind, however, that we try to filter out duplicate records: if two records have the same sequence, only one of them is kept. It should not happen very often, though, that identical records are assigned to different species.

# extract the files
unzip plasmids__2020_06_29__v0.4.1-6-g8ad6422194.zip
# extract seqs from blastdb
blastdbcmd -entry all -db "plsdb.fna" -out "plsdb.fasta"
# get seq accessions for E. faecalis (tax id 1351)
# column 26 contains the species id, column 2 contains the accessions
awk -F"\t" '$26 == 1351 {print $2}' plsdb.tsv > efaecalis.txt
# filter seqs by their id/accession
seqkit grep -f efaecalis.txt plsdb.fasta -o efaecalis.fasta

This gives me 96 records.
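
As a quick sanity check on a result like this, one can compare the number of accessions in the list with the number of FASTA records and print any accession that was not extracted, for example if the header IDs were formatted differently from the accessions in the TSV. The sketch below is not part of the PLSDB pipeline. It runs on toy files with made-up accessions (ACC_0001.1 and so on) and assumes each FASTA header starts with the accession, optionally followed by a space and a description. The variable names `ids` and `fasta` are illustrative only.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Toy stand-ins for efaecalis.txt and efaecalis.fasta (hypothetical accessions).
workdir=$(mktemp -d)
ids="$workdir/ids.txt"
fasta="$workdir/seqs.fasta"
printf 'ACC_0001.1\nACC_0002.1\nACC_0003.1\n' > "$ids"
printf '>ACC_0001.1 toy plasmid\nACGT\n>ACC_0003.1 toy plasmid\nGGCC\n' > "$fasta"

# Count non-empty lines in the accession list and header lines in the FASTA.
n_ids=$(grep -c . "$ids")
n_seqs=$(grep -c '^>' "$fasta")

# Reduce each header to its ID: drop the leading '>' and anything after the first space.
grep '^>' "$fasta" | sed 's/^>//; s/ .*//' > "$workdir/found.txt"

# Accessions in the list that have no exactly matching header ID.
# grep exits non-zero when nothing is printed, hence the '|| true'.
missing=$(grep -vxFf "$workdir/found.txt" "$ids" || true)

echo "requested: $n_ids"
echo "extracted: $n_seqs"
echo "missing:   ${missing:-none}"
```

On the toy data this reports 3 requested accessions, 2 extracted records, and ACC_0002.1 as missing. For real data, `ids` and `fasta` would point at efaecalis.txt and efaecalis.fasta instead of the generated files.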

If you use conda, this is how I installed the required tools:

conda create -n test
conda activate test
conda install -c conda-forge -c bioconda blast=2.6.0 seqkit=0.14.0
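
Before running the extraction, it may help to confirm that the tools are actually on the PATH of the active environment. This is a minimal sketch using only the shell built-in `command -v`. The tool list is taken from the commands above, and the variable name `missing_tools` is just illustrative.

```shell
#!/usr/bin/env bash

# Report whether each required tool is on PATH.
# Missing tools are collected and listed, not treated as a fatal error.
missing_tools=""
for tool in unzip blastdbcmd awk seqkit; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "found:   $tool"
    else
        echo "missing: $tool"
        missing_tools="$missing_tools $tool"
    fi
done

if [ -n "$missing_tools" ]; then
    echo "not on PATH:$missing_tools"
fi
```

If blastdbcmd or seqkit is reported as missing, the conda environment is probably not activated in the current shell.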

Let me know if you have further questions!

Best, Valentina

chahatupreti commented 3 years ago

That did it Valentina! Thank you so much for your prompt and very helpful response!!!