OLC-Bioinformatics / ConFindr

Intra-species bacterial contamination detection
https://olc-bioinformatics.github.io/ConFindr/
MIT License

DB indexing (samtools, kma):: DB location not writable #29

Closed EricDeveaud closed 1 year ago

EricDeveaud commented 2 years ago

Hello, related to #9: I was asked to install ConFindr on our cluster. Our installation scheme is to install software on a shared, read-only NFS drive.

I proceeded with the PyPI tarball, confindr-0.7.4.tar.gz.

Compute nodes do not have internet access, so we need to prepare the databases for our users, and we want to avoid multiple copies of ~/.confidr.

So I set up the databases using confindr_database_setup -o /opt/gensoft/exe/ConFindr/0.7.4/share/ and exported CONFINDR_DB=/opt/gensoft/exe/ConFindr/0.7.4/share/

But when I try to run the example, I get the following error, related to write permission on the CONFINDR_DB location:

rpm_maker:ConFindr/0.7.4 > confindr.py -i example-data -o example-out
  2021-12-08 13:52:50  Welcome to ConFindr 0.7.4! Beginning analysis of your samples... 
  2021-12-08 13:52:50  Did not find rMLST databases, if you want to use ConFindr on genera other than Listeria, Salmonella, and Escherichia, you'll need to download them. Instructions are available at https://olc-bioinformatics.github.io/ConFindr/install/#downloading-confindr-databases

  2021-12-08 13:52:50  Beginning analysis of sample example... 
  2021-12-08 13:52:50  Checking for cross-species contamination... 
  2021-12-08 13:53:01  Extracting conserved core genes... 
  2021-12-08 13:53:06  Quality trimming... 
  2021-12-08 13:53:08  Detecting contamination... 
[E::fai_build3_core] Failed to open FASTA index /opt/gensoft/exe/ConFindr/0.7.4/share/Escherichia_db_cgderived.fasta.fai : Permission denied
Traceback (most recent call last):
  File "/opt/gensoft/exe/ConFindr/0.7.4/bin/confindr.py", line 11, in <module>
    load_entry_point('confindr==0.7.4', 'console_scripts', 'confindr.py')()
  File "/opt/gensoft/exe/ConFindr/0.7.4/venv/lib/python3.8/site-packages/confindr-0.7.4-py3.8.egg/confindr_src/confindr.py", line 1214, in main
  File "/opt/gensoft/exe/ConFindr/0.7.4/venv/lib/python3.8/site-packages/confindr-0.7.4-py3.8.egg/confindr_src/confindr.py", line 1051, in confindr
  File "/opt/gensoft/exe/ConFindr/0.7.4/venv/lib/python3.8/site-packages/confindr-0.7.4-py3.8.egg/confindr_src/confindr.py", line 691, in find_contamination
  File "/opt/gensoft/exe/ConFindr/0.7.4/venv/lib/python3.8/site-packages/pysam-0.18.0-py3.8-linux-x86_64.egg/pysam/utils.py", line 69, in __call__
    raise SamtoolsError(
pysam.utils.SamtoolsError: 'samtools returned with error 1: stdout=, stderr=[faidx] Could not build fai index /opt/gensoft/exe/ConFindr/0.7.4/share/Escherichia_db_cgderived.fasta.fai\n'

May I suggest that confindr_database_setup also perform the necessary indexing (samtools and KMA) on the downloaded files?
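
For illustration, the pre-indexing suggested above could be sketched roughly as follows. This is not ConFindr's code: the function names index_commands and run_indexing are hypothetical, and the _kma output prefix is only assumed from the file names discussed in this thread. It relies on the standard samtools faidx and kma index -i/-o command lines, and would need to be run once by a user with write access to the database directory.

```python
import subprocess
from pathlib import Path

# Genera bundled without rMLST, as named in this thread.
GENERA = ("Escherichia", "Listeria", "Salmonella")


def index_commands(db_dir, genera=GENERA):
    """Build the samtools and kma command lines for each core-gene FASTA.

    The "<genus>_db_cgderived_kma" prefix is an assumption taken from the
    file names mentioned in the thread, not from ConFindr's source.
    """
    commands = []
    for genus in genera:
        fasta = Path(db_dir) / f"{genus}_db_cgderived.fasta"
        kma_prefix = Path(db_dir) / f"{genus}_db_cgderived_kma"
        # samtools writes <fasta>.fai next to the FASTA file.
        commands.append(["samtools", "faidx", str(fasta)])
        # kma writes its index files under the given output prefix.
        commands.append(["kma", "index", "-i", str(fasta), "-o", str(kma_prefix)])
    return commands


def run_indexing(db_dir, dry_run=True):
    """Print the commands (dry run) or execute them, stopping on failure."""
    for cmd in index_commands(db_dir):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
```

Calling run_indexing("/opt/gensoft/exe/ConFindr/0.7.4/share/") prints the six commands without executing them; passing dry_run=False runs them and requires samtools and kma on the PATH.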

regards

Eric

adamkoziol commented 2 years ago

This seems like a very reasonable suggestion.

EricDeveaud commented 2 years ago

Hello, I'm currently looking at how the databases can be indexed by confindr_database_setup.

If I understand correctly, there are two possible cases: 1) No rMLST: every $GENERA_db_cgderived.fasta, with GENERA in Escherichia, Listeria and Salmonella, should be indexed by samtools (fai) and by KMA under the corresponding name:

Escherichia_db_cgderived.fasta  -> Escherichia_db_cgderived.fasta.fai + Escherichia_db_cgderived_kma.*
Listeria_db_cgderived.fasta  -> Listeria_db_cgderived.fasta.fai + Listeria_db_cgderived_kma.*
Salmonella_db_cgderived.fasta  -> Salmonella_db_cgderived.fasta.fai + Salmonella_db_cgderived_kma.*

Is this correct?

2) With rMLST files present: same as above, and in addition $GENERA_db.fasta and its fai index must be created from the rMLST_combined.fasta file through setup_allelespecific_database:

Escherichia_db.fasta + fai
Listeria_db.fasta + fai
Salmonella_db.fasta + fai 

Is this correct?
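
A small check can make the two cases above concrete before a read-only deployment. The sketch below is only an illustration built on the file names listed in this comment; missing_indexes is a hypothetical helper, not part of ConFindr, and it only tests for the presence of files, not their validity.

```python
from pathlib import Path

# Genera named in this thread.
GENERA = ("Escherichia", "Listeria", "Salmonella")


def missing_indexes(db_dir, with_rmlst=False, genera=GENERA):
    """Return the index artifacts that are absent from db_dir.

    Case 1 (no rMLST): <genus>_db_cgderived.fasta.fai and at least one
    <genus>_db_cgderived_kma.* file per genus.
    Case 2 (with rMLST): additionally <genus>_db.fasta and its .fai.
    """
    db = Path(db_dir)
    missing = []
    for genus in genera:
        fai = db / f"{genus}_db_cgderived.fasta.fai"
        if not fai.is_file():
            missing.append(fai.name)
        # Any file sharing the KMA prefix counts as "indexed" here.
        if not any(db.glob(f"{genus}_db_cgderived_kma.*")):
            missing.append(f"{genus}_db_cgderived_kma.*")
        if with_rmlst:
            for name in (f"{genus}_db.fasta", f"{genus}_db.fasta.fai"):
                if not (db / name).is_file():
                    missing.append(name)
    return missing
```

An empty list from missing_indexes(db_dir) suggests the directory can be mounted read-only for the three bundled genera; with_rmlst=True extends the check to the allele-specific files.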

regards

Eric

adamkoziol commented 2 years ago

It is.

However, since the rMLST database is designed to cover all prokaryotic genera, I am working on a fix by creating a set of all genera stored in the profiles.txt file, and making the allele-specific database for each one. This way, ConFindr won't crash when you process an Enterococcus, for example.
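
As a rough sketch of the "set of all genera" step described above (not the actual fix): the code below assumes profiles.txt is tab-separated with a header row containing a column named genus. Both the layout and the column name are assumptions, and genera_in_profiles is a hypothetical helper.

```python
import csv


def genera_in_profiles(handle, genus_column="genus"):
    """Collect the distinct, non-empty genus names from a profiles table.

    Assumes a tab-separated file with a header row; the column name
    "genus" is an assumption and may differ in the real profiles.txt.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    genera = set()
    for row in reader:
        genus = (row.get(genus_column) or "").strip()
        if genus:
            genera.add(genus)
    return genera
```

With an open file, for example `with open("profiles.txt", newline="") as fh: genera = genera_in_profiles(fh)`, each returned genus would then get its own allele-specific database.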

As you might expect, this will take a very long time - probably at least 24 hours, so you will only want to do this pre-indexing in cases such as yours.

EricDeveaud commented 2 years ago

Hello

Great news. Time won't be a problem.

Thanks

pcrxn commented 1 year ago

Implemented in 241e4d9 for v0.8.1.