carolzhou / multiPhATE2

multiPhATE with comparative genomics

Error: mdb_env_open: No such file or directory #44

Open ricmedveterinario opened 1 year ago

ricmedveterinario commented 1 year ago

Hello, we are running the software on a server and we are having some problems, as reported below:

Translate nucleic acid sequences
Error: mdb_env_open: No such file or directory
Translate nucleic acid sequences
Traceback (most recent call last):
  File "/home/X/data/apps/multiphate/2.1/multiPhATE2/SequenceAnnotation/phate_sequenceAnnotation_main.py", line 1313, in <module>
    blast.runBlast(myGenome.proteinSet,'protein')
  File "/home/X/data/apps/multiphate/2.1/multiPhATE2/SequenceAnnotation/phate_blast.py", line 958, in runBlast
    self.blast1fasta(fasta,outfile,database,dbName)
  File "/home/X/data/apps/multiphate/2.1/multiPhATE2/SequenceAnnotation/phate_blast.py", line 471, in blast1fasta
    tree.parse(outfile)
  File "/data/apps/miniconda3/4.12.0/envs/multiphate-2.1/lib/python3.9/xml/etree/ElementTree.py", line 580, in parse
    self._root = parser._parse_whole(source)
xml.etree.ElementTree.ParseError: no element found: line 1, column 0
Translate nucleic acid sequences
Translate nucleic acid sequences
Translate nucleic acid sequences
Translate nucleic acid sequences
Translate nucleic acid sequences
Translate nucleic acid sequences
Error: mdb_env_open: No such file or directory
Traceback (most recent call last):
  File "/home/X/data/apps/multiphate/2.1/multiPhATE2/SequenceAnnotation/phate_sequenceAnnotation_main.py", line 1313, in <module>
    blast.runBlast(myGenome.proteinSet,'protein')
  File "/home/X/data/apps/multiphate/2.1/multiPhATE2/SequenceAnnotation/phate_blast.py", line 958, in runBlast
    self.blast1fasta(fasta,outfile,database,dbName)
  File "/home/X/data/apps/multiphate/2.1/multiPhATE2/SequenceAnnotation/phate_blast.py", line 471, in blast1fasta
    tree.parse(outfile)
  File "/data/apps/miniconda3/4.12.0/envs/multiphate-2.1/lib/python3.9/xml/etree/ElementTree.py", line 580, in parse
    self._root = parser._parse_whole(source)
xml.etree.ElementTree.ParseError: no element found: line 1, column 0
Error: mdb_env_open: No such file or directory

What could be the problem and the solution for this?

Thanks,

ricmedveterinario commented 1 year ago

Could this be related to the following? https://stackoverflow.com/questions/59476703/error-mdb-env-open-no-such-file-or-directory-blast-local-database-problem

carolzhou commented 1 year ago

Possibly. You are getting an internal blast error. Sometimes errors happen with certain versions of a third-party blast database. Are you getting this error with any sequence, or only with a particular one?
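
Editorial note: the ParseError in the traceback is consistent with BLAST failing before it wrote any XML. ElementTree reports "no element found: line 1, column 0" when it is handed an empty file. The following is a minimal sketch of a guard around that parse step, assuming a hypothetical helper named `parse_blast_xml`; it is not part of multiPhATE2 and is not how `phate_blast.py` is actually written.

```python
import os
import xml.etree.ElementTree as ET


def parse_blast_xml(outfile):
    """Hypothetical helper: parse a BLAST XML report, or return None if unusable."""
    # A missing or zero-byte report suggests the BLAST process itself failed,
    # for example with "mdb_env_open: No such file or directory".
    if not os.path.isfile(outfile) or os.path.getsize(outfile) == 0:
        return None
    try:
        return ET.parse(outfile)
    except ET.ParseError:
        # Truncated or malformed XML, e.g. BLAST stopped mid-write.
        return None
```

A caller could then log the BLAST failure for that genome and continue, rather than stopping with a traceback that hides the underlying database error.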

ricmedveterinario commented 1 year ago

I'm attaching the error reports in case they help. I believe the errors happen with all sequences.

I mistakenly closed the issue. Sorry. multfat2_M.zip https://github.com/carolzhou/multiPhATE2/files/11044870/multfat2_M.zip

carolzhou commented 1 year ago

What database are you blasting against?

ricmedveterinario commented 1 year ago

This is our .config file. Maybe it helps. multiPhate_Brady_V5.zip

carolzhou commented 1 year ago

My first guess would be that the version of Swissprot you are using is causing problems for Blast. There are several versions of the Swissprot/Uniprot database available from various sources. Some of these come pre-formatted for blast; for others you have to run makeblastdb yourself. These various versions differ in the sub-files that are generated (e.g., .phr, .pin). The specifics of the blast internals are beyond my knowledge base, but I can tell you that some of the formatted Swissprot/Uniprot databases do not work with the latest version of the Blast program that you can acquire using Conda. So, if I were getting the error(s) you see, I would first go back to the README and verify that your multiPhATE2 set-up is consistent with the databases as described there. Also, please run multiPhate.py with "check_databases" = 'true', and post any error messages that the pre-check might generate.

carolzhou commented 1 year ago

Also, double check that you see the protein.faa and gene.fnt files in each genome's output subdirectory, and that those files contain fasta data.
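
Editorial note: the check described above can be sketched as a short script. This is only an illustration; the function names are hypothetical, and the only details taken from the thread are the two file names, `protein.faa` and `gene.fnt`.

```python
from pathlib import Path


def count_fasta_records(path):
    """Return the number of FASTA header lines in a file, or 0 if it is missing."""
    p = Path(path)
    if not p.is_file():
        return 0
    with p.open() as handle:
        return sum(1 for line in handle if line.startswith(">"))


def check_genome_outputs(genome_dir, names=("protein.faa", "gene.fnt")):
    """Map each expected file name to its FASTA record count."""
    return {name: count_fasta_records(Path(genome_dir) / name) for name in names}
```

A count of 0 for either file in a genome's output subdirectory would point to a problem upstream of BLAST.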

ricmedveterinario commented 1 year ago

Thanks for the quick feedback. I confirmed that the databases comply with the README, and the database check did not report any errors. I am attaching the .o output from after the tests finished.

I checked the protein.faa and gene.fnt files and they do contain data. I am attaching a sample of the data from the run as a .zip.

multfat2_M.o.zip NC_017249_1__0_partial_1.zip

ricmedveterinario commented 1 year ago

reopened

carolzhou commented 1 year ago

You might be overflowing memory. Try running a single genome, and see if the code completes without errors. Let me know what happens. Turn off all parallelism for this test.

ricmedveterinario commented 1 year ago

Hello,

I had already disabled the refseq_protein and nr databases (refseq_protein_blast='false' and nr_blast='false'), because they might exceed the capacity of my servers, which is 500 gigabytes of RAM. I have no limit on disk storage, only on RAM.

I cap the resources of every job I run, and whenever the software exceeds the cap, I get a message and the run is aborted. That did not happen this time: I had set a limit of 100 gigabytes of RAM, and the software did not reach even 25 gigabytes.

I need to use parallelism because I work with hundreds of sequences. I have already run the program without parallelism, but it can take weeks to finish. I ran several tests, and the issue occurs only with parallelism.

I believe there is some incompatibility between the parallelism function and blast in some of the scripts, but this is just a hypothesis. To be clear, your program is wonderful, and it is fundamental to my research.

I ran a new test with only 2 sequences, with parallelism and limits of 30 cores and 100 GB of RAM, and the same problem happened. It used less than 1 GB of RAM and only 3 cores. I am attaching a .zip with the server resource usage and the software reports.

Thanks, multfat2_M (tests_23_03_23).zip

carolzhou commented 1 year ago

I developed the code under Python 3.7. In theory, 3.9 should be fully backward compatible, but you never know. It might be worth trying to install 3.7 in your Conda environment instead. Just an idea.

ricmedveterinario commented 1 year ago

Yes I agree.

I'm running the test now without parallelism, with just one fast sequence.

I'll come back with the results and then try reinstalling Python 3.7 in the Conda environment, but this last part of the testing might take a little time.

Thanks,

carolzhou commented 1 year ago

Another thought: use only 1 blast thread, if not already.

carolzhou commented 1 year ago

I have run multiPhATE2 with success in parallel on a Linux cluster with a dozen or so genomes, but I have not tested scaling to as many genomes as you are running at once. The fact that you are getting this error only when running in parallel suggests that the error might arise from data clobbering. If any two (or more) processes are writing to the same places in memory, then this problem might arise when the amount of data becomes large enough to trigger a situation where memory allocations between processes are not being fully isolated during system memory management. This should not happen, of course, but maybe the system is being stressed and there is some issue in the system's memory management that only arises when there are a large number of processes executing at once. Just a thought. Another test you might do is run in parallel with just a few genomes at a time and with just one small blast database set as 'true'. If under that circumstance you do not see the error, then increase the computational load until you do.

ricmedveterinario commented 1 year ago

Yes, I agree. I will first run the test switching from Python 3.9 to 3.7, and afterward I will run the other tests. It is taking a little time because we are backing up some data. Thanks.

ricmedveterinario commented 1 year ago

We are running some tests. We have ruled out an incompatibility with Python 3.9 as the cause. I will post an update soon after more testing.

carolzhou commented 1 year ago

Thank you

carolzhou commented 1 year ago

You might have already seen this bug report, but if not ... https://github.com/rmhubley/RepeatMasker/issues/126. A comment within states that the error could possibly be arising due to the system not supporting file locking. Just a thought.
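
Editorial note: `mdb_env_open` is an LMDB function, so the file-locking idea can be probed directly on the filesystem that holds the databases. The sketch below is a rough test under two assumptions: that POSIX advisory locks are the relevant mechanism, and that the server runs a Unix-like OS. The function name is hypothetical, and a passing result does not guarantee that BLAST will work there.

```python
import fcntl
import os
import tempfile


def supports_posix_locks(directory):
    """Return True if a non-blocking POSIX lock can be taken on a file in `directory`."""
    # Create a throwaway probe file on the filesystem under test.
    fd, probe = tempfile.mkstemp(dir=directory, prefix="lockprobe_")
    try:
        try:
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            # Some network or cluster filesystems reject lock requests.
            return False
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return True
    finally:
        os.close(fd)
        os.remove(probe)
```

Running it against the database directory on the compute nodes, not only on the login node, would be the more informative test, since the two may mount storage differently.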

ricmedveterinario commented 1 year ago

Yes, I agree that it could be the cause, and we'll check it here.

We are working on this problem and believe we are now close to solving it, but we want to confirm first so that we do not give out any incorrect information.

At the beginning of the week, we will send the information we get.

Thank you very much.

ricmedveterinario commented 1 year ago

Hi @carolzhou,

I now have a problem with the Swissprot database. I have been trying to solve it for 3 days and have exhausted every change I could think of.

I read everything I could in your instructions at https://github.com/carolzhou/multiPhATE2/blob/master/README.md.

I modified the database in several ways and I still don't understand what the cause could be.

I went through several scripts of your program, but this problem is beyond me.

I'm sending you some files to see if we can fix it.

Thanks in advance for your help. error.zip https://github.com/carolzhou/multiPhATE2/files/11133709/error.zip

carolzhou commented 1 year ago

Swissprot is frustrating. You have to use the right version of Blast with the right Swissprot source. There are incompatibilities with respect to the files created by makeblastdb. If you run makeblastdb on Swissprot from different sources, you will get different files (i.e., files with different extensions, such as .pin). The inner workings of blast are beyond my knowledge base, so I cannot tell you why different files are created under different circumstances, or why blast works differently depending on the files. I have been where you are now. Eventually I found a Blast that works with a Swissprot from a particular source. I do not have that combo at my fingertips, but I believe it is in the Readme, and I will look again tomorrow to refresh my memory.

carolzhou commented 1 year ago

For me, swissprot blastp works with the combination of BLAST downloaded via Conda, plus the swissprot database downloaded from Uniprot. These are the blast-formatted files that are generated for swissprot from uniprot:

swissprot.fa swissprot.fa.pdb swissprot.fa.phr swissprot.fa.pin swissprot.fa.pot swissprot.fa.psq swissprot.fa.ptf swissprot.fa.pto

Problems arise when trying to use swissprot from other sources. For me, the following file sets, generated via other sources, generate the error message you are seeing:

swissprot.pdb swissprot.phr swissprot.pin swissprot.pog swissprot.pos swissprot.pot swissprot.ppd swissprot.ppi swissprot.psq swissprot.ptf swissprot.pto taxdb.btd taxdb.bti

and...

swissprot.phr swissprot.pin swissprot.psq swissprot.tar.gz

I do not know why different sources of swissprot data would generate different files upon running makeblastdb, nor can I explain why different versions of blast might require different files. But if you use the combination of blast from Conda plus swissprot from uniprot, it should work and generate blast output. At least it works for me that way, as in this example that I just ran:

sp|Q0PAR0|MACB_CAMJE Macrolide export ATP-binding/permease protein MacB OS=Campylobacter jejuni subsp. jejuni serotype O:2 (strain ATCC 700819 / NCTC 11168) OX=192222 GN=macB PE=3 SV=1
Length=641

Score = 32.7 bits (73), Expect = 9.8, Method: Compositional matrix adjust.
Identities = 46/168 (27%), Positives = 67/168 (40%), Gaps = 36/168 (21%)

Query  273  LIFIDEPELYLHPSAINSVRESLVTLSESGYQVIISTHSASMLSAKHAANAIQVCKDSNG  332
            LI  DEP   L   +   V E L  L+E G+ +++ TH       K AA A +V +  +G
Sbjct  158  LILADEPTGALDSKSGIMVLEILQKLNEQGHTIVLVTH-----DPKIAAQAKRVIEIKDG  212

Query  333  TIARKTISEKIEE-LYKSSSPQLHSAFT-LSNSSYLLFSEEVLLVEGKTETNVLYALYKK  390
             I   T  EK +E L   + P+     T L N ++  F                   Y
Sbjct  213  EILSDTKKEKAQEKLILKTMPKEKKTLTLLKNQAFECFK----------------IAYSS  256

Query  391  INGHELN------------PSKICIVAVDGKGSLFKMSQIINAIGIKT  426
            I  H+L              S +C+VA+ G GS  K+ + I  +G  T
Sbjct  257  ILAHKLRSILTMLGIIIGIASVVCVVAL-GLGSQAKVLESIARLGTNT  303

Lambda      K        H        a         alpha
   0.314    0.132    0.370    0.792     4.96

Gapped
Lambda      K        H        a         alpha    sigma
   0.267   0.0410    0.140     1.90     42.6     43.6
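
Editorial note: the file sets listed in this comment can be compared against a local database with a few lines of Python. This is a sketch only; the function name is hypothetical, and the extension list is copied from the working swissprot.fa set reported above. It has not been checked against the BLAST documentation, so treat a "missing" result as a hint, not a diagnosis.

```python
from pathlib import Path

# Extensions copied from the working file set reported in this thread.
WORKING_EXTENSIONS = ("pdb", "phr", "pin", "pot", "psq", "ptf", "pto")


def missing_db_files(db_path, extensions=WORKING_EXTENSIONS):
    """Return the extensions for which `<db_path>.<ext>` does not exist."""
    return [ext for ext in extensions if not Path(f"{db_path}.{ext}").is_file()]
```

Here `db_path` is the same value that goes into the config file, for example the full path ending in `swissprot.fa`. An empty list means the directory matches the set reported to work.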

ricmedveterinario commented 1 year ago

Hi @carolzhou,

I believe I have managed to solve the problem here, with your help of course. Attachment: run multiphate2 complete.zip

Seeing how the swissprot database files should be organized is what helped me solve it.

It was also necessary not to use a blast+ installed outside of Conda. I installed it inside the multiphate2 environment: conda install bioconda::blast=2.13.0

I ran several tests after your message and instructions, and one thing worked, namely this command to format the database: makeblastdb -in uniprot_sprot.fasta -dbtype prot -parse_seqids

The "-parse_seqids" option is what appears to have solved it.

Series of steps I followed:

How to download and format the database in its folder:

wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
gunzip uniprot_sprot.fasta.gz
makeblastdb -in uniprot_sprot.fasta -dbtype prot -parse_seqids

The folder will then look like this:

uniprot_sprot.fasta
uniprot_sprot.fasta.pdb
uniprot_sprot.fasta.phr
uniprot_sprot.fasta.pin
uniprot_sprot.fasta.pjs
uniprot_sprot.fasta.pog
uniprot_sprot.fasta.pos
uniprot_sprot.fasta.pot
uniprot_sprot.fasta.psq
uniprot_sprot.fasta.ptf
uniprot_sprot.fasta.pto

How to set the database path in the .config file:

# Fasta blast database Locations
# Specify locations of your local instances of fasta and blast-formatted databases; use full
# path/filename for the sequence database (e.g., pVOGs.faa), unless your database is segmented;
# in that case, specify the "root" name of the database (e.g., nr). Recall that the VOG protein
# and gene databases need to be pre-processed, in order to search against the "tagged" data
# sets (see README). Kegg requires a licence.
swissprot_database_path='/home/richard/multiphate_2/multiPhATE2/Databases/Swissprot/uniprot_sprot.fasta'

Now I will repeat these steps on the research analysis server that we use and test the program there.

I will update soon.

Thank you for your help.
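
Editorial note: for anyone repeating the formatting step on several machines, it can be wrapped in a small script. The sketch below uses only the makeblastdb flags quoted in the steps above; the function names are hypothetical, and it assumes makeblastdb is on PATH once the Conda environment is activated.

```python
import shutil
import subprocess


def makeblastdb_command(fasta_path):
    """Build the makeblastdb invocation quoted in the steps above."""
    return ["makeblastdb", "-in", str(fasta_path), "-dbtype", "prot", "-parse_seqids"]


def format_database(fasta_path):
    """Run makeblastdb, failing early if it is not on PATH."""
    if shutil.which("makeblastdb") is None:
        raise RuntimeError("makeblastdb not found; is the Conda environment active?")
    # check=True raises CalledProcessError if makeblastdb exits with an error.
    subprocess.run(makeblastdb_command(fasta_path), check=True)
```

Calling `format_database("uniprot_sprot.fasta")` from inside the Swissprot folder would reproduce the manual command.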