KIT-IBG-5 / mdmcleaner

MDMcleaner the assessment, classification and refinement tool for microbial dark matter SAGs and MAGs
GNU General Public License v3.0

An error occured during blastp run with query '-' #55

Open rzhan186 opened 1 year ago

rzhan186 commented 1 year ago

Dear mdmcleaner developers,

I encountered a blastp error while running mdmcleaner clean, which resulted in a runtime error. Could you help me troubleshoot it, please?

I was running mdmcleaner on a compute cluster in a Python 3.11 virtual environment with the full mdmcleaner database. I've attached the log file here.

Meanwhile, I will try the database used in the publication to see whether this error is caused by the database.

Thank you for your help!

mdmcleaner_out.txt

Sincerely, Rui

rzhan186 commented 1 year ago

Just an update: with the reduced-size database, I was able to run mdmcleaner clean successfully, which brought the contamination score down from 14 to 7, as reported by CheckM2. I therefore suspect this might be a database-related issue.

Another question: it took about 9.5 hours with 125 GB of RAM on a compute cluster to decontaminate one MAG with a size of 3.3 million bases, so decontaminating hundreds of MAGs might take an exceptionally long time. Is there any way to speed up the process? For example, will it run faster if I provide multiple MAGs at the same time?

Thank you!

jvollme commented 1 year ago

Sorry for the late reply. Good that you found a workaround for the reference library. However, the error message mentions processes aborting with the signal "Signals.SIGABRT: 6". I am not sure, but I think this indicates that the process was terminated on your server's side (maybe you ran out of disk space, or the queuing system automatically killed your process after a certain time?).

Regarding the number of input genomes: mdmcleaner is indeed **not at all meant to be run separately for each genome**! The -i option takes multiple arguments, so you can supply as many genomes as you want at the same time, e.g. like this: ```mdmcleaner clean -i inputfolder/*.fasta.gz```. Running it individually means it has to load the reference database again each time and re-run the blasts for reference database ambiguities again and again. Running it once for all inputs means these runs and instances are shared.
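For anyone scripting this on a cluster, the single-invocation approach can be sketched in Python as follows. This is only an illustration: the helper name `build_clean_command` and the `*.fasta.gz` pattern are assumptions, and the only mdmcleaner option used is `-i`, as described above.

```python
from pathlib import Path


def build_clean_command(input_dir, pattern="*.fasta.gz"):
    """Build ONE `mdmcleaner clean` call that covers every matching genome.

    `build_clean_command` is a hypothetical helper, not part of mdmcleaner.
    """
    genomes = sorted(str(p) for p in Path(input_dir).glob(pattern))
    if not genomes:
        raise FileNotFoundError(f"no files matching {pattern!r} in {input_dir}")
    # -i takes multiple arguments, so all genomes share a single run
    return ["mdmcleaner", "clean", "-i", *genomes]


# Example call (left commented out so nothing is launched by accident):
#   import subprocess
#   subprocess.run(build_clean_command("inputfolder"), check=True)
```

Passing the list to `subprocess.run` avoids shell globbing, so the same command works inside a batch script where wildcard expansion may behave differently.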

rzhan186 commented 1 year ago

Hi @jvollme, thanks for your reply! I tried re-downloading the updated database, since I couldn't solve the previous database error. While doing this, a new error about a missing md5sum file arose. I looked at the source code of read_gtdb_taxonomy.py and found that the script pulls the database from https://data.ace.uq.edu.au/public/gtdb/data/releases/latest, but unlike the previous releases, this archive folder currently contains no MD5SUM.txt file.

As a workaround, I went to https://data.ace.uq.edu.au/public/gtdb/data/releases/release214/214.1/, manually downloaded the MD5SUM.txt file from that folder, placed it in the mdmcleaner database folder, and removed the "_r214" parts from the file while mdmcleaner makedb was running. Eventually the database build finished successfully, and mdmcleaner clean now works perfectly. Just a heads-up for others who might be experiencing the same issue.
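The manual edit of MD5SUM.txt could also be done with a few lines of Python. This is a rough sketch that assumes the file uses the usual md5sum layout (checksum, whitespace, file name); the function name and the file names in the usage note are made up for illustration.

```python
def strip_release_tag(md5_text, tag="_r214"):
    """Remove `tag` from the file-name column of md5sum-style lines.

    Hypothetical helper; assumes each entry is "<checksum> <file name>".
    """
    cleaned = []
    for line in md5_text.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2:
            checksum, name = parts
            # only the file name is touched, the checksum stays as-is
            cleaned.append(f"{checksum}  {name.replace(tag, '')}")
        else:
            cleaned.append(line)  # blank or unexpected lines are left untouched
    return "\n".join(cleaned)
```

For instance, an entry naming `example_taxonomy_r214.tsv` (an invented file name) would come out naming `example_taxonomy.tsv`, with its checksum unchanged.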

However, there is one thing I am not sure about. When I check the DB_versions.txt file, I get the following:

GTDB version = None
RefSeq release = release218
silva_download_dict = 138.1

I am pretty sure that I have GTDB release 214.1, so why does it not show up in this file? Is it possible that GTDB wasn't downloaded successfully but mdmcleaner clean still managed to run?
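To spot this kind of entry quickly, a small check like the one below may help. It is a sketch that assumes DB_versions.txt holds one "key = value" pair per line; `unset_db_versions` is a made-up helper name. It only flags entries recorded as None and says nothing about whether the underlying download actually succeeded.

```python
def unset_db_versions(text):
    """Return the keys whose recorded value is empty or the literal 'None'."""
    unset = []
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if not sep:
            continue  # skip lines that are not "key = value"
        if value.strip() in ("", "None"):
            unset.append(key.strip())
    return unset


# Values copied from the DB_versions.txt contents quoted above
sample = """GTDB version = None
RefSeq release = release218
silva_download_dict = 138.1"""
print(unset_db_versions(sample))  # ['GTDB version']
```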

I've attached a list of all files in the mdmcleaner database folder here: mdmcleaner_database_files.txt