EMBL-PKU / BASALT

MIT License

Error finding "quality_report.tsv" and "*vamb/clusters.tsv" #19

Open pthieringer opened 3 months ago

pthieringer commented 3 months ago

Hi BASALT Team,

Thanks for creating such a great tool! I have been trying out this program on 9 different samples and have been running BASALT individually for each set of assemblies on my HPC cluster (e.g., each run of BASALT contains the assemblies file with the forward and reverse short reads in a separate directory).

Interestingly, BASALT has been able to complete for 2 of my samples but has issues with the remaining 7. There appears to be some sort of issue with finding the "quality_report.tsv" file which I think is an output of CheckM2? The direct output in my error file indicates FileNotFoundError: [Errno 2] No such file or directory: 'quality_report.tsv'. For reference, the 2 samples that completed have the Final_bestbinset folder and have removed all the erroneous intermediate files while the other samples have a large collection of different files and directories.

I can't quite figure out from the error report what is causing the issue. Would someone be able to help me understand how to address this? I've attached the error and output files from one of the samples that failed (they all look the same, just with different sample names). I also noticed that BASALT seems to have some difficulty trying to move, copy, and remove files that don't seem to exist, as also noted in the error output. Happy to provide more info should you need it!

In addition to this, I tried to run BASALT on the same set of samples with the -e v flag to include the VAMB binning program. However, the error file says FileNotFoundError: [Errno 2] No such file or directory: '/home/pthierin/scratch/DEEP_VS_SHALLOW/07_BINNING/BASALT_WITH_VAMB/G20_300_M/1_G20_300_M-contigs-prefix-formatted-only.fa_vamb/clusters.tsv'. Do I need to install vamb separately from the initial BASALT installation? I am unable to call vamb within my BASALT conda environment.

Sorry for the long message, but I hope this provides enough clarity to try and solve the issue! Thanks!

basalt_399742_3_error.txt basalt_399742_3_output.txt

EMBL-PKU commented 3 months ago

Hi, thank you very much for your patience in using BASALT. I read the text, and the problem is probably caused by a very low mapping ratio, which can leave only a small number of bins in the binning folder. With too few bins, the 'quality_report.tsv' file may never be generated. Please check whether there are any bins in the temporary folders, such as 'BestBinset'. Could you please let us know whether there is a bin in those folders? Thank you very much!
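
One way to run the check suggested above is a short script that lists every folder whose name contains 'BestBinset' and counts the bin files inside it. This is only a minimal sketch and not part of BASALT: the function name summarize_bin_folders, the list of FASTA extensions, and the case-insensitive folder match are all assumptions. It also lists any file whose name ends in quality_report.tsv, so a report saved under a longer, prefixed name is not missed.

```python
from pathlib import Path

# Assumed FASTA extensions for bin files; adjust to match your run.
BIN_EXTENSIONS = (".fa", ".fasta", ".fna")


def summarize_bin_folders(run_dir, name_fragment="BestBinset"):
    """Return {folder_name: (number_of_bins, [quality report file names])}.

    Only direct sub-folders of run_dir whose name contains name_fragment
    (compared case-insensitively) are inspected.
    """
    summary = {}
    for folder in sorted(Path(run_dir).iterdir()):
        if not folder.is_dir():
            continue
        if name_fragment.lower() not in folder.name.lower():
            continue
        files = [p for p in folder.iterdir() if p.is_file()]
        bins = [p for p in files if p.suffix in BIN_EXTENSIONS]
        reports = sorted(
            p.name for p in files if p.name.endswith("quality_report.tsv")
        )
        summary[folder.name] = (len(bins), reports)
    return summary
```

Calling summarize_bin_folders(".") from the BASALT working directory would then show which of the intermediate folders are empty and which already have a quality report.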

As for VAMB, we removed it from the package recently. In my personal opinion, VAMB does not perform as well as the other tools. It is my fault that I have not updated our instructions. If you need to use VAMB, you can run it yourself and use the 'feeding' function of BASALT to merge those bins with the other bins generated by BASALT. I am going to write a small protocol on how to run BASALT, and I hope to put it on GitHub in a couple of days. Sorry for the inconvenience.
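
Before running VAMB separately, it can help to confirm that the vamb command is actually callable from the active environment. The following is a minimal sketch using only the Python standard library; the helper name find_missing_tools is hypothetical and is not part of BASALT or VAMB.

```python
import shutil


def find_missing_tools(tools):
    """Return the names in tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools(["vamb"])
    if missing:
        print("Not callable in this environment:", ", ".join(missing))
    else:
        print("vamb is available on PATH")
```

If vamb is reported as missing, it has to be installed separately (for example in its own environment) before its bins can be produced and fed back into BASALT.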

pthieringer commented 3 months ago

Hello!

Thank you for such a thorough response! I checked the 'BestBinset' folder and there are 14 potential bins in that folder (each has the very long name prefix at the beginning). They appear to come from maxbin2 and concoct, but I see no bins that come from metabat2. There is also a 'quality_report.tsv' file (also with the long prefix in front of it, such as 1_G20_500_M-contigs-prefix-formatted-only.fa_BestBinSet_quality_report.tsv). If there is anything else you need just let me know. Do you have a solution in mind?

As for VAMB, this is good to know! I was only curious about trying VAMB as I have never used this tool before and wanted to compare my results of running BASALT with and without it. Do you have any insight as to why VAMB does not compare well with other tools?

Thanks and looking forward to your response!

EMBL-PKU commented 3 months ago

Hi, please check the following folders to see whether they contain any bins: 'BestBinset_outlier_refined_filtered' or 'BestBinset_outlier_refined_filtered_retrieved'. BASALT first filters out low-quality bins before carrying out contig retrieval. A bin is removed if its completeness is < 35 or its contamination is >= 20; if all the bins are removed this way, the binning process may stop. The '--min-cpn' and '--max-ctn' parameters can be used to set a looser cutoff and keep more low-quality bins. This may help, but you will then find those low-quality bins kept in the folder.
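
To see in advance which bins would fall under this cutoff, the rule can be applied to a quality report by hand. The sketch below is not BASALT's own code: it assumes a tab-separated report with columns named Name, Completeness, and Contamination (as in a default CheckM2 report), and the function name filter_bins and its parameter names are hypothetical stand-ins for the thresholds that '--min-cpn' and '--max-ctn' are described as controlling.

```python
import csv
import io


def filter_bins(report_tsv_text, min_completeness=35.0, max_contamination=20.0):
    """Split bins into (kept, removed) lists of bin names.

    A bin is removed when its completeness is below min_completeness
    or its contamination is at or above max_contamination.
    Column names are assumed, not taken from BASALT.
    """
    kept, removed = [], []
    reader = csv.DictReader(io.StringIO(report_tsv_text), delimiter="\t")
    for row in reader:
        completeness = float(row["Completeness"])
        contamination = float(row["Contamination"])
        if completeness < min_completeness or contamination >= max_contamination:
            removed.append(row["Name"])
        else:
            kept.append(row["Name"])
    return kept, removed
```

Passing the text of a report file (for example via open(path).read()) returns the two lists; if the kept list is empty under the default thresholds, loosening the cutoffs is the option described above.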

For VAMB, we ran a lot of tests. For example, we used the CAMI medium dataset to test binning tools. As you may know, the CAMI medium dataset is derived from 136 microbial genomes. However, in our test VAMB generated > 10000 bins, and its number of high-quality bins was lower than that of metabat2 and the other tools. So we finally decided not to use it. I am not sure about this, but VAMB may perform better at finding strain-level genomes. You can try it for your test.

pthieringer commented 3 months ago

Hello,

I checked both of those folders and they both contain the same 6 bins along with a quality report file within them. Only the 'BestBinset_outlier_refined_filtrate_retrieved' folder contains additional files which look to be a part of the BASALT pipeline - most of which are txt files. Do these folders represent the outlier bins that should not be considered in the final bin set? As in they are the lowest quality and do not meet the cutoff thresholds?

Thank you for the in-depth description of your tests on VAMB and your insights! I appreciate you sharing that information and will keep that in mind moving forward with my dataset.

Thanks for all the help!

pthieringer commented 2 months ago

Hi,

I just wanted to follow up that I tried to run BASALT with VAMB and was able to have it run through most of the pipeline. It seems to get stuck at a point where it cannot find the file Predicted_potential_outlier.txt. Do you know what may be causing this issue?

I've copied along the error and output files in case that might help.

Thanks!

basalt_vamb_410036_1_error.txt basalt_vamb_410036_1_output.txt

vicru93 commented 2 weeks ago

Hi @pthieringer, did you finally use VAMB for your analysis with BASALT? How about your results? Were you able to compare the quantity and quality of the bins?

Best W,