BigDataBiology / SemiBin

SemiBin: metagenomics binning with self-supervised deep learning
https://semibin.rtfd.io/

Error: MD5 check failed... & Error: Failed to open sequence file... #73

Closed: SvetlanaUP closed this issue 2 years ago

SvetlanaUP commented 2 years ago

Hi Shaojun, I just ran this, and got an error...

SemiBin single_easy_bin -i fa.gz -b .bam -o output 

2022-01-31 18:44:13,152 - Generate training data.
2022-01-31 18:44:32,483 - Calculating coverage for every sample.
2022-01-31 18:45:47,911 - Processed:CCMD75147712ST.mapped.sorted.bam
2022-01-31 18:45:48,080 - Start generating kmer features from fasta file.
2022-01-31 18:46:38,196 - Running mmseqs and generate cannot-link file.
2022-01-31 18:46:39,604 - Downloading GTDB to /Users/svetlana/.cache/SemiBin/mmseqs2-GTDB.  It will take a while..
#IT WORKED FOR MORE THAN 3 hours
2022-02-01 10:22:41,636 - Download finished. Checking MD5...
Error: MD5 check failed, removing '/Users/svetlana/.cache/SemiBin/mmseqs2-GTDB/GTDB_v95.tar.gz'.

So I ran this instead (to save some time!). It was fast, but I got another error.

SemiBin single_easy_bin -i fa -b .bam -o output --environment human_gut

2022-02-01 11:20:03,427 - Generate training data.
2022-02-01 11:20:03,749 - Calculating coverage for every sample.
2022-02-01 11:21:19,435 - Processed:CCMD75147712ST.mapped.sorted.bam
2022-02-01 11:21:19,605 - Start generating kmer features from fasta file.
2022-02-01 11:22:08,353 - Start binning.
2022-02-01 11:22:09,940 - Calculating depth matrix.
2022-02-01 11:22:10,108 - Edges:143927
2022-02-01 11:22:14,539 - Reclustering.

Error: Failed to open sequence file /var/folders/zp/pmq94j9j04j7sp3z8r6ms5z40000gn/T/tmplu1eaiw_/contigs.faa.faa for reading
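For reference, the "Checking MD5..." step in the first log can be reproduced by hand to see whether a downloaded archive is intact. The following is only a minimal sketch of such a check, not SemiBin's own code: the function names `md5_of_file` and `verify_download` are hypothetical, and the expected checksum is not given anywhere in this thread, so it has to come from wherever the archive is published.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use flat even for multi-gigabyte archives.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True if the file matches expected_md5; otherwise delete it and return False.

    Hypothetical helper: it mirrors the behaviour seen in the log
    ("MD5 check failed, removing ..."), but is not taken from SemiBin.
    """
    path = Path(path)
    if md5_of_file(path) == expected_md5.lower():
        return True
    path.unlink()  # remove the corrupt or truncated file so a retry starts clean
    return False
```

A mismatch after a download that ran for many hours usually points to a truncated or interrupted transfer, so re-downloading on a stable connection is the first thing to try.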

psj1997 commented 2 years ago

This is because some dependencies cannot run on macOS. We will update the docs and the tool to reflect this.

luispedro commented 2 years ago

If this is the ORF finder, can we just switch to another one? Why not even Prodigal?

Looking at this again, I am not sure that FragGeneScan is being called correctly. At https://github.com/BigDataBiology/SemiBin/blob/9d5d5b79fb3caa4509a7aca4969fd78a5244fa7f/SemiBin/utils.py#L266, should it not be -w 1 instead of -w 0?

psj1997 commented 2 years ago

I saw the -w 0 parameter used in other tools (Maxbin2; SolidBin), so I used it. But I agree that we can try switching to Prodigal in a later version.

Sincerely, Shaojun

luispedro commented 2 years ago

My expectation is that it does not make much of a difference in terms of results, but we should check a few samples. In that case, we can make the choice based only on criteria like "easy to install and interface with".
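To make the "easy to interface with" criterion concrete, here is a rough sketch of how a Prodigal call could be wrapped. It is an illustration under assumptions, not SemiBin's implementation: the helper names `prodigal_command` and `predict_orfs` are hypothetical, and the only Prodigal options used are `-i` (input FASTA), `-a` (protein translations output), `-p meta` (metagenome mode) and `-q` (quiet). Checking for the executable and the output file up front gives a clearer failure than a later "Failed to open sequence file" from the next step.

```python
import shutil
import subprocess
from pathlib import Path


def prodigal_command(contigs_fasta, proteins_faa):
    """Build the Prodigal argument list (hypothetical helper, not SemiBin code)."""
    return [
        "prodigal",
        "-i", str(contigs_fasta),  # input contigs in FASTA format
        "-a", str(proteins_faa),   # write predicted protein sequences here
        "-p", "meta",              # metagenome mode
        "-q",                      # suppress the banner on stderr
    ]


def predict_orfs(contigs_fasta, proteins_faa):
    """Run Prodigal and fail early if the tool or its output is missing."""
    if shutil.which("prodigal") is None:
        raise RuntimeError("prodigal was not found on PATH")
    # Without -o, Prodigal prints gene coordinates to stdout; discard them here.
    subprocess.run(
        prodigal_command(contigs_fasta, proteins_faa),
        check=True,
        stdout=subprocess.DEVNULL,
    )
    output = Path(proteins_faa)
    if not output.is_file():
        raise RuntimeError(f"ORF prediction produced no output file: {output}")
    return output
```

Whether the predicted proteins differ enough from FragGeneScan's to change the binning results would still need to be checked on a few samples, as noted above.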

psj1997 commented 2 years ago

I will try it in a later version.

luispedro commented 2 years ago

@psj1997 Can you open a new issue just for substituting the ORF finder? This one mixes a few different things, so it may be better to start clean.