BinPro / CONCOCT

Clustering cONtigs with COverage and ComposiTion
Other
119 stars 48 forks source link

Bin 0 not analyzed in Checkm #241

Closed choosehappy closed 5 years ago

choosehappy commented 5 years ago

This seems related to #240

I know that the developers of CONCOCT have used CheckM in the past to validate their results, so perhaps this is a new bug?

CONCOCT starts labeling BIN at 0 to:

image

and produces 29 bins:

ajanowcz@sib-pc25:/export/big/ajanowcz/m/ea2/concoct-output$ cat clustering_merged.csv  | cut -f 2 -d , | sort | uniq -c | wc -l
29

When running checkm, only 28 bins are detected: image

And Bin 0 does not appear in the output:

image

as well there is no bin 0 in the bin directory:

1  10  11  12  13  14  15  16  17  18  19  2  20  21  22  23  24  25  26  27  28  3  4  5  6  7  8  9

I can confirm this happens with at least 2 different unrelated samples.

Is this intended or is there something about bin 0? is it a "garbage" bin not intended for analysis usage?

alneberg commented 5 years ago

Thank you @choosehappy, for a very detailed report! Can it be the extract fasta script which is the problem? Could you check whether there is a 0.fa exported from the concoct results to begin with?

choosehappy commented 5 years ago

it does not appear to be for any of my samples (n=4)

alneberg commented 5 years ago

Sorry, what do you mean, does the concoct script extract_fasta_bins.py output the 0.fa or not?

choosehappy commented 5 years ago

it does not

image

image

alneberg commented 5 years ago

I see, so the problem is with that script, and not with CheckM... I did some recent changes to that script. Which version of it is it that you're using?

choosehappy commented 5 years ago

It looks like the one from October, id: dd180834025c66002e51a9e31440d588d0e5c183

https://github.com/BinPro/CONCOCT/commit/dd180834025c66002e51a9e31440d588d0e5c183

alneberg commented 5 years ago

Thank you! So the extract script looks like this? https://github.com/BinPro/CONCOCT/blob/dd180834025c66002e51a9e31440d588d0e5c183/scripts/extract_fasta_bins.py

choosehappy commented 5 years ago

you would think so, but in fact it appears to be a bit different.

i can assure you i have no modified it : )

root@5de4c4a2be1f:/opt/CONCOCT/scripts# diff extract_fasta_bins_github.py extract_fasta_bins.py
2a3
>
24,27c25,32
<         output_file = os.path.join(args.output_path, "{0}.fa".format(cluster_id))
<         seqs = [all_seqs[contig_id] for contig_id in contig_ids]
<         with open(output_file, 'w') as ofh:
<             SeqIO.write(seqs, ofh, 'fasta')
---
>         try:
>             print(contig_ids)
>             output_file = os.path.join(args.output_path, "{0}.fa".format(cluster_id))
>             seqs = [all_seqs[contig_id] for contig_id in contig_ids]
>             with open(output_file, 'w') as ofh:
>                 SeqIO.write(seqs, ofh, 'fasta')
>         except:
>             print(contig_id)
alneberg commented 5 years ago

Hmm, strange. I cannot find any branch with that version of the file. The history of the main file does not show anything like that. How have you installed concoct? Did you use any non-standard branch?

choosehappy commented 5 years ago

I actually have notes since i've done this in a docker image...

nothing seems out of the ordinary

git clone https://github.com/BinPro/CONCOCT.git
cd CONCOCT

pip2 install -r requirements.txt 
python setup.py install
nosetests

maybe i should git pull the latest version and try again?

i probably should have done this originally but (a) wasn't sure if this was a concoct or checkm issue and (b) i'd need to be very careful to not destroy my data reproducibliity

alneberg commented 5 years ago

You could try to run the github version of the extract_fasta_bins.py script and see if that makes a difference first?

choosehappy commented 5 years ago

new one doesn't work at all, in fact:

root@5de4c4a2be1f:/data/m/ea2# wget https://raw.githubusercontent.com/BinPro/CONCOCT/develop/scripts/extract_fasta_bins.py
c--2019-01-25 16:24:58--  https://raw.githubusercontent.com/BinPro/CONCOCT/develop/scripts/extract_fasta_bins.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... h151.101.112.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.112.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1299 (1.3K) [text/plain]
Saving to: 'extract_fasta_bins.py'

extract_fasta_bins.py                                       100%[========================================================================================================================================>]   1.27K  --.-KB/s    in 0s

2019-01-25 16:24:58 (106 MB/s) - 'extract_fasta_bins.py' saved [1299/1299]

root@5de4c4a2be1f:/data/m/ea2# chmod +x extract_fasta_bins.py
root@5de4c4a2be1f:/data/m/ea2# python e^C
root@5de4c4a2be1f:/data/m/ea2# ./extract_fasta_bins.py ^C
root@5de4c4a2be1f:/data/m/ea2# ./extract_fasta_bins.py megahit_output/final.contigs.fa ./concoct-output/clustering_merged.csv --output_path merged_fa_new
Traceback (most recent call last):
  File "./extract_fasta_bins.py", line 39, in <module>
    main(args)
  File "./extract_fasta_bins.py", line 27, in main
    seqs = [all_seqs[contig_id] for contig_id in contig_ids]
KeyError: 'contig_id'
root@5de4c4a2be1f:/data/m/ea2#

old one still works

alneberg commented 5 years ago

Ok, that error is because of a mismatch between the new output format and the old one. The new script assumes there is not a header in the clustering file. So removing the first line of the clustering file should fix that particular error

choosehappy commented 5 years ago

this is going to drive you crazy....but there is no header in my files:

image

choosehappy commented 5 years ago

sorry, mean there isn't one in the "merged" file, but there is in the original file.

i'm only using the merged file

alneberg commented 5 years ago

This is getting more and more exiting actually! Somehow the phrase 'contig_id' is added as a key in the contig_ids list...

alneberg commented 5 years ago

I will probably not have time to look at this until monday but could you export your environment with pip freeze and share it here with me, please?

Otherwise if you're used to debugging python this is probably not that complicated. Somewhere here https://github.com/BinPro/CONCOCT/blob/c21349bdbe68139e82fa81dc7e14e5e8679a72a3/scripts/extract_fasta_bins.py#L18-L25 the 'contig_id' is introduced.

But yes, happy to check it out for you next week if you'd like me to.

choosehappy commented 5 years ago

well essentially, it should be pretty straight forward to just extract the 0 fasta on its own and add it to that directory, right?

i assume the current head of this repo doesn't have this issue, so i can upgrade and use it on all further samples without any problems?

I just don't want to have to re-do all downstream analysis for this set, as we've only noticed this discrepancy today and we've been working with the output for some time now

alneberg commented 5 years ago

I'm afraid this seems to be an issue with the current head of the repo as well, as you just verified when trying the github version of the script? I'm suspecting it might be a difference between python 2 and python 3?

Yes, there is no magic happening in the fasta extraction script. If you're comfortable with programming you can write your own script to extract fasta files based on the clustering file.

For debugging and fixing this for other users however, it would greatly help me if I could see your python environment. Could you please supply the output of pip freeze?

choosehappy commented 5 years ago

sure, see below.

if there is a general solution possible, i'm happy to help debug

root@5de4c4a2be1f:/data/m/ea2/concoct-output# python --version
Python 2.7.12
root@5de4c4a2be1f:/data/m/ea2/concoct-output# pip freeze
backports.functools-lru-cache==1.5
bcbio-gff==0.6.4
biopython==1.72
bz2file==0.98
certifi==2018.10.15
chardet==3.0.4
checkm-genome==1.0.14
Click==7.0
colormath==3.0.0
concoct==0.5.0
cutadapt==1.18
cycler==0.10.0
Cython==0.29
decorator==4.3.0
DendroPy==4.4.0
future==0.17.1
idna==2.7
Jinja2==2.10
kiwisolver==1.0.1
lzstring==1.0.4
Markdown==3.0.1
MarkupSafe==1.1.0
matplotlib==2.2.3
multiqc==1.6
mysqlclient==1.3.13
networkx==2.2
nose==1.3.7
numpy==1.15.4
pandas==0.23.4
pyparsing==2.3.0
pysam==0.15.1
python-dateutil==2.7.5
pytz==2018.7
PyYAML==3.13
requests==2.20.1
scikit-learn==0.20.0
scipy==1.1.0
scour==0.32
simplejson==3.16.0
six==1.10.0
spectra==0.0.11
subprocess32==3.5.3
urllib3==1.24.1
xopen==0.3.5
choosehappy commented 5 years ago

found the problem

this command:

/opt/toolbox/scripts/concoct/merge_cutup_clustering.py ./concoct-output/clustering_gt2000.csv > ./concoct-output/clustering_merged.csv

actually still produces the header, but sticks it somewhere in the middle of the output file:

image

such that when the extract command takes place and encounters this line, then the error "KeyError: 'contig_id'" is produced

removing contig,0 from the merged file allows the extract script to work without error

alneberg commented 5 years ago

Oh, that's embarrassing! Thank you for pointing it out! It should be possible to remove any such header. I'll go ahead and close this issue now. Please let me know if you run into further issues.