Closed choosehappy closed 5 years ago
Thank you @choosehappy for a very detailed report! Could the extract fasta script be the problem? Could you check whether there is a 0.fa exported from the concoct results to begin with?
it does not appear to be for any of my samples (n=4)
Sorry, what do you mean: does the concoct script extract_fasta_bins.py output the 0.fa or not?
it does not
I see, so the problem is with that script, and not with CheckM... I did some recent changes to that script. Which version of it is it that you're using?
It looks like the one from October, id: dd180834025c66002e51a9e31440d588d0e5c183
https://github.com/BinPro/CONCOCT/commit/dd180834025c66002e51a9e31440d588d0e5c183
Thank you! So the extract script looks like this? https://github.com/BinPro/CONCOCT/blob/dd180834025c66002e51a9e31440d588d0e5c183/scripts/extract_fasta_bins.py
you would think so, but in fact it appears to be a bit different.
i can assure you i have not modified it : )
root@5de4c4a2be1f:/opt/CONCOCT/scripts# diff extract_fasta_bins_github.py extract_fasta_bins.py
2a3
>
24,27c25,32
< output_file = os.path.join(args.output_path, "{0}.fa".format(cluster_id))
< seqs = [all_seqs[contig_id] for contig_id in contig_ids]
< with open(output_file, 'w') as ofh:
< SeqIO.write(seqs, ofh, 'fasta')
---
> try:
> print(contig_ids)
> output_file = os.path.join(args.output_path, "{0}.fa".format(cluster_id))
> seqs = [all_seqs[contig_id] for contig_id in contig_ids]
> with open(output_file, 'w') as ofh:
> SeqIO.write(seqs, ofh, 'fasta')
> except:
> print(contig_id)
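The diff above suggests why 0.fa never appears: the local copy wraps the whole write in a try/except, so any KeyError silently skips the entire bin. Here is a minimal sketch of that failure mode (the contig names and cluster layout are made up for illustration; this is not the actual CONCOCT code):

```python
# Hypothetical reconstruction: a stray header row ('contig_id') inside
# bin 0's contig list raises KeyError, the except swallows it, and the
# whole bin is skipped without writing its .fa file.
all_seqs = {"k141_1": "ACGT", "k141_2": "GGCC"}   # stand-in contig records
clusters = {"0": ["contig_id", "k141_1"],          # header row leaked into bin 0
            "1": ["k141_2"]}

written = []
for cluster_id, contig_ids in sorted(clusters.items()):
    try:
        seqs = [all_seqs[c] for c in contig_ids]   # KeyError on 'contig_id'
        written.append("{0}.fa".format(cluster_id))
    except KeyError:
        pass                                       # bin dropped without a trace

print(written)
```

Only 1.fa is written; bin 0 vanishes exactly as reported.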
Hmm, strange. I cannot find any branch with that version of the file, and the history of the main file does not show anything like that. How did you install concoct? Did you use any non-standard branch?
I actually have notes since i've done this in a docker image...
nothing seems out of the ordinary
git clone https://github.com/BinPro/CONCOCT.git
cd CONCOCT
pip2 install -r requirements.txt
python setup.py install
nosetests
maybe i should git pull the latest version and try again?
i probably should have done this originally but (a) wasn't sure if this was a concoct or checkm issue and (b) i'd need to be very careful not to destroy my data reproducibility
You could try to run the github version of the extract_fasta_bins.py script and see if that makes a difference first?
new one doesn't work at all, in fact:
root@5de4c4a2be1f:/data/m/ea2# wget https://raw.githubusercontent.com/BinPro/CONCOCT/develop/scripts/extract_fasta_bins.py
--2019-01-25 16:24:58-- https://raw.githubusercontent.com/BinPro/CONCOCT/develop/scripts/extract_fasta_bins.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.112.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.112.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1299 (1.3K) [text/plain]
Saving to: 'extract_fasta_bins.py'
extract_fasta_bins.py 100%[========================================================================================================================================>] 1.27K --.-KB/s in 0s
2019-01-25 16:24:58 (106 MB/s) - 'extract_fasta_bins.py' saved [1299/1299]
root@5de4c4a2be1f:/data/m/ea2# chmod +x extract_fasta_bins.py
root@5de4c4a2be1f:/data/m/ea2# python e^C
root@5de4c4a2be1f:/data/m/ea2# ./extract_fasta_bins.py ^C
root@5de4c4a2be1f:/data/m/ea2# ./extract_fasta_bins.py megahit_output/final.contigs.fa ./concoct-output/clustering_merged.csv --output_path merged_fa_new
Traceback (most recent call last):
File "./extract_fasta_bins.py", line 39, in <module>
main(args)
File "./extract_fasta_bins.py", line 27, in main
seqs = [all_seqs[contig_id] for contig_id in contig_ids]
KeyError: 'contig_id'
root@5de4c4a2be1f:/data/m/ea2#
old one still works
Ok, that error is because of a mismatch between the new output format and the old one. The new script assumes there is no header in the clustering file, so removing the first line of the clustering file should fix that particular error.
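The suggested workaround can be sketched like this (a Python 3 sketch with in-memory stand-ins; in practice these would be open() calls on the real clustering CSV):

```python
import io

# Stand-ins for the clustering file and its cleaned copy; the row
# contents are examples, not real contig names.
src = io.StringIO("contig_id,cluster_id\nk141_1,0\nk141_2,1\n")
dst = io.StringIO()

next(src)            # discard the 'contig_id,cluster_id' header line
dst.writelines(src)  # copy the remaining rows unchanged

print(dst.getvalue())
```

Note this only helps if the header is the first line; as it turns out below, that assumption does not hold here.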
this is going to drive you crazy....but there is no header in my files:
sorry, i mean there isn't one in the "merged" file, but there is in the original file.
i'm only using the merged file
This is actually getting more and more exciting! Somehow the string 'contig_id' is added as a key in the contig_ids list...
I will probably not have time to look at this until Monday, but could you export your environment with pip freeze and share it here with me, please?
Otherwise if you're used to debugging python this is probably not that complicated. Somewhere here https://github.com/BinPro/CONCOCT/blob/c21349bdbe68139e82fa81dc7e14e5e8679a72a3/scripts/extract_fasta_bins.py#L18-L25 the 'contig_id' is introduced.
But yes, happy to check it out for you next week if you'd like me to.
well essentially, it should be pretty straightforward to just extract the 0 fasta on its own and add it to that directory, right?
i assume the current head of this repo doesn't have this issue, so i can upgrade and use it on all further samples without any problems?
I just don't want to have to re-do all downstream analysis for this set, as we've only noticed this discrepancy today and we've been working with the output for some time now
I'm afraid this seems to be an issue with the current head of the repo as well, as you just verified when trying the github version of the script. I suspect it might be a difference between python 2 and python 3?
Yes, there is no magic happening in the fasta extraction script. If you're comfortable with programming you can write your own script to extract fasta files based on the clustering file.
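Such a standalone extraction script can be sketched in a few lines of stdlib-only Python. This is not the CONCOCT implementation; the header check and the in-memory file handles are assumptions for illustration (real usage would open the contigs FASTA and the merged clustering CSV):

```python
import csv
import io
from collections import defaultdict

def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def extract_bins(fasta_handle, clustering_handle):
    """Return {cluster_id: fasta_text}, one entry per bin, bin 0 included."""
    all_seqs = dict(read_fasta(fasta_handle))
    bins = defaultdict(list)
    for row in csv.reader(clustering_handle):
        if row[0] == "contig_id":      # tolerate a stray header row anywhere
            continue
        bins[row[1]].append(row[0])
    return {cid: "".join(">{0}\n{1}\n".format(c, all_seqs[c]) for c in contigs)
            for cid, contigs in bins.items()}

# Tiny in-memory demonstration, including a header leaked mid-file
fasta = io.StringIO(">k141_1\nACGT\n>k141_2\nGGCC\n")
clustering = io.StringIO("k141_1,0\ncontig_id,cluster_id\nk141_2,1\n")
result = extract_bins(fasta, clustering)
print(sorted(result))
```

Each value in `result` is ready to be written out as `<cluster_id>.fa`.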
For debugging and fixing this for other users, however, it would greatly help me if I could see your python environment. Could you please supply the output of pip freeze?
sure, see below.
if there is a general solution possible, i'm happy to help debug
root@5de4c4a2be1f:/data/m/ea2/concoct-output# python --version
Python 2.7.12
root@5de4c4a2be1f:/data/m/ea2/concoct-output# pip freeze
backports.functools-lru-cache==1.5
bcbio-gff==0.6.4
biopython==1.72
bz2file==0.98
certifi==2018.10.15
chardet==3.0.4
checkm-genome==1.0.14
Click==7.0
colormath==3.0.0
concoct==0.5.0
cutadapt==1.18
cycler==0.10.0
Cython==0.29
decorator==4.3.0
DendroPy==4.4.0
future==0.17.1
idna==2.7
Jinja2==2.10
kiwisolver==1.0.1
lzstring==1.0.4
Markdown==3.0.1
MarkupSafe==1.1.0
matplotlib==2.2.3
multiqc==1.6
mysqlclient==1.3.13
networkx==2.2
nose==1.3.7
numpy==1.15.4
pandas==0.23.4
pyparsing==2.3.0
pysam==0.15.1
python-dateutil==2.7.5
pytz==2018.7
PyYAML==3.13
requests==2.20.1
scikit-learn==0.20.0
scipy==1.1.0
scour==0.32
simplejson==3.16.0
six==1.10.0
spectra==0.0.11
subprocess32==3.5.3
urllib3==1.24.1
xopen==0.3.5
found the problem
this command:
/opt/toolbox/scripts/concoct/merge_cutup_clustering.py ./concoct-output/clustering_gt2000.csv > ./concoct-output/clustering_merged.csv
actually still produces the header, but sticks it somewhere in the middle of the output file:
such that when the extract script runs and encounters this line, the error "KeyError: 'contig_id'" is produced
removing contig,0 from the merged file allows the extract script to work without error
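That manual cleanup can be done defensively with a one-pass filter that drops header rows wherever they appear in the merged file (the row contents below are examples, not real data):

```python
# Strip any header rows that merge_cutup_clustering.py may have left in
# the middle of the merged clustering file before running extraction.
rows = ["k141_1,0",
        "contig_id,cluster_id",   # header leaked mid-file by the merge step
        "k141_2,1"]

cleaned = [r for r in rows if not r.startswith("contig_id")]
print(cleaned)
```

Unlike deleting only the first line, this also catches a header embedded mid-file, which is exactly the case here.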
Oh, that's embarrassing! Thank you for pointing it out! It should be possible to remove any such header. I'll go ahead and close this issue now. Please let me know if you run into further issues.
This seems related to #240
I know that the developers of CONCOCT have used CheckM in the past to validate their results, so perhaps this is a new bug?
CONCOCT starts labeling bins at 0 and produces 29 bins:
When running checkm, only 28 bins are detected: ![image](https://user-images.githubusercontent.com/9681868/51755299-9a882700-20be-11e9-97f1-a6c0798d2734.png)
And bin 0 does not appear in the output; likewise, there is no bin 0 in the bin directory.
I can confirm this happens with at least 2 different unrelated samples.
Is this intended, or is there something special about bin 0? Is it a "garbage" bin not intended for analysis use?