BinPro / CONCOCT

Clustering cONtigs with COverage and ComposiTion

panda reindex errors #278

Closed josieparis closed 4 years ago

josieparis commented 4 years ago

Hi!!

I am trying to run concoct following the manual as a first pass:

Run concoct: concoct --composition_file contigs_10K.fa --coverage_file coverage_table.tsv -b concoct_output/

But I am getting the following pandas errors:

Traceback (most recent call last):
  File "/gpfs/ts0/home/jrp228/.local/bin/concoct", line 4, in <module>
    __import__('pkg_resources').run_script('concoct==1.1.0', 'concoct')
  File "/gpfs/ts0/shared/software/Python/3.6.4-foss-2018a/lib/python3.6/site-packages/pkg_resources/__init__.py", line 743, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/gpfs/ts0/shared/software/Python/3.6.4-foss-2018a/lib/python3.6/site-packages/pkg_resources/__init__.py", line 1498, in run_script
    exec(code, namespace, namespace)
  File "/gpfs/ts0/home/jrp228/.local/lib/python3.6/site-packages/concoct-1.1.0-py3.6-linux-x86_64.egg/EGG-INFO/scripts/concoct", line 90, in <module>
    results = main(args)
  File "/gpfs/ts0/home/jrp228/.local/lib/python3.6/site-packages/concoct-1.1.0-py3.6-linux-x86_64.egg/EGG-INFO/scripts/concoct", line 20, in main
    composition, cov, cov_range = load_data(args)
  File "/gpfs/ts0/home/jrp228/.local/lib/python3.6/site-packages/concoct-1.1.0-py3.6-linux-x86_64.egg/concoct/input.py", line 25, in load_data
    read_length = args.read_length
  File "/gpfs/ts0/home/jrp228/.local/lib/python3.6/site-packages/concoct-1.1.0-py3.6-linux-x86_64.egg/concoct/input.py", line 92, in load_coverage
    axis='index')
  File "/gpfs/ts0/home/jrp228/.local/lib/python3.6/site-packages/pandas/core/ops.py", line 2030, in f
    level=level)
  File "/gpfs/ts0/home/jrp228/.local/lib/python3.6/site-packages/pandas/core/ops.py", line 1917, in _combine_series_frame
    return self._combine_match_index(other, func, level=level)
  File "/gpfs/ts0/home/jrp228/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 5097, in _combine_match_index
    copy=False)
  File "/gpfs/ts0/home/jrp228/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 3792, in align
    broadcast_axis=broadcast_axis)
  File "/gpfs/ts0/home/jrp228/.local/lib/python3.6/site-packages/pandas/core/generic.py", line 8428, in align
    fill_axis=fill_axis)
  File "/gpfs/ts0/home/jrp228/.local/lib/python3.6/site-packages/pandas/core/generic.py", line 8514, in _align_series
    fdata = fdata.reindex_indexer(join_index, lidx, axis=1)
  File "/gpfs/ts0/home/jrp228/.local/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 1224, in reindex_indexer
    self.axes[axis]._can_reindex(indexer)
  File "/gpfs/ts0/home/jrp228/.local/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3087, in _can_reindex
    raise ValueError("cannot reindex from a duplicate axis")
ValueError: cannot reindex from a duplicate axis

I noticed that the latest code update was to "Fix pandas errors and warnings - in future perhaps drop pandas". Is this related? Any help on this is greatly appreciated!

FYI, I am on python/3.6.4 and have checked all the python requirements, and all are ok

Thanks!

alneberg commented 4 years ago

Hi @josieparis!

I don't know exactly what causes this error but since it occurs in the load_coverage method, I would suspect that the coverage file might not have the correct format. Could you perhaps post the first few lines of it?

You could try to recreate the coverage file if you still have the bam files around with the command:

concoct_coverage_table.py contigs_10K.bed mapping/Sample*.sorted.bam > coverage_table.tsv

josieparis commented 4 years ago

Thanks @alneberg!

Here's the head of my coverage file:

contig  cov_mean_sample_GH10_F  cov_mean_sample_GH11_M  cov_mean_sample_GH12_M  cov_mean_sample_GH13_M  cov_mean_sample_GH14_M  cov_mean_sample_GH15_M  cov_mean_sample_GH17_M  cov_mean_sample_GH18_M  cov_mean_sample_GH19_M  cov_mean_sample_GH1_F   cov_mean_sample_GH20_M  cov_mean_sample_GH2_F   cov_mean_sample_GH3_F   cov_mean_sample_GH4_F   cov_mean_sample_GH5_F   cov_mean_sample_GH6_F   cov_mean_sample_GH7_F   cov_mean_sample_GH8_F   cov_mean_sample_GH9_F
000083F_0.2.concoct_part_0  10.537  28.038  13.885  81.840  31.495  28.808  24.514  29.394  26.686  12.754  59.409  5.700   8.709   9.135   12.563  11.569  9.933   7.414   9.066
000083F_0.2.concoct_part_1  6.691   24.077  8.822   41.119  14.686  17.568  19.401  16.255  16.865  7.707   39.888  3.196   4.637   5.058   6.742   6.753   4.731   4.004   4.699
000083F_0.2.concoct_part_2  5.881   8.118   3.927   8.017   7.652   5.508   4.944   5.071   7.262   5.324   15.045  2.930   4.117   4.124   4.418   4.830   3.845   3.029   4.149
000083F_0.2.concoct_part_3  3.753   5.768   4.537   11.132  12.330  3.639   3.738   4.306   4.683   4.304   12.735  1.568   3.054   2.698   3.235   3.423   2.798   2.196   2.080
000083F_0.2.concoct_part_4  3.272   7.734   2.575   7.146   11.091  2.109   2.268   4.774   2.728   5.543   19.151  0.740   4.421   2.907   4.248   3.498   3.904   2.285   2.759

I wondered if it might be the numbers in the scaffold names as I'm aware that this is a known issue, but I ran a test changing the digits to letters, and unfortunately, the error reoccurs.

I also tried rerunning concoct_coverage_table.py but the coverage file is the same ...

Thanks for your help with this!

alneberg commented 4 years ago

Hmm, I still don't have a clear view of what is causing the issue. Googling the error, I found this.

Interpreting the answer, I suspect there might be duplicate contig names in either the coverage file or the contigs_10K.bed. Could you check that?
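
One way to run that check is sketched below. It uses only the Python standard library and counts how often each name appears in a tab-separated file. The helper name `duplicated_names` is hypothetical, not part of CONCOCT. It assumes the coverage table has a header row with the contig name in the first column, and that the BED file has no header and keeps the sub-contig name in its fourth column; adjust `name_column` if your files differ.

```python
import csv
from collections import Counter


def duplicated_names(path, has_header, name_column=0):
    """Return {name: count} for every name that occurs more than once.

    Hypothetical helper for illustration; `path` is a tab-separated file.
    """
    counts = Counter()
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        if has_header:
            next(reader, None)  # skip the header row
        for row in reader:
            if row:  # ignore blank lines
                counts[row[name_column]] += 1
    return {name: n for name, n in counts.items() if n > 1}
```

For example, `duplicated_names("coverage_table.tsv", has_header=True)` and `duplicated_names("contigs_10K.bed", has_header=False, name_column=3)` should both return an empty dict when every contig name is unique. Any non-empty result lists the repeated names and how many times each occurs.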

josieparis commented 4 years ago

hurray!! Yes, for some reason, both my contigs_10k.bed and coverage table had triplicate values! Strange ... I must have appended data at some point ...

Thanks again @alneberg, really appreciate the help
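
For anyone who lands here with the same error, a minimal sketch of one possible cleanup follows. It copies a tab-separated file and keeps only the first row seen for each name. The function `drop_repeated_rows` is hypothetical, not a CONCOCT tool. It assumes the repeated rows are exact copies, as happens when data is appended more than once. If the repeated rows carry different values, regenerating the BED file and coverage table from scratch is the safer route.

```python
import csv


def drop_repeated_rows(src, dst, has_header, name_column=0):
    """Copy src to dst, keeping the first row for each name.

    Hypothetical helper for illustration. Returns the number of rows dropped.
    """
    seen = set()
    dropped = 0
    header = None
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        reader = csv.reader(fin, delimiter="\t")
        writer = csv.writer(fout, delimiter="\t", lineterminator="\n")
        if has_header:
            header = next(reader, None)
            if header is not None:
                writer.writerow(header)
        for row in reader:
            if not row:
                continue  # ignore blank lines
            if header is not None and row == header:
                dropped += 1  # header repeated by an accidental append
                continue
            name = row[name_column]
            if name in seen:
                dropped += 1
                continue
            seen.add(name)
            writer.writerow(row)
    return dropped
```

The function writes to a new file rather than editing in place, so the original stays available for comparison. For a coverage table that would be something like `drop_repeated_rows("coverage_table.tsv", "coverage_table.dedup.tsv", has_header=True)`, where the output file name is only an example.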