adriantich / DnoisE

Distance denoise by Entropy
GNU General Public License v3.0

ValueError when writing output_ratio_d #27

Closed: hempelc closed this issue 10 months ago

hempelc commented 10 months ago

Dear Adri,

I've been trying to run DnoisE, and the denoising step works as expected:

denoising dataset of 13727 sequences
100%|█████████████████| 13726/13726 [01:37<00:00, 140.77it/s]

However, I run into the following error afterwards:

writing output_ratio_d
  1%|▏                       | 1/102 [00:00<00:03, 27.42it/s]
Traceback (most recent call last):
  File "/Users/christopherhempel/mambaforge/envs/dnoise/bin/dnoise", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/christopherhempel/mambaforge/envs/dnoise/lib/python3.11/site-packages/dnoise/DnoisE.py", line 50, in main
    run_denoise(de)
  File "/Users/christopherhempel/mambaforge/envs/dnoise/lib/python3.11/site-packages/dnoise/running_denoise.py", line 159, in run_denoise
    row = pd.Series(row[0], index=[de.first_col_names + de.abund_col_names + [de.seq]][0])
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/christopherhempel/mambaforge/envs/dnoise/lib/python3.11/site-packages/pandas/core/series.py", line 500, in __init__
    com.require_length_match(data, index)
  File "/Users/christopherhempel/mambaforge/envs/dnoise/lib/python3.11/site-packages/pandas/core/common.py", line 576, in require_length_match
    raise ValueError(
ValueError: Length of values (7) does not match length of index (3)

At first, I thought this was caused by a pandas version incompatibility, so for the last couple of days I have tried all sorts of DnoisE installations: via conda, via mamba, via install.sh, and manually, both with and without the executable. I always get the exact same error. Here are my installed packages (in mamba):

dependencies:
  - bzip2=1.0.8
  - ca-certificates=2023.7.22
  - levenshtein=0.21.0
  - libblas=3.9.0
  - libcblas=3.9.0
  - libcxx=16.0.6
  - libexpat=2.5.0
  - libffi=3.4.2
  - libgfortran=5.0.0
  - libgfortran5=13.2.0
  - liblapack=3.9.0
  - libopenblas=0.3.24
  - libsqlite=3.43.2
  - libzlib=1.2.13
  - llvm-openmp=17.0.3
  - ncurses=6.4
  - numpy=1.26.0
  - openssl=3.1.3
  - pip=23.3.1
  - python=3.11.0
  - python_abi=3.11
  - rapidfuzz=2.15.2
  - readline=8.2
  - setuptools=68.2.2
  - tk=8.6.13
  - wheel=0.41.2
  - xz=5.2.6
  - pip:
      - dnoise==1.4
      - pandas==2.0.0
      - python-dateutil==2.8.2
      - pytz==2023.3.post1
      - six==1.16.0
      - tqdm==4.66.1
      - tzdata==2023.3

Do you have any idea what could be going on?

Thanks so much for your help! Chris

hempelc commented 10 months ago

Update: it worked fine with a different FASTA file, so it seems to be related to the format of the FASTA file I initially used. At first glance, the format seems to be fine (sequence ID followed by ';' followed by size=XXX), but something else must be going on. I will dig into this and report what I find!

hempelc commented 10 months ago

Okay, I think I found the issue. My FASTA file contains many sequences with identical sequence IDs, like so (just IDs + size shown):

>seq:9;size=11961
>seq:9;size=1348
>seq:9;size=14151

When I removed sequences with duplicate IDs, DnoisE worked. These ID duplicates in my FASTA file must be an error in the processing pipeline I used. So, the issue is resolved for me, but maybe you could consider adding a sanity check for duplicate IDs, just food for thought!
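
For anyone who hits the same error, here is a minimal sketch of the kind of sanity check I mean. It is not part of DnoisE; the function name find_duplicate_ids is made up for illustration, and it assumes headers in the 'ID;size=XXX' format shown above, so the ID is everything between '>' and the first ';'. The sequences and the seq:10 record in the example are placeholders.

```python
from collections import Counter


def find_duplicate_ids(fasta_lines):
    """Return {sequence_id: count} for IDs that occur more than once.

    Assumption: headers look like '>seq:9;size=11961', so the ID is the
    text between '>' and the first ';'. Non-header lines are ignored.
    """
    counts = Counter(
        line[1:].strip().split(";", 1)[0]
        for line in fasta_lines
        if line.startswith(">")
    )
    return {seq_id: n for seq_id, n in counts.items() if n > 1}


# Headers taken from the example above; sequences and seq:10 are placeholders.
records = [
    ">seq:9;size=11961", "ACGT",
    ">seq:9;size=1348", "ACGA",
    ">seq:9;size=14151", "ACGG",
    ">seq:10;size=5", "ACGC",
]
duplicates = find_duplicate_ids(records)
print(duplicates)  # {'seq:9': 3}
```

Since an open file yields its lines one by one, the same function can be called on a file handle directly (for example inside a "with open(...) as handle:" block) before passing the FASTA file to DnoisE.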

adriantich commented 10 months ago

Nice that you found it out!