biocore / microprot

structural annotation pipeline for microbial genomes and metagenomes
BSD 3-Clause "New" or "Revised" License

BUG: calculate_Neff crashes if duplicate entries in DissimilarityMatrix #53

Closed tkosciol closed 7 years ago

tkosciol commented 7 years ago

This isn't directly a calculate_Neff error, but given the development cycle for skbio, we need to find a workaround here first.

If the MSA file contains duplicate headers (which will happen, because (1) HHsuite trims headers to a limited number of characters and (2) we'll be running HHblits against 2 databases that may contain duplicate entries), it gives an error:

DissimilarityMatrixError in line 270 of /projects/microprot/microprot/snakemake/Snakefile:
IDs must be unique. Found the following duplicate IDs: 'tr|A0A158EL77|A0A158EL77_9BURK'
  File "/projects/microprot/microprot/snakemake/Snakefile", line 270, in __rule_MSA_ripe
  File "/projects/microprot/microprot/scripts/calculate_Neff.py", line 47, in hamming_distance_matrix
  File "/home/tkosciolek/conda/envs/microprot/lib/python3.5/site-packages/skbio/stats/distance/_base.py", line 795, in from_iterable
  File "/home/tkosciolek/conda/envs/microprot/lib/python3.5/site-packages/skbio/stats/distance/_base.py", line 107, in __init__
  File "/home/tkosciolek/conda/envs/microprot/lib/python3.5/site-packages/skbio/stats/distance/_base.py", line 868, in _validate
  File "/home/tkosciolek/conda/envs/microprot/lib/python3.5/site-packages/skbio/stats/distance/_base.py", line 683, in _validate
  File "/home/tkosciolek/conda/envs/microprot/lib/python3.5/concurrent/futures/thread.py", line 55, in run

One workaround would be to drop the headers altogether, because we're only interested in the number of sequences anyway. Another option is to prefilter the MSA for redundant sequences (if that is what is happening) before calculating the distance matrix.
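A rough sketch of the header-side workaround, for illustration only: since the headers carry no information we need, they could be rewritten to be unique before the distance matrix is built, which keeps every sequence in the MSA. The function name `make_unique_ids` and the `_dup` suffix are hypothetical and are not part of microprot or skbio; this is not necessarily how the final fix works.

```python
def make_unique_ids(ids, sep="_dup"):
    """Return a list of IDs in which repeated entries get a numeric suffix.

    Hypothetical helper: the first occurrence of an ID is kept unchanged,
    later occurrences become '<id>_dup1', '<id>_dup2', and so on.
    Order and length of the input are preserved, so no sequence is dropped.
    """
    used = set()      # every ID already emitted
    counts = {}       # how many suffixes have been tried per original ID
    unique = []
    for id_ in ids:
        candidate = id_
        # Keep bumping the suffix until the candidate collides with nothing.
        while candidate in used:
            counts[id_] = counts.get(id_, 0) + 1
            candidate = "{}{}{}".format(id_, sep, counts[id_])
        used.add(candidate)
        unique.append(candidate)
    return unique


headers = [
    "tr|A0A158EL77|A0A158EL77_9BURK",
    "tr|OTHER|OTHER_9BURK",            # made-up second header
    "tr|A0A158EL77|A0A158EL77_9BURK",  # duplicate, as in the error above
]
print(make_unique_ids(headers))
```

The renamed IDs would then be passed on in place of the original headers. Because the number of sequences is unchanged, anything computed from the sequences alone should be unaffected.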

tkosciol commented 7 years ago

65% of the errors I'm getting in the pipeline are caused by this issue, so it should be high priority. The remaining 35% are most likely scheduler problems generating I/O conflicts, which I need to discuss with Jeff.

tkosciol commented 7 years ago

example on Barnacle in: /projects/microprot/benchmarking/snakemake_test/MSA_ripe_error @sjanssen2

sjanssen2 commented 7 years ago

I don't like the pre-filtering, because that would change the value of Neff

tkosciol commented 7 years ago

solved by PR #54