biocore / songbird

Vanilla regression methods for microbiome differential abundance analysis
BSD 3-Clause "New" or "Revised" License

Drop duplicates in metadata #85

Closed (mortonjt closed 4 years ago)

mortonjt commented 4 years ago

Getting this error from the standalone `songbird multinomial` command:

Traceback (most recent call last):
  File "/Users/jmorton/miniconda3/envs/qiime2-2019.7/bin/songbird", line 113, in <module>
    songbird()
  File "/Users/jmorton/miniconda3/envs/qiime2-2019.7/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/Users/jmorton/miniconda3/envs/qiime2-2019.7/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/Users/jmorton/miniconda3/envs/qiime2-2019.7/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/jmorton/miniconda3/envs/qiime2-2019.7/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/jmorton/miniconda3/envs/qiime2-2019.7/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/Users/jmorton/miniconda3/envs/qiime2-2019.7/bin/songbird", line 76, in multinomial
    min_sample_count, min_feature_count)
  File "/Users/jmorton/miniconda3/envs/qiime2-2019.7/lib/python3.6/site-packages/songbird/util.py", line 172, in match_and_filter
    table = table.sort(sort_f=sort_f, axis='sample')
  File "/Users/jmorton/miniconda3/envs/qiime2-2019.7/lib/python3.6/site-packages/biom/table.py", line 2123, in sort
    return self.sort_order(sort_f(self.ids(axis=axis)), axis=axis)
  File "/Users/jmorton/miniconda3/envs/qiime2-2019.7/lib/python3.6/site-packages/songbird/util.py", line 170, in sort_f
    return [xs[metadata.index.get_loc(x)] for x in xs]
  File "/Users/jmorton/miniconda3/envs/qiime2-2019.7/lib/python3.6/site-packages/songbird/util.py", line 170, in <listcomp>
    return [xs[metadata.index.get_loc(x)] for x in xs]
IndexError: boolean index did not match indexed array along dimension 0; dimension is 176 but corresponding boolean dimension is 196

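This is because there are duplicate sample IDs in the metadata. When the index is not unique, `metadata.index.get_loc(x)` can return a boolean mask over every metadata row (length 196 here) instead of a single position, and that mask can't be used to index the 176 sample IDs from the table.

Not sure whether this sort of thing is already being caught in q2 metadata, but we probably want to catch it in the standalone API.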
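To make the failure mode concrete, here is a minimal sketch that reproduces the same `IndexError` outside of songbird. The metadata frame, the `treatment` column, and the sample IDs are made up for illustration; only `Index.get_loc` and `Index.duplicated` are real pandas calls. The last lines show one possible way to list the offending IDs up front, which is an assumption about how a check could look, not what songbird does.

```python
import numpy as np
import pandas as pd

# Hypothetical metadata: sample ID "s2" appears twice, and the index is
# deliberately unsorted so that get_loc returns a mask rather than a slice.
metadata = pd.DataFrame(
    {"treatment": ["a", "b", "c", "b"]},
    index=["s1", "s2", "s3", "s2"],
)

# Sample IDs as they would come from a (hypothetical) feature table.
xs = np.array(["s1", "s2", "s3"])

unique_loc = metadata.index.get_loc("s1")  # a single integer position
dup_loc = metadata.index.get_loc("s2")     # boolean mask, one entry per metadata row

try:
    xs[dup_loc]  # 4-element mask applied to a 3-element array
    error_message = None
except IndexError as exc:
    error_message = str(exc)

# Possible up-front check: collect every ID that occurs more than once.
duplicated_ids = sorted(set(metadata.index[metadata.index.duplicated(keep=False)]))

print(error_message)
print(duplicated_ids)
```

The exact wording of the NumPy message varies between versions, but it is the same mask-length mismatch as in the traceback above.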

We can add a fix at the following line: https://github.com/biocore/songbird/blob/master/songbird/util.py#L168

metadata = metadata.loc[~metadata.index.duplicated(keep='first')]
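As a rough sketch of how that one-liner could be wrapped, the helper below keeps the first row for each sample ID and warns about what it dropped. The function name `drop_duplicate_samples` and the warning are assumptions added for illustration; the issue itself only proposes the `duplicated(keep='first')` filter.

```python
import warnings

import pandas as pd


def drop_duplicate_samples(metadata: pd.DataFrame) -> pd.DataFrame:
    """Keep the first metadata row for each sample ID (hypothetical helper)."""
    # True for every row whose index label has already been seen.
    is_dup = metadata.index.duplicated(keep="first")
    if is_dup.any():
        dropped = sorted(set(metadata.index[is_dup]))
        # The warning is an illustrative addition, not part of the proposed fix.
        warnings.warn(f"Dropping duplicate metadata entries for: {dropped}")
    return metadata.loc[~is_dup]


# Example with made-up sample IDs: "s2" is listed twice.
example = pd.DataFrame(
    {"treatment": ["a", "b", "c", "d"]},
    index=["s1", "s2", "s3", "s2"],
)
deduplicated = drop_duplicate_samples(example)
print(deduplicated)
```

Silently keeping the first row may hide a real data problem, so raising an error instead of warning would be a reasonable alternative.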