CAMI-challenge / OPAL

OPAL: Open-community Profiling Assessment tooL
https://cami-challenge.github.io/OPAL/
Apache License 2.0

Fails to generate figures and HTML with a self-made gold-standard profile without "strain" data #44

Closed shenwei356 closed 2 years ago

shenwei356 commented 2 years ago

Hi Fernando, thanks for this great tool!

I've been using OPAL for some days and it has worked smoothly. But today I got an error with a self-made gold-standard profile for benchmarking.

Here are the profiles and outputs: test.tar.gz

Here's the log:

$ opal.py -v
opal.py 1.0.10

$ opal.py -g test_gs.profile test.profile  -f 1 -o test -l test 
2021-11-29 22:39:42,761 INFO Loading profiles...
2021-11-29 22:39:42,764 INFO done
2021-11-29 22:39:42,764 INFO Computing metrics...
2021-11-29 22:39:42,893 INFO done
2021-11-29 22:39:42,893 INFO Saving computed metrics...
Traceback (most recent call last):
  File "/home/shenwei/Public/app/miniconda3/envs/cami/bin/opal.py", line 394, in <module>
    main()
  File "/home/shenwei/Public/app/miniconda3/envs/cami/bin/opal.py", line 361, in main
    print_by_rank(output_dir, labels, pd_metrics)
  File "/home/shenwei/Public/app/miniconda3/envs/cami/bin/opal.py", line 81, in print_by_rank
    table = table.loc[pd.IndexSlice[order_rows,:], order_columns]
  File "/home/shenwei/Public/app/miniconda3/envs/cami/lib/python3.7/site-packages/pandas/core/indexing.py", line 889, in __getitem__
    return self._getitem_tuple(key)
  File "/home/shenwei/Public/app/miniconda3/envs/cami/lib/python3.7/site-packages/pandas/core/indexing.py", line 1060, in _getitem_tuple
    return self._getitem_lowerdim(tup)
  File "/home/shenwei/Public/app/miniconda3/envs/cami/lib/python3.7/site-packages/pandas/core/indexing.py", line 791, in _getitem_lowerdim
    return self._getitem_nested_tuple(tup)
  File "/home/shenwei/Public/app/miniconda3/envs/cami/lib/python3.7/site-packages/pandas/core/indexing.py", line 865, in _getitem_nested_tuple
    obj = getattr(obj, self.name)._getitem_axis(key, axis=axis)
  File "/home/shenwei/Public/app/miniconda3/envs/cami/lib/python3.7/site-packages/pandas/core/indexing.py", line 1113, in _getitem_axis
    return self._getitem_iterable(key, axis=axis)
  File "/home/shenwei/Public/app/miniconda3/envs/cami/lib/python3.7/site-packages/pandas/core/indexing.py", line 1053, in _getitem_iterable
    keyarr, indexer = self._get_listlike_indexer(key, axis, raise_missing=False)
  File "/home/shenwei/Public/app/miniconda3/envs/cami/lib/python3.7/site-packages/pandas/core/indexing.py", line 1266, in _get_listlike_indexer
    self._validate_read_indexer(keyarr, indexer, axis, raise_missing=raise_missing)
  File "/home/shenwei/Public/app/miniconda3/envs/cami/lib/python3.7/site-packages/pandas/core/indexing.py", line 1322, in _validate_read_indexer
    "Passing list-likes to .loc or [] with any missing labels "
KeyError: "Passing list-likes to .loc or [] with any missing labels is no longer supported. The following labels were missing: Index(['L1 norm error', 'Completeness', 'Purity', 'F1 score', 'True positives',\n       ...\n       'False positives (unfiltered)', 'False negatives (unfiltered)',\n       'Taxon counts (unfiltered)', 'Jaccard index (unfiltered)',\n       'Bray-Curtis distance (unfiltered)'],\n      dtype='object', name='metric', length=20). See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"

shenwei356 commented 2 years ago

I know the cause: the gold standard has no profile data at the strain rank.

I tried removing |Strain from the @Ranks header line, but it still failed. It did work after adding some strain-rank profile data.

shenwei356 commented 2 years ago

I've added a few patches to make it work. But I'm not sure this is the best way, so I'm pasting the changes here instead of creating a PR.

Adding after: opal.py#L79

#L79  table = table.pivot_table(index=['tool', 'sample'], columns='metric', values='value')
if len(table.columns) < len(order_columns) + 2:
    continue

Adding after: html_opal.py#L373

#L373 sorted_columns = get_columns(labels, mydf_metrics.columns.tolist())
if len(mydf_metrics.columns) < len(sorted_columns):
    continue

Adding after: plots.py#L292

#L292 for (sample1, sample2), value in profile_rank_to_sample12_to_braycurtis[rank].items():
    if sample1 not in gs_rank_to_sample12_to_braycurtis[rank]:
        continue

fernandomeyer commented 2 years ago

Thank you very much for reporting this issue, Wei Shen. As you said, it occurred if a profile contained strains but the gold standard profile did not.

This is now fixed in dev (9f1e161db8670e5089c14b7ee2eaa63104e3ff7c), similar to what you suggested. I just did not see a need to modify plots.py.

shenwei356 commented 2 years ago

Thank you for fixing it, looking forward to the new release!

shenwei356 commented 2 years ago

Hi Fernando, it seems there's a small newly introduced bug in 9f1e161.

AttributeError: module 'src.utils.constants' has no attribute 'SUM_ABUNDANCES'

A line like this could be added to src/utils/constants.py:

SUM_ABUNDANCES = 'Sum Abundance'

shenwei356 commented 2 years ago

Besides, another error occurred for a different custom gold-standard profile without the strain level, when testing with a profile that also lacks strains:

2021-12-07 22:10:56,126 INFO Loading profiles...
2021-12-07 22:10:56,127 INFO done
2021-12-07 22:10:56,127 INFO Computing metrics...
2021-12-07 22:10:56,174 INFO done
2021-12-07 22:10:56,174 INFO Saving computed metrics...
2021-12-07 22:10:56,244 INFO done
2021-12-07 22:10:56,244 INFO Creating beta diversity plots...
2021-12-07 22:10:56,244 INFO done
2021-12-07 22:10:56,244 INFO Creating rarefaction curves...
2021-12-07 22:10:56,770 INFO done
2021-12-07 22:10:56,770 INFO Creating more plots...
Traceback (most recent call last):
  File "/home/shenwei/Public/app/miniconda3/envs/cami/bin/opal.py", line 426, in <module>
    main()
  File "/home/shenwei/Public/app/miniconda3/envs/cami/bin/opal.py", line 406, in main
    plots_list += pl.plot_all(pd_metrics, labels, output_dir, args.metrics_plot_rel, args.metrics_plot_abs)
  File "/home/shenwei/Public/app/miniconda3/envs/cami/lib/python3.7/site-packages/src/plots.py", line 605, in plot_all
    plot_purity_completeness_per_tool_and_rank(pd_grouped, pd_mean, output_dir)
  File "/home/shenwei/Public/app/miniconda3/envs/cami/lib/python3.7/site-packages/src/plots.py", line 487, in plot_purity_completeness_per_tool_and_rank
    pd_mean = pd_mean.drop(c.GS, level='tool').drop(['strain', 'rank independent'], level='rank')
  File "/home/shenwei/Public/app/miniconda3/envs/cami/lib/python3.7/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/home/shenwei/Public/app/miniconda3/envs/cami/lib/python3.7/site-packages/pandas/core/frame.py", line 4913, in drop
    errors=errors,
  File "/home/shenwei/Public/app/miniconda3/envs/cami/lib/python3.7/site-packages/pandas/core/generic.py", line 4150, in drop
    obj = obj._drop_axis(labels, axis, level=level, errors=errors)
  File "/home/shenwei/Public/app/miniconda3/envs/cami/lib/python3.7/site-packages/pandas/core/generic.py", line 4183, in _drop_axis
    new_axis = axis.drop(labels, level=level, errors=errors)
  File "/home/shenwei/Public/app/miniconda3/envs/cami/lib/python3.7/site-packages/pandas/core/indexes/multi.py", line 2190, in drop
    return self._drop_from_level(codes, level, errors)
  File "/home/shenwei/Public/app/miniconda3/envs/cami/lib/python3.7/site-packages/pandas/core/indexes/multi.py", line 2242, in _drop_from_level
    raise KeyError(f"labels {not_found} not found in level")
KeyError: "labels ['strain'] not found in level"

Profiles: both_without_strain.tar.gz

It's easy to fix:

# edit src/plots.py  line 487
# pd_mean = pd_mean.drop(c.GS, level='tool').drop(['strain', 'rank independent'], level='rank')
pd_mean = pd_mean.drop(c.GS, level='tool').drop(['strain', 'rank independent'], level='rank', errors='ignore')
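For illustration, a minimal sketch (hypothetical MultiIndex, not OPAL's actual data) of what errors='ignore' changes: dropping a label that is absent from a MultiIndex level raises KeyError by default, but becomes a no-op with errors='ignore'.

```python
import pandas as pd

# Hypothetical MultiIndex with no 'strain' entries at the 'rank' level,
# mimicking metrics computed from a gold standard without strain data.
idx = pd.MultiIndex.from_tuples(
    [('tool_a', 'species'), ('tool_a', 'genus')], names=['tool', 'rank'])
pd_mean = pd.DataFrame({'value': [0.9, 0.8]}, index=idx)

# With errors='ignore', dropping the missing 'strain' label is a no-op
# instead of raising KeyError: "labels ['strain'] not found in level".
result = pd_mean.drop(['strain'], level='rank', errors='ignore')
print(len(result))  # 2
```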

fernandomeyer commented 2 years ago

Hi Wei Shen, these issues have already been fixed in the dev branch.

shenwei356 commented 2 years ago

Oh, I see, I didn't notice the branch. Now it works. Thank you.

shenwei356 commented 2 years ago

Sorry, I have to report another bug that occurred after updating the profiles: some of the Bray-Curtis distances are numbers while others are na.

At this line https://github.com/CAMI-challenge/OPAL/blob/dev/src/plots.py#L294 , (sample1, sample2) may be missing from gs_rank_to_sample12_to_braycurtis[rank]. Yes, the rank is strain.

gs_values.append(gs_rank_to_sample12_to_braycurtis[rank][(sample1, sample2)])

The existence needs to be checked first:

if (sample1, sample2) not in gs_rank_to_sample12_to_braycurtis[rank]:
    continue
gs_values.append(gs_rank_to_sample12_to_braycurtis[rank][(sample1, sample2)])

Because the gold standard does not contain strain data, I would suggest skipping the computation of all metrics at the strain rank.

fernandomeyer commented 2 years ago

Could you please provide your test case? I'm not getting an error. I appreciate your reporting this, thank you very much. @shenwei356

ScaonE commented 2 years ago

> Thank you very much for reporting this issue, Wei Shen. As you said, it occurred if a profile contained strains but the gold standard profile did not.
>
> This is now fixed in dev (9f1e161), similar to what you suggested. I just did not see a need to modify plots.py.

Hello,

I seem to encounter the same thing (using opal.py 1.0.10 too). What's the proper way to install OPAL from the dev branch?

shenwei356 commented 2 years ago

Hi @fernandomeyer, I have another question related to this issue.

The Weighted UniFrac distance ranges between 0 and twice the height of the taxonomic tree used.

How is the Weighted UniFrac distance/error affected in the cases below?

  1. The gold standard has data at the strain rank, while a tool does not report results at strain.
  2. The gold standard has no data at the strain rank, while a tool reports results at strain.

fernandomeyer commented 2 years ago

@shenwei356 The weighted UniFrac would be penalized in both cases. The taxonomic tree that is used is the union of the taxonomic trees represented in the ground truth and the sample. So if either has strains present and the other doesn’t, this mass/abundance would need to be moved up the tree in order to overlap (credits to @dkoslicki for the answer).
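As a toy illustration of that answer (hypothetical numbers, not OPAL's implementation), the penalty can be thought of as an earth mover's cost: abundance placed on a strain node absent from the other tree must be moved one edge up to the shared species node, contributing roughly mass times branch length to the distance.

```python
# Toy sketch with assumed numbers: a unit-length edge and 30% abundance
# predicted at a strain that the gold standard lacks.
branch_length = 1.0
unmatched_strain_abundance = 0.3

# Moving that mass up to the species node, where both taxonomic trees
# overlap, costs mass * distance travelled under the earth mover's view.
penalty = unmatched_strain_abundance * branch_length
print(penalty)  # 0.3
```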

shenwei356 commented 2 years ago

For the 2nd case:

> The gold standard does not have data at strain, while a tool reports results at strain.

I think the profiler should not be penalized at the strain rank :(

Counting from the lowest common ancestor may be a better option.

dkoslicki commented 2 years ago

@shenwei356 I think that case could then be interpreted as an incorrect ground truth. That is, a tool has predicted the presence of a strain that is absent from the ground truth, so this is equivalent to the tool predicting the presence of any other taxon that isn't present in the ground truth, and it is treated as such. Note that in a different branch there's a branch-length function that causes penalties to decrease the further down the ranks you go.

A simple fix would be to change the ground truth accordingly.

shenwei356 commented 2 years ago

Thanks, David. But I don't think changing the ground truth is a good approach, because we can't infer the strain(s) from a species.

At the moment, I think a reasonable compromise might be to remove the strain-level predictions.

fernandomeyer commented 2 years ago

The issue has been fixed and included in release v1.0.11.