shenwei356 closed this issue 2 years ago
I know the gold standard has no profile data of rank `strain`. I tried removing `|Strain` from the header line `@Ranks`, but it still failed to work, while it worked after adding some profile data of rank `strain`.
I've added a few patches to make it work. I'm not sure this is the best way, so I'm pasting the changes here instead of creating a PR.
Adding after opal.py#L79:

```python
# L79: table = table.pivot_table(index=['tool', 'sample'], columns='metric', values='value')
if len(table.columns) < len(order_columns) + 2:
    continue
```
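For context, here is a minimal standalone sketch (with hypothetical data, not OPAL's actual tables) of why the guard is needed: when a metric is entirely absent from the long-format table, `pivot_table` simply produces no column for it, so the pivoted frame can have fewer columns than expected.

```python
import pandas as pd

# Hypothetical long-format metrics table in which the 'recall' metric is
# entirely absent, mimicking a profile missing a whole rank's metrics.
table = pd.DataFrame({
    'tool':   ['A', 'B'],
    'sample': ['s1', 's1'],
    'metric': ['precision', 'precision'],
    'value':  [0.9, 0.7],
})

pivoted = table.pivot_table(index=['tool', 'sample'], columns='metric', values='value')
expected = ['precision', 'recall']

# Guard in the spirit of the patch: an absent metric yields no column at all.
if len(pivoted.columns) < len(expected):
    print('skipping: missing metric columns')
```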
Adding after html_opal.py#L373:

```python
# L373: sorted_columns = get_columns(labels, mydf_metrics.columns.tolist())
if len(mydf_metrics.columns) < len(sorted_columns):
    continue
```
Adding after plots.py#L292:

```python
# L292: for (sample1, sample2), value in profile_rank_to_sample12_to_braycurtis[rank].items():
if sample1 not in gs_rank_to_sample12_to_braycurtis[rank]:
    continue
```
Thank you very much for reporting this issue, Wei Shen. As you said, it occurred if a profile contained strains but the gold standard profile did not.
This is now fixed in dev (9f1e161db8670e5089c14b7ee2eaa63104e3ff7c), similarly to what you suggested. I just did not see a need to modify plots.py.
Thank you for fixing it, looking forward to the new release!
Hi Fernando, it seems there's a newly introduced tiny bug in 9f1e161:

```
AttributeError: module 'src.utils.constants' has no attribute 'SUM_ABUNDANCES'
```

A line may need to be added in src/utils/constants.py:

```python
SUM_ABUNDANCES = 'Sum Abundance'
```
Besides, for another custom gold standard profile without the `strain` level, when testing with a profile that also lacks `strain`, an error occurred:
```
2021-12-07 22:10:56,126 INFO Loading profiles...
2021-12-07 22:10:56,127 INFO done
2021-12-07 22:10:56,127 INFO Computing metrics...
2021-12-07 22:10:56,174 INFO done
2021-12-07 22:10:56,174 INFO Saving computed metrics...
2021-12-07 22:10:56,244 INFO done
2021-12-07 22:10:56,244 INFO Creating beta diversity plots...
2021-12-07 22:10:56,244 INFO done
2021-12-07 22:10:56,244 INFO Creating rarefaction curves...
2021-12-07 22:10:56,770 INFO done
2021-12-07 22:10:56,770 INFO Creating more plots...
Traceback (most recent call last):
  File "/home/shenwei/Public/app/miniconda3/envs/cami/bin/opal.py", line 426, in <module>
    main()
  File "/home/shenwei/Public/app/miniconda3/envs/cami/bin/opal.py", line 406, in main
    plots_list += pl.plot_all(pd_metrics, labels, output_dir, args.metrics_plot_rel, args.metrics_plot_abs)
  File "/home/shenwei/Public/app/miniconda3/envs/cami/lib/python3.7/site-packages/src/plots.py", line 605, in plot_all
    plot_purity_completeness_per_tool_and_rank(pd_grouped, pd_mean, output_dir)
  File "/home/shenwei/Public/app/miniconda3/envs/cami/lib/python3.7/site-packages/src/plots.py", line 487, in plot_purity_completeness_per_tool_and_rank
    pd_mean = pd_mean.drop(c.GS, level='tool').drop(['strain', 'rank independent'], level='rank')
  File "/home/shenwei/Public/app/miniconda3/envs/cami/lib/python3.7/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/home/shenwei/Public/app/miniconda3/envs/cami/lib/python3.7/site-packages/pandas/core/frame.py", line 4913, in drop
    errors=errors,
  File "/home/shenwei/Public/app/miniconda3/envs/cami/lib/python3.7/site-packages/pandas/core/generic.py", line 4150, in drop
    obj = obj._drop_axis(labels, axis, level=level, errors=errors)
  File "/home/shenwei/Public/app/miniconda3/envs/cami/lib/python3.7/site-packages/pandas/core/generic.py", line 4183, in _drop_axis
    new_axis = axis.drop(labels, level=level, errors=errors)
  File "/home/shenwei/Public/app/miniconda3/envs/cami/lib/python3.7/site-packages/pandas/core/indexes/multi.py", line 2190, in drop
    return self._drop_from_level(codes, level, errors)
  File "/home/shenwei/Public/app/miniconda3/envs/cami/lib/python3.7/site-packages/pandas/core/indexes/multi.py", line 2242, in _drop_from_level
    raise KeyError(f"labels {not_found} not found in level")
KeyError: "labels ['strain'] not found in level"
```
Profiles: both_without_strain.tar.gz
It's easy to fix by editing src/plots.py line 487:

```python
# pd_mean = pd_mean.drop(c.GS, level='tool').drop(['strain', 'rank independent'], level='rank')
pd_mean = pd_mean.drop(c.GS, level='tool').drop(['strain', 'rank independent'], level='rank', errors='ignore')
```
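As a self-contained illustration (toy data, not OPAL's actual frame) of why `errors='ignore'` avoids the `KeyError` when the `'strain'` label is absent from the MultiIndex level:

```python
import pandas as pd

# Toy MultiIndex frame with no 'strain' entries in the 'rank' level.
idx = pd.MultiIndex.from_tuples(
    [('toolA', 'species'), ('toolA', 'genus')], names=['tool', 'rank'])
df = pd.DataFrame({'completeness': [0.90, 0.95]}, index=idx)

# Without errors='ignore', dropping an absent label raises
# KeyError: "labels ['strain'] not found in level".
dropped = df.drop(['strain'], level='rank', errors='ignore')
print(dropped.shape)  # unchanged: (2, 1)
```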
Hi Wei Shen, these issues have already been fixed in the dev branch.
Oh, I see, I didn't notice the branch. Now it works. Thank you.
Sorry, I have to report another bug that occurred after updating the profiles: some of the Bray-Curtis distances are numbers while others are `na`.
For this line https://github.com/CAMI-challenge/OPAL/blob/dev/src/plots.py#L294 , `(sample1, sample2)` may be missing in `gs_rank_to_sample12_to_braycurtis[rank]`. Yes, the rank is `strain`.
```python
gs_values.append(gs_rank_to_sample12_to_braycurtis[rank][(sample1, sample2)])
```

This needs a check for existence:

```python
if (sample1, sample2) not in gs_rank_to_sample12_to_braycurtis[rank]:
    continue
gs_values.append(gs_rank_to_sample12_to_braycurtis[rank][(sample1, sample2)])
```
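A minimal sketch (with hypothetical dictionaries, not OPAL's actual data structures) of why the membership check matters: skipping pairs absent from the gold standard keeps the two value lists aligned instead of raising a `KeyError` or producing `na` entries.

```python
# Hypothetical Bray-Curtis lookups; the gold standard lacks the (s1, s3) pair.
gs_pairs = {('s1', 's2'): 0.1}
pred_pairs = {('s1', 's2'): 0.15, ('s1', 's3'): 0.4}

gs_values, pred_values = [], []
for pair, value in pred_pairs.items():
    if pair not in gs_pairs:  # skip pairs absent from the gold standard
        continue
    pred_values.append(value)
    gs_values.append(gs_pairs[pair])

print(gs_values, pred_values)  # [0.1] [0.15]
```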
Because the gold standard does not contain data of rank `strain`, I would suggest skipping computing all metrics of rank `strain`.
Could you please provide your test case? I'm not getting an error. I appreciate your reporting this, thank you very much. @shenwei356
Hello,
I seem to encounter the same thing (using opal.py 1.0.10 too). What's the proper way to install OPAL from the dev branch?
Hi @fernandomeyer, I have another question related to this issue.
The Weighted UniFrac distance ranges between 0 and twice the height of the taxonomic tree used.
How's the Weighted UniFrac distance/error affected for cases below?
@shenwei356 The weighted UniFrac would be penalized in both cases. The taxonomic tree that is used is the union of the taxonomic trees represented in the ground truth and the sample. So if either has strains present and the other doesn’t, this mass/abundance would need to be moved up the tree in order to overlap (credits to @dkoslicki for the answer).
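As a toy illustration of this mass-moving argument (a simplified single-lineage tree with unit branch length, not OPAL's actual EMDUnifrac implementation):

```python
# Simplified single lineage: root -> species -> strain.
# Weighted UniFrac is an earth mover's distance on the tree: moving one
# unit of abundance across a branch costs (mass * branch_length).
branch_length = 1.0

# The gold standard places all abundance at species; the tool at a strain.
gs_abundance = {'species': 1.0, 'strain': 0.0}
pred_abundance = {'species': 0.0, 'strain': 1.0}

# To make the profiles overlap, the strain-level mass moves up one branch.
penalty = abs(gs_abundance['strain'] - pred_abundance['strain']) * branch_length
print(penalty)  # 1.0: a nonzero distance even though the species is correct
```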
For the 2nd case: the gold standard does not have data at rank `strain`, while a tool reports results at `strain`.
I think the profiler should not be penalized at the strain rank :(
Counting from the lowest common ancestor may be a better option.
@shenwei356 I think that case could be interpreted as an incorrect ground truth then. I.e. a tool has predicted the presence of a strain that is absent in the ground truth, so this would be equivalent to a tool predicting the presence of any other taxon that isn't present in the ground truth, and is treated as such. Note, in a different branch there's a branch length function that would cause penalties to decrease the further down the ranks you go.
A simple fix would be to change the ground truth accordingly.
Thanks, David. But I don't think changing the ground truth is a good approach, because we can't infer the strain(s) from a species.
At the moment, I think a compromise might be to remove the strain-level predictions.
Issue has been fixed and included in release v1.0.11.
Hi Fernando, thanks for this great tool!
I've used OPAL for some days and it worked smoothly, but today I got some errors with a self-made gold-standard profile for benchmarking.
Here are the profiles and outputs: test.tar.gz
Here's the log: