Closed by fedarko 3 years ago
Hierarchical metadata should certainly be treated differently. This makes sense, although I think we'll need to "special case" this to columns labeled as Level *, right? We should also be very clear to the user about what's going on.
Agreed. What should help a bit with the special casing is that the Level * columns are created by EMPress' Python code (in this module), so we should be able to pass some information about these columns specifically to the JS code saying something like "Hey, this is a taxonomy metadata column, look at its ancestors". (I don't think it will be feasible to detect already-split-up taxonomy columns and try to disambiguate them; that sounds like it would be challenging and error-prone.)
One solution that should work, with the added benefit of only modifying the Python code: we could automatically go through the split-up taxonomy (produced by the code linked above) and identify all of the ambiguous values in each level (so both the s__ cases in Level 7, and also stuff like g__ in Level 6, Unspecified at any level, etc.). We should then be able to add text to these values (from the higher levels) to disambiguate them as needed: so, for example, for Level 7 this would probably involve prefixing each s__ with the respective genus label, producing things like g__Streptococcus; s__ as the metadata values where needed. This would cause distinct colors to be assigned to each taxonomy value, and would prevent ambiguous values from being treated as equal during propagation. (This would increase the metadata size somewhat, but speedups like #337 should offset this at least partially.)
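To make the idea above concrete, here is a rough sketch of the prefixing step. It is not EMPress code: the function name `disambiguate_taxonomy`, the list-of-lists input (one row per feature, highest level first), and the rule "a value is ambiguous if it appears under more than one parent" are all assumptions made for illustration.

```python
from collections import defaultdict


def disambiguate_taxonomy(rows, sep="; "):
    """Prefix ambiguous taxonomy values with their parent's label.

    Hypothetical helper, not part of EMPress. `rows` is a list of
    equal-length lists of level values (highest level first). A value is
    treated as ambiguous at a level if it occurs under more than one
    distinct (already disambiguated) parent label.
    """
    if not rows:
        return []
    num_levels = len(rows[0])
    # The top level has no ancestors, so it is copied over unchanged.
    result = [[row[0]] for row in rows]
    for level in range(1, num_levels):
        # Map each original value at this level to the set of parent
        # labels it appears under.
        parents_of = defaultdict(set)
        for row, new_row in zip(rows, result):
            parents_of[row[level]].add(new_row[level - 1])
        for row, new_row in zip(rows, result):
            value = row[level]
            if len(parents_of[value]) > 1:
                # Ambiguous: prepend the parent's label.
                new_row.append(new_row[level - 1] + sep + value)
            else:
                new_row.append(value)
    return result


# Made-up example lineages (family, genus, species only, for brevity).
example = [
    ["f__Streptococcaceae", "g__Streptococcus", "s__"],
    ["f__Lactobacillaceae", "g__Lactobacillus", "s__"],
    ["f__Streptococcaceae", "g__Streptococcus", "s__mutans"],
]
for fixed_row in disambiguate_taxonomy(example):
    print(fixed_row)
```

In this sketch the two `s__` values become `g__Streptococcus; s__` and `g__Lactobacillus; s__`, while `s__mutans` is left alone because it only occurs under one genus. Because each level uses the parent's already-disambiguated label, an empty `g__` under two different families would also carry its family prefix down into the species label.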
I brought this up in https://github.com/biocore/empress/issues/471#issuecomment-749300846, but for the sake of everyone's sanity this should probably be its own issue. This isn't a very pressing issue (the only common case it should run into is stuff like s__ for Level 7), but there are corner cases where real levels of taxa can have the same name but different ancestors. In these cases, we should assign these values distinct colors and not treat them as "equal" when doing propagation.