ecotaxa / ecotaxa_front

Front end of the EcoTaxa application

Homogenise taxonomic sublabels such as male, female, part, etc. #456

Open jiho opened 4 years ago

jiho commented 4 years ago

We have plenty of such cases: male, female, juvenile, with eggs, part, etc. We currently create a child of the parent taxon with a name that tries to follow conventions

\_ Copepoda
    \_ ovigerous

to label Copepods with eggs

\_ Crustacea
    \_ part

to label broken bits of Crustaceans etc.

This works well because part and ovigerous are true taxa (with an ID, a parent, etc.) so they can be assigned like any other taxon (no change in UI and backend functions), can be aggregated together with the parent for scientific analysis when it makes sense, and the display convention for taxonomic name (add the parent when the name is duplicated) means that they actually appear as ovigerous<Copepoda, and part<Crustacea which makes sense.
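The display convention mentioned above (append the parent when a name is duplicated) could be sketched roughly as follows. This is an illustration only, not EcoTaxa's actual code: the function name and the `(name, parent_name)` input format are assumptions.

```python
from collections import Counter


def display_names(taxa):
    """Return display strings for a list of (name, parent_name) pairs.

    A name shared by several taxa is shown as "name<parent";
    a name that is unique in the list is shown as is.
    """
    counts = Counter(name for name, _ in taxa)
    return [
        f"{name}<{parent}" if counts[name] > 1 and parent else name
        for name, parent in taxa
    ]
```

With two `ovigerous` entries under different parents, each would be rendered with its parent, while a name that occurs once stays bare.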

So I like this solution.

What I don't like is that the consistency of the naming is left to the users creating taxa. Therefore, instead of implementing a tag system separate from the taxonomy (which would make tags a pain to assign, combine, aggregate, etc.), I suggest making a function that makes it easy to create a child category whose name is picked from a predefined list (the ones above and a few others).

Now the discussion is where to implement this. My guess is that, in the taxo creation modal there should be an obvious button/list next to "Name" that says "Predefined" and shows the list of common sublabels and then a text area called "Custom" which allows to type a name. When a predefined one is selected, the type is made "Morpho".
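To make the suggestion concrete, here is a minimal sketch of the "Predefined" versus "Custom" logic. Everything in it is hypothetical: the `Taxon` class, the `create_child` function and the contents of the list are made up for illustration. The `"M"` code for morphological taxa is assumed from the `taxotype == "M"` filter used in the R snippet further down this thread, and `"P"` for phylogenetic taxa is a guess.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical predefined list; only names mentioned in this issue are included.
PREDEFINED_SUBLABELS = ("female", "juvenile", "male", "ovigerous", "part")


@dataclass
class Taxon:
    name: str
    parent: Optional["Taxon"] = None
    taxotype: str = "P"  # assumed codes: "P" = phylo, "M" = morpho


def create_child(parent: Taxon, name: str, predefined: bool = True) -> Taxon:
    """Create a child category; predefined names are forced to type Morpho."""
    name = name.strip()
    if predefined:
        if name not in PREDEFINED_SUBLABELS:
            raise ValueError(f"{name!r} is not a predefined sublabel")
        return Taxon(name=name, parent=parent, taxotype="M")
    # "Custom" path: free text, type left at the default.
    return Taxon(name=name, parent=parent)
```

Rejecting anything outside the list on the predefined path is what would give the naming consistency; the custom path keeps the current free-text behaviour.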

rkiko commented 4 years ago

Need to think more about this, I guess. Would you really want a category "copepod w eggsack full lipid sack full gut" and how do you enforce the proper order of this?

moi90 commented 4 years ago

It is certainly possible to do it this way. However, this will flood the tree with categories that are actually attributes because you will need these attribute categories in many levels of the tree, and it will be hard to achieve consistency - even if you provide predefined attributes. Finally, it gets even worse with multiple attributes (like @rkiko said).

In the long term (maybe EcoTaxa 3.0;)), it might be better to transition to an annotation scheme with a primary, phylogenetic, identification and secondary attributes. In the UI, these attributes could still be displayed as (virtual) children of a taxonomic node or maybe there is still a better way.

Attributes separated from identifications would also solve the problem of detritus classification. There would be no need to decide whether it should be "light/compact" or "compact/light", there would be just fiber, feces, etc with possible attributes.

Lastly, attributes solve the problem of validated-ness: If an object is in "copepoda" but not in "copepoda/ovigerous", does this mean that it is without eggs or is it just that no one bothered to put it into the more specific category? With attributes, you could make this explicit, an object would be related to an attribute in three ways: not annotated, positive (with eggs), negative (explicitly without eggs).
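The three-way relation could be modelled along these lines. This is only a sketch; the `AttributeState` enum and the dict-based annotation format are assumptions, not part of any existing EcoTaxa data model.

```python
from enum import Enum


class AttributeState(Enum):
    NOT_ANNOTATED = "not annotated"
    POSITIVE = "positive"
    NEGATIVE = "negative"


def state_of(annotations: dict, attribute: str) -> AttributeState:
    """Look up an attribute in a mapping of attribute name -> True/False.

    A missing key means nobody checked the attribute, which is kept
    distinct from an explicit False.
    """
    if attribute not in annotations:
        return AttributeState.NOT_ANNOTATED
    if annotations[attribute]:
        return AttributeState.POSITIVE
    return AttributeState.NEGATIVE
```

The point of the design is that absence of a key and an explicit `False` never collapse into the same state.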

grololo06 commented 4 years ago

Attributes are clearly the way to go. From a semantic point of view, "Oithonidae > Oithona" relation is not the same as "Oithona > female+eggs". It's a hack of the concept. As the proverb says "When all you have is a hammer, everything looks like a nail" :)

jiho commented 4 years ago

I understand and agree that attributes are a more elegant design. That said...

So, for all intents and purposes, attributes would need to be nested within taxa and work like a separate taxon, from the point of view of the user.

But we could argue that we could still code them differently. Why would we want to do so?

So, overall, coding these things as attributes would complexify the code quite a lot (in every place where something is done per taxon, it needs to be done per taxon and combination of attributes); from the point of view of the user, they will need to be made to appear as separate taxa (so there is actually more work there); in some places we may need to have them encoded in some sort of hierarchy (and we would need a separate system for that, while we already have a taxonomy).

On the other hand, the cost of implementing them as sub-taxa is more entries in the taxonomy; then everything else works. But this will constitute a few thousand, maybe tens of thousands of entries, which is negligible compared to the size of the tree of life.

I am usually the one arguing for making things right rather than easy. But here I would vote for practicality. In addition, it is not completely wrong either: the hierarchical taxonomy is a centrepiece of EcoTaxa (it is right in the name), so making things fit within that taxonomy is meaningful.

PS: I could even go on about the fact that the phylogenetic divisions into family, genus, species etc. are somewhat arbitrary and that morphological divisions sometimes make as much sense as the phylogenetic ones, that "species" is an overrated concept etc. but this reply is long enough as it is.

rkiko commented 4 years ago

I think you are mixing two things: how to aggregate the data and how to do the assignment of the attributes. I think most of what you write clearly is in favour of attributes, although it will become more and more difficult to switch from the current way of how things are done to an attribute assignment approach, as datasets grow ...

Data aggregation should be no problem, one just has to code it. But if you want to know the ratio of copepod females with eggs vs. without eggs, with attributes you will just ask for data that has these attributes assigned (female, not female; w. eggs, wo eggs, cannot tell; assuming that within the dataset you are querying these have been assigned consistently). No attribute means no data. Currently, you have to be lucky that the user has written "female" and "eggs" correctly...

Also, different users are interested in different attributes. This leads to problems when merging data... Even worse, how to merge datasets in which people have noted gut fullness and then egg carrying, vs other way round??? copepod w full gut and egg sack vs copepod w egg sack and full gut??? Total nightmare ahead now that we are starting to ask these questions ... Even with the UVP data we do not have consistent sorting across projects. It will just get worse without attributes ...

"We would need a UI to assign attributes in the classif page. Having a separate process to assign attributes and taxa will be a pain for users."

I don't think so. I think it will be easier to sort if you do not have to worry about the attributes in the first round of sorting. I think that a user will assign attributes if they are important for his/her task or question. So, it is a secondary step after identifying the general class of the organism.

E.g. egg-carrying copepods. You would then want to enter an 'attribute-assignment stage' where you are shown all copepods from all subclasses. You could then first search for all females, mark them and assign the corresponding label by hitting a button. The others get the 'no-female label', as you have checked them. Second step is to find the egg-carrying females. Hit another button to assign w. eggs, all others get wo eggs, done. This could be done on the same ecotaxa page, but I think designing an own stage might be simpler...

Another comment: I think attributes are a quality of an object, above you write that "part" could be an attribute. I use 'part' to signify that this is not the complete organism, e.g. a bitten off tail or a lost antenna. This is not an attribute. It is not the complete organism and therefore it is a class.

Also too long for a Sunday ;)

Just read another thing that is, in my opinion, wrong in your concept:

But then, if one just sums per sample, the result is wrong because some objects are counted twice or thrice, which is very bad(™). For numbers to be summable however we cut the data (which we absolutely need), we need to make mutually exclusive categories like this

Copepoda = number of copepods with no attributes
Copepoda+with eggs = all copepods with only the with eggs attribute
Copepoda+with lipid sac = all copepods with only the with lipid sac attribute
Copepoda+with eggs+with lipid sac = copepods with both attributes
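The mutually exclusive categories quoted above amount to counting each object exactly once under its exact combination of attributes. A rough sketch (the function name and the input format are assumptions made for illustration):

```python
from collections import Counter


def count_exclusive(objects):
    """Count objects per (taxon, exact set of attributes).

    objects: iterable of (taxon, iterable_of_attribute_names).
    Each object falls into exactly one bucket, so the bucket counts
    always sum to the number of objects.
    """
    return Counter((taxon, frozenset(attrs)) for taxon, attrs in objects)
```

Because the key is the whole set, `Copepoda+with eggs` and `Copepoda+with eggs+with lipid sac` are separate buckets and nothing is counted twice.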

You would define which attributes you want to use for your study before aggregation, not try to aggregate everything. And you should get an output that shows also the objects where this attribute was not assigned. In your case and if you are interested in eggs you have three types of data:

no egg attribute assigned
egg attribute positive
egg attribute negative

That someone assigned the lipid sac attribute does not mean that he/she checked the egg attribute ... Clearly, to do quantitative work on these attributes, they have to be assigned consistently. That can be checked with the attribute assignment scheme/stage proposed above. Currently, you can only assume that someone has identified all the egg-bearing copepods if some classes with eggs are there. But you can not be sure and you have no record in the data (maybe in a protocol), as you do not have the class "copepod + without eggs". Nobody generates or sorts into this class. Mostly, the classes where we provide attributes are now generated out of curiosity. And would you want classes copepod + without eggs + wo lipid sack + wo gut visible???? You are saying that the "base class" is this, a copepod for which the mentioned attributes were checked but are negative. This is another very important reason to have attributes ...
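As a sketch of what such an aggregation could look like for one chosen attribute, with the "not assigned" objects reported rather than silently merged into the negatives (the function name, the dict-based input and the output keys are all assumptions):

```python
def summarise_attribute(objects, attribute="with eggs"):
    """Summarise one attribute over a list of per-object annotation dicts.

    Each dict maps attribute name -> True/False; a missing key means
    the attribute was never checked for that object.
    """
    positive = negative = not_assigned = 0
    for annotations in objects:
        value = annotations.get(attribute)  # None = never checked
        if value is None:
            not_assigned += 1
        elif value:
            positive += 1
        else:
            negative += 1
    checked = positive + negative
    return {
        "positive": positive,
        "negative": negative,
        "not_assigned": not_assigned,
        # The ratio only uses objects for which the attribute was checked.
        "ratio_positive": positive / checked if checked else None,
    }
```

An object annotated only for the lipid sac ends up under `not_assigned` for eggs, which is exactly the case described above.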

Cheers, I'll go swimming ;)

rkiko commented 4 years ago

Ah, I am also not sure if I would have larval stages as attributes or classes. I would say where they are clearly distinguishable (nauplii vs. copepodites) classes, and attributes where not, but you could also make the case that sorting into different naupliar stages should in principle be possible, so they should be classes. I guess we need to define this clearly. Maybe, if something has a hierarchical, phylogenetic meaning down to the species level it could be a class, if not, it needs to be an attribute. Examples

Calanus hyperboreus N1 nauplius could be a class: crustacea/copepoda/.../C.hyperboreus/nauplius/n1

against

crustacea/copepoda/nauplius

If we now here have an n1, it should maybe rather be an attribute ...? Difficult...

So, what about males vs. females? In many cases we can not decide, so I would argue for an attribute.

I guess we need clear rules when a class generation is allowed and when an attribute needs to be used... But if something has a full gut or not is not a taxonomic characteristic, it is clearly an attribute. Although I just ate, so I feel more human ;)

Cheers, Rainer.

picheral commented 4 years ago

"I am usually the one arguing for making things right rather than easy. But here I would vote for practicality. In addition, it is not completely wrong either: the hierarchical taxonomy is a centrepiece of EcoTaxa (it is right in the name), so making things fit within that taxonomy is meaningful."

This is also my opinion but I would also add that as project managers, we have to make choices for the usage of the always limited resources we have. My suggestion for better homogenisation and limited development cost would be to facilitate the standardisation of "taxon like" attributes by suggesting pre-defined names in the EcoTaxo interface.

rkiko commented 4 years ago

"I am usually the one arguing for making things right rather than easy. But here I would vote for practicality. In addition, it is not completely wrong either: the hierarchical taxonomy is a centrepiece of EcoTaxa (it is right in the name), so making things fit within that taxonomy is meaningful."

A full gut is not a taxonomic feature, same for carrying eggs. So we can agree that categories with such names should not be allowed, if the "taxa" in the name is the decisive thing. What you suggest here is like trying to fit circles into squares, because they are both blue. Other way to go is to think what is needed and build a tool for it.

And we do not have limited resources, we have a lot of time to do this. We just have to say now that this is the way we want to go and then get the money to get it done.

Naming rules in ecotaxa are currently not helpful to set a framework for attributes. Creating rules to do something that could be done better with attributes is not forward-looking. We should make things right, although it might take one or two years ...

rkiko commented 4 years ago

And there are already names that we can not bring together anymore (from the ecotaxa taxonomy):

Paraeuchaeta>female+eggs+ectoparasites
living>other>egg
living>other>egg>like
living>Eukaryota>Opisthokonta>Holozoa>Metazoa>Arthropoda>Crustacea>Maxillopoda>Copepoda>Calanoida>Acartiidae>Acartia>Acartia sinjiensis>egg
living>Eukaryota>Opisthokonta>Holozoa>Metazoa>Chordata>Craniata>Vertebrata>Gnathostomata>Actinopterygii>egg
living>Eukaryota>Opisthokonta>Holozoa>Metazoa>Mollusca>Gastropoda>Heterobranchia>Euthyneura>Euopisthobranchia>Thecosomata>Cavoliniidae>Cavolinia>Cavolinia inflexa>egg
living>Eukaryota>Opisthokonta>Holozoa>Metazoa>Mollusca>egg
living>Eukaryota>Opisthokonta>Holozoa>Metazoa>Chordata>Craniata>Vertebrata>Gnathostomata>Actinopterygii>Clupeiformes temp>Engraulidae temp>egg 1 temp
living>Eukaryota>Opisthokonta>Holozoa>Metazoa>Chordata>Craniata>Vertebrata>Gnathostomata>Actinopterygii>Clupeiformes temp>Engraulidae temp>egg 2 3 temp
living>Eukaryota>Opisthokonta>Holozoa>Metazoa>Chordata>Craniata>Vertebrata>Gnathostomata>Actinopterygii>Clupeiformes temp>Engraulidae temp>egg 4 6 temp
living>Eukaryota>Opisthokonta>Holozoa>Metazoa>Chordata>Craniata>Vertebrata>Gnathostomata>Actinopterygii>Clupeiformes temp>Engraulidae temp>egg 7 8 temp
living>Eukaryota>Opisthokonta>Holozoa>Metazoa>Chordata>Craniata>Vertebrata>Gnathostomata>Actinopterygii>Clupeiformes temp>Engraulidae temp>egg 9 11 temp
living>Eukaryota>Opisthokonta>Holozoa>Metazoa>Chordata>Craniata>Vertebrata>Gnathostomata>Actinopterygii>Clupeiformes temp>Engraulidae temp>egg unkn temp
living>Eukaryota>Opisthokonta>Holozoa>Metazoa>Chordata>Craniata>Vertebrata>Gnathostomata>Actinopterygii>Clupeiformes temp>Clupeidae temp>Sardina temp>egg 1 temp
living>Eukaryota>Opisthokonta>Holozoa>Metazoa>Chordata>Craniata>Vertebrata>Gnathostomata>Actinopterygii>Clupeiformes temp>Clupeidae temp>Sardina temp>egg 2 3 temp
living>Eukaryota>Opisthokonta>Holozoa>Metazoa>Chordata>Craniata>Vertebrata>Gnathostomata>Actinopterygii>Clupeiformes temp>Clupeidae temp>Sardina temp>egg 4 6 temp
living>Eukaryota>Opisthokonta>Holozoa>Metazoa>Chordata>Craniata>Vertebrata>Gnathostomata>Actinopterygii>Clupeiformes temp>Clupeidae temp>Sardina temp>egg 7 8 temp
living>Eukaryota>Opisthokonta>Holozoa>Metazoa>Chordata>Craniata>Vertebrata>Gnathostomata>Actinopterygii>Clupeiformes temp>Clupeidae temp>Sardina temp>egg 9 11 temp
living>Eukaryota>Opisthokonta>Holozoa>Metazoa>Chordata>Craniata>Vertebrata>Gnathostomata>Actinopterygii>Clupeiformes temp>Clupeidae temp>Sardina temp>egg unkn temp
living>Eukaryota>Opisthokonta>Holozoa>Metazoa>Arthropoda>Crustacea>Maxillopoda>Copepoda>Poecilostomatoida>Oncaeidae>Oncaea>female/eggs
living>Eukaryota>Opisthokonta>Holozoa>Metazoa>Arthropoda>Crustacea>Maxillopoda>Copepoda>Calanoida>Clausocalanidae>Pseudocalanus>female/eggs
living>Eukaryota>Opisthokonta>Holozoa>Metazoa>Chordata>Craniata>Vertebrata>Gnathostomata>Actinopterygii>egg>empty
living>Eukaryota>Opisthokonta>Holozoa>Metazoa>Chordata>Craniata>Vertebrata>Gnathostomata>Actinopterygii>egg>small
living>Eukaryota>Opisthokonta>Holozoa>Metazoa>Chordata>Craniata>Vertebrata>Gnathostomata>Actinopterygii>egg>medium
living>Eukaryota>Opisthokonta>Holozoa>Metazoa>Arthropoda>Crustacea>Maxillopoda>Copepoda>Cyclopoida>Oithonidae>Oithona>Oithona similis>female+eggs
living>Eukaryota>Opisthokonta>Holozoa>Metazoa>Arthropoda>Crustacea>Maxillopoda>Copepoda>Cyclopoida>Oithonidae>Oithona>Oithona atlantica>female+eggs
living>Eukaryota>Opisthokonta>Holozoa>Metazoa>Arthropoda>Crustacea>Maxillopoda>Copepoda>Cyclopoida>Oithonidae>Oithona>female+eggs
living>other>egg>egg sac
living>Eukaryota>Opisthokonta>Holozoa>Metazoa>Arthropoda>Crustacea>Maxillopoda>Copepoda>Calanoida>Euchaetidae>Paraeuchaeta>female+eggs
living>Eukaryota>Opisthokonta>Holozoa>Metazoa>Arthropoda>Crustacea>Malacostraca>Eumalacostraca>Amphipoda>Senticaudata>Calliopiidae>Apherusa>Apherusa glacialis>female+eggs
living>Eukaryota>Opisthokonta>Holozoa>Metazoa>Arthropoda>Crustacea>Maxillopoda>Copepoda>Calanoida>Euchaetidae>Paraeuchaeta>female+eggs+ectoparasites
living>Eukaryota>Opisthokonta>Holozoa>Metazoa>Arthropoda>Crustacea>Maxillopoda>Copepoda>Calanoida>Euchaetidae>Paraeuchaeta>Paraeuchaeta glacialis>female+eggs+ectoparasites
living>Eukaryota>Opisthokonta>Holozoa>Metazoa>Arthropoda>Crustacea>Maxillopoda>Copepoda>Calanoida>with-eggs
living>Eukaryota>Opisthokonta>Holozoa>Metazoa>Arthropoda>Crustacea>Maxillopoda>Copepoda>Cyclopoida>Oithonidae>Oithona>with-eggs
living>Eukaryota>Opisthokonta>Holozoa>Metazoa>Arthropoda>Crustacea>Maxillopoda>Copepoda>Cyclopoida>Oithonidae>Oithona>with-eggs-lateral
living>Eukaryota>Opisthokonta>Holozoa>Metazoa>Arthropoda>Crustacea>Maxillopoda>Copepoda>Poecilostomatoida>with-eggs
living>other>egg>circular egg
living>Eukaryota>Opisthokonta>Holozoa>Metazoa>Chordata>Craniata>Vertebrata>Gnathostomata>Actinopterygii>Teleostei>egg
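
One way to see how far these names have drifted apart would be to group the paths by the set of tags found in their last element. The sketch below is only illustrative: the splitting rule and the one-entry synonym table are assumptions, and a real mapping would need review by taxonomists.

```python
import re
from collections import defaultdict

# Hypothetical synonym table; a real one would need expert review.
SYNONYMS = {"with-eggs": "eggs"}


def tag_set(leaf: str) -> frozenset:
    """Split a leaf label such as "female+eggs" or "female/eggs" into tags."""
    parts = (p.strip() for p in re.split(r"[+/]", leaf.lower()))
    return frozenset(SYNONYMS.get(p, p) for p in parts if p)


def group_by_tags(paths):
    """Group full category paths ("a>b>leaf") by the tag set of their leaf."""
    groups = defaultdict(list)
    for path in paths:
        leaf = path.split(">")[-1]
        groups[tag_set(leaf)].append(path)
    return dict(groups)
```

With this rule, `female+eggs` and `female/eggs` land in the same group regardless of separator, while `egg` (an egg on its own) stays separate from `with-eggs` (an organism carrying eggs).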

grololo06 commented 4 years ago

I think 'we' should start with a listing of all these "things we can see on the pictures which complement the species information", with their possible combinations. I mean, for those of us who do not have a bio background :)

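jiho commented 3 years ago

library(tidyverse)  # dplyr verbs and readr::write_csv()
# `db` is assumed to be an open connection to the EcoTaxa database;
# taxo_name() and lineage() are helper functions that are not shown here

# extract taxo
taxo <- tbl(db, "taxonomy") %>% select(id, parent_id, name, taxotype, nbrobj) %>% collect()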
# extract used, morphological taxa
taxo %>%
  filter(nbrobj>0, taxotype == "M") %>%
  select(-taxotype) %>%
  # add the parent and lineage to give context
  mutate(
    unique_name=taxo_name(id, taxo, unique=TRUE),
    lineage=lineage(id, taxo)
  ) %>%
  relocate(nbrobj, .after="parent_id") %>%
  arrange(name) %>%
  write_csv("morpho_taxa.csv.gz")

morpho_taxa.csv.gz

jiho commented 3 years ago

And a cleaned up version in which I try to categorise what the morphological taxa designate https://docs.google.com/spreadsheets/d/1knrjgyQyHeFnGt9B5gGcRe8cr8XrKkrQ71KoXR8D1jM/edit#gid=1782421282

Summary: over ~550 morpho categories:

NB: the total is more than 550 because some names designate two things at once.

Only the last 3 could be defined as attributes; and for life stages, many do not make sense for taxa outside of a certain phylum (e.g. veliger larvae are only in Molluscs).

jiho commented 3 years ago

So overall, beyond doing a bit of cleanup, which we should do, I really don't see the point in setting up an overly complicated system for such a small number of things, in particular when an alternative solution (help with the selection of standard names from a list) offers most of the benefits.

rkiko commented 3 years ago

I have to say that I completely disagree with your take on this. For me this is actually a frustrating chaos that only will get worse. You will just create a huge overhead if you create all the standard names for this:

calanus / female+eggs+ectoparasites

which for a full proper sorting requires:

calanus / female + no eggs + no ectoparasites
calanus / female + egg + no ectoparasites
calanus / male + no eggs + no ectoparasites (although male != eggs anyhow)

and so on ...

Assuming that

calanus / female is equal to calanus / female + no eggs + no ectoparasites

is just bad practice.

This one: """ ~150 describe the shape of the object; but it is often difficult to know if this is a mere precision on an existing category, to help train classif models (e.g. Appendicularia > s-shaped vs Appendicularia > straight) or if the shape defines the category itself (e.g. Rhizaria > dark with some spikes = has a well known shape, should have a phylogenetic name but I cannot find it for sure) """

is another reason why attributes are needed!!! Attributes will help to clarify if it is a distinct category or an attribute!!!

It might be good to have standard names now. But in the long run, this is no solution, because you can not aggregate the data in a meaningful way across projects. Some will use standard names, others not. To get to standard names now, you will need to force all users of EcoTaxa to review their categories and agree on standard names. That will be a crazy endeavour...

It would be good to investigate how attribute tags could be introduced (complexity of the task, how much time/money it will take), instead of dumping the idea.

jiho commented 2 years ago

Reviving an old thread just to make a few notes:

One aspect not mentioned above is how DarwinCore treats these. Some of them are indeed attributes of a taxon (e.g. sex, developmental stage, etc.), which would push in favour of attributes (although very few exist in DwC). But they are counted as separate occurrences (e.g. C hyperboreus without attribute is a separate occurrence from C hyperboreus juvenile) which, in EcoTaxa parlance, means they are categories. So it pushes towards the implementation of some attributes as attributes in the database but for their presentation as separate categories in the interface and the export.
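A rough sketch of that export idea: attributes stored separately, but each distinct combination emitted as its own occurrence record. The Darwin Core term names used as keys (`scientificName`, `sex`, `lifeStage`, `individualCount`) exist in the standard; the function and its input format are hypothetical and this is not EcoTaxa's actual export code.

```python
def to_occurrences(counts):
    """Turn per-combination counts into Darwin Core-style occurrence records.

    counts: dict mapping (scientific_name, sex, life_stage) -> number of
    objects, where sex and life_stage may be None when not annotated.
    """
    records = []
    # Sort with None replaced by "" so that keys remain comparable.
    for key in sorted(counts, key=lambda k: tuple(x or "" for x in k)):
        name, sex, stage = key
        record = {"scientificName": name, "individualCount": counts[key]}
        if sex is not None:
            record["sex"] = sex
        if stage is not None:
            record["lifeStage"] = stage
        records.append(record)
    return records
```

The taxon without attributes and the same taxon with `lifeStage` set come out as two separate records, matching the "separate occurrences" reading above.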

jiho commented 2 years ago

If we ever do it then I'd say we need namespaces as prefixes (shape:elongated, colour:dark, sex:male, etc.) and then attributes and combinations of attributes would show as subcategories, in alphabetical order, e.g.

Copepodus schroderus
Copepodus schroderus [repro:non-ovigerous] [sex:female] [stage:adult]
Copepodus schroderus [repro:ovigerous] [sex:female] [stage:adult] [view:lateral]
Copepodus schroderus [repro:ovigerous] [sex:female] [stage:adult] [view:frontal]
Copepodus schroderus [sex:male] [stage:adult]
Copepodus schroderus [sex:female] [stage:adult]
Copepodus schroderus [sex:female] [stage:adult]
Copepodus schroderus [stage:juvenile] [view:lateral]
Copepodus schroderus [view:frontal]

Each shows only the objects that have the combination of all elements. However, one can already see from that example above that it may cause issues in terms of UI; and it still does not guarantee that all attributes are filled for all objects (which I am not sure we can guarantee). Food for thought...
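For illustration, the bracketed display and the "combination of all elements" filter could look like the sketch below. The function names and the dict-of-namespaces format are assumptions; "has the combination" is read here as "carries at least these tags", and an exact-match reading would instead compare the two dicts with `==`.

```python
def format_category(taxon: str, tags: dict) -> str:
    """Render a taxon plus namespaced tags, e.g. "X [sex:female] [stage:adult]".

    tags maps namespace -> value; namespaces are listed alphabetically.
    """
    parts = [f"[{ns}:{value}]" for ns, value in sorted(tags.items())]
    return " ".join([taxon, *parts])


def matches(object_tags: dict, wanted: dict) -> bool:
    """True if the object carries every wanted namespace:value pair."""
    return all(object_tags.get(ns) == value for ns, value in wanted.items())
```

Using one value per namespace also rules out contradictory combinations such as `sex:male` together with `sex:female` on the same object.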

moi90 commented 2 years ago

I 100% agree with the namespacing. (This is also what I have in my proof-of-concept.)

I also agree that for the classifier, an "object description" (taxon with tags) should be a unique category (for the time being). (Predicting all these attributes is a much bigger undertaking than merely supporting such annotations.)

But from the UI perspective, it should be possible to increase the taxonomic resolution (move into a sub-taxon) and add or remove tags without moving an object to a totally different place in the tree (as it is currently the case: "Copepoda/female" is not a parent of "Calanoida/female"). Yes, will be difficult to design. But we might find a satisfactory solution if we play with prototypes.
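A tiny prototype of that idea, where refining the taxon and editing tags are independent operations. This is not my proof-of-concept, just a hypothetical sketch; the class and method names are made up.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ObjectDescription:
    lineage: tuple                 # e.g. ("Copepoda", "Calanoida")
    tags: frozenset = frozenset()  # e.g. {"sex:female"}

    def refine(self, child: str) -> "ObjectDescription":
        """Move into a sub-taxon while keeping all tags."""
        return replace(self, lineage=self.lineage + (child,))

    def with_tag(self, tag: str) -> "ObjectDescription":
        return replace(self, tags=self.tags | {tag})

    def without_tag(self, tag: str) -> "ObjectDescription":
        return replace(self, tags=self.tags - {tag})
```

Going from Copepoda to Calanoida then leaves `sex:female` untouched, which is what the current category tree cannot express.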

(Also, +1 for the great Copepod species name!)

moi90 commented 2 years ago

Until tags become native to EcoTaxa (which may be never), we need to come back to the question how to represent tag-like data using categories (i.e. "Homogenise taxonomic sublabels such as male, female, part, etc."), although this will be "a frustrating chaos that only will get worse" (~RK).

I think the taxonomic hierarchy should be reserved to taxon (phylo), category (not phylo but still a distinct category) and morphology (a distinct shape within a category), and should not extend to other aspects. E.g. Copepoda > Copepodus schroderus > repro:ovigerous+sex:female+stage:adult+view:lateral, not Copepoda > Copepodus schroderus > repro:ovigerous > sex:female > stage:adult > view:lateral (> establishing a parent-child-relationship and + being just a character) (This is in order to limit the depth of the hierarchy for UI reasons.)

Then, we still have the problem of order: repro:ovigerous+sex:female is conceptually the same as sex:female+repro:ovigerous, but a different string. I would just always sort them alphabetically.
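A minimal sketch of that canonical ordering (the function names are hypothetical): tags are de-duplicated, sorted alphabetically and joined with `+`, so both input orders give the same string.

```python
def canonical_label(tags) -> str:
    """Join namespaced tags with "+", de-duplicated and sorted alphabetically."""
    return "+".join(sorted(set(tags)))


def parse_label(label: str) -> list:
    """Split a "+"-joined label back into its tags."""
    return label.split("+") if label else []
```

This relies on `+` never appearing inside a tag, which would have to be one of the naming rules.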

(This approach would also leave the option to later implement tags in the database but keep the UI basically the same.)

Will there be a chapter about category naming in the taxonomy guide that @picheral was talking about? It would be useful to lay down some rules there.

moi90 commented 1 year ago

One further comment on the view tag: Instead of specifying the depicted side of the organism (lateral/dorsal/ventral/frontal), we should maybe specify the orientation of the image plane relative to the organism (median/frontal/transverse):

[Image: planes]

This would be more appropriate when dorsal/ventral or anterior/posterior is ambiguous because the organism is transparent and one can, in fact, only tell the plane but not the view. (frontal does not even fit into the line of lateral/dorsal/ventral, it should be anterior/posterior instead.)

moi90 commented 1 year ago

As an intermediate step, before tags may eventually be implemented (JO: "I have no idea when, except not soon."), I propose to go forward with the status quo (using categories like Copepoda>female+with-eggs+lateral), except that we start storing structured data in the category description. This would allow external tools to work with tag data and translate that back and forth between a tag-aware format and EcoTaxa's tag-agnostic categories. This is my proposal in detail:

This system provides a structured way to enrich the current EcoTaxa categories with tag-based data. If EcoTaxa adopts tags in the future, there is a clear transition path from this form to a more native solution. In the meantime, external tools can work with tag data and still have the EcoTaxa database as the single source of truth while providing advanced features like querying by individual properties.
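To give an idea of what an external tool could do with structured data in the description, here is a purely hypothetical sketch. The JSON layout (`{"tags": [...]}`) and the function names are invented for illustration and are not the format proposed above.

```python
import json


def describe(tags) -> str:
    """Serialise tags into a category description (hypothetical JSON layout)."""
    return json.dumps({"tags": sorted(tags)})


def tags_from_description(description) -> list:
    """Read tags back; descriptions that are not in this layout yield []."""
    try:
        data = json.loads(description)
    except (TypeError, ValueError):
        # Free-text or missing description: treat the category as untagged.
        return []
    if isinstance(data, dict) and isinstance(data.get("tags"), list):
        return data["tags"]
    return []
```

Falling back to an empty list means existing free-text descriptions keep working while tagged categories are introduced gradually.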