Open guilhermeschmitd-hotmart opened 2 years ago
Thanks so much for the very detailed proposal @guilhermeschmitd-hotmart . We will add this to the items to prioritize in the roadmap.
Thanks! I'll be on the lookout for updates to the repo, @ me anytime to further discuss this feature and possible implementations ^^
Support for taxonomies would indeed be an urgent feature, thanks for bringing it up.
This issue is stale because it has been open for 30 days with no activity.
Hello, any update on this? On some tasks it can really be a blocking point. Thanks a lot!
Hello @louisguichard ! I'm currently looking at the issue and the limitations of the Feedback Datasets to do a taxonomy classification. Do you happen to know of any public datasets with multi-level taxonomies?
@nataliaElv Here's a pretty good standard to follow https://webdatacommons.org/structureddata/2014-12/products/gs.html
Hi @nataliaElv and thanks for your feedback!
The 20 Newsgroup Dataset can be seen as a hierarchical classification problem with 7 main classes. Here's a snippet of code if you'd like to take a closer look:
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
# Fetch dataset
newsgroups = fetch_20newsgroups(subset='all')
# Create a DataFrame to hold the dataset
df = pd.DataFrame({'text': newsgroups.data, 'target': newsgroups.target})
# Map target to raw categories
df['raw_category'] = df['target'].apply(lambda x: newsgroups.target_names[x])
# Split raw categories into main and sub categories
df["main_category"] = df["raw_category"].apply(lambda cat: cat.split(".")[0])
df["sub_category"] = df["raw_category"].apply(lambda cat: "-".join(cat.split(".")[1:]))
Note that we could add several other levels of classes as well.
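To illustrate the point about additional levels, here is a minimal sketch that turns dotted category names into a nested dictionary of arbitrary depth. The helper name `build_taxonomy` and the short `sample_names` list are assumptions made for illustration (the names are a hand-picked subset of the 20 Newsgroups categories, hard-coded so the example runs without downloading the dataset).

```python
def build_taxonomy(names, sep="."):
    """Build a nested dict from separator-delimited category names.

    Hypothetical helper for illustration; each key is one level of the
    hierarchy and each value is the dict of its children.
    """
    tree = {}
    for name in names:
        node = tree
        for part in name.split(sep):
            # Create the child level on first sight, then descend into it
            node = node.setdefault(part, {})
    return tree


# Assumed sample: a few 20 Newsgroups category names
sample_names = [
    "comp.graphics",
    "comp.sys.ibm.pc.hardware",
    "comp.sys.mac.hardware",
    "rec.sport.hockey",
    "sci.space",
]
taxonomy = build_taxonomy(sample_names)
```

With the full dataset, `build_taxonomy(newsgroups.target_names)` should give the complete tree, with as many levels as the longest category name has parts.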
I feel that providing an option for hierarchical classification might not only help address this kind of problem, but also simplify the UI when the number of classes is large (you could first choose the main field and then specify).
Hope it helps!
Thanks for the links @guischmitd @louisguichard ! From looking at your references, it seems like in some cases the taxonomy is applied to text classification and other times as an entity classification task, would you say that's right? I'll investigate more on this matter. Thanks!
Hi @nataliaElv!
I guess that hierarchical labels are mostly used for text classification problems, but we could indeed imagine a NER with hierarchical entities.
I think this problem could also be seen from a UI point of view: the idea would be to be able to label data on several questions at once, with the labels of the second question depending on the answer to the first question.
Something like this then? https://github.com/argilla-io/argilla/issues/3112
Yes, I think this could be a good solution!
As long as you can choose a different question for each of the options in the previous question, with each question having its own labels.
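A rough sketch of what "the second question depends on the first answer" could look like as a data structure. This is not Argilla's API; `FOLLOW_UP` and `options_for` are hypothetical names, and the labels are only sample values.

```python
# Hypothetical mapping: answer to the first question -> labels offered
# by the follow-up question.
FOLLOW_UP = {
    "comp": ["graphics", "sys", "windows"],
    "rec": ["autos", "sport"],
    "sci": ["crypt", "space"],
}


def options_for(first_answer=None):
    """Return the labels to show for the current step.

    With no answer yet, offer the top-level labels; otherwise offer the
    follow-up labels tied to that answer (empty if there are none).
    """
    if first_answer is None:
        return sorted(FOLLOW_UP)
    return FOLLOW_UP.get(first_answer, [])
```

The UI would call `options_for()` for the first question and `options_for(answer)` once the annotator has picked a main class, so only the relevant sub-labels are displayed.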
It might also be relevant to look into merging labels from child to parent; however, this might already be covered by the bulk-labelling feature. A visible warning for this would be welcome, though.
@guilhermeschmitd-hotmart @louisguichard @guischmitd We've shared an early prototype of this question with the community here: https://rubrixworkspace.slack.com/archives/C05DHJ3LGQM/p1710932414340039 Feel free to take a look and leave any comments/thoughts you may have 😄 Thanks!
Is your feature request related to a problem? Please describe. A relatively common text classification task I encounter is hierarchical multi-label classification, like classifying scientific paper abstracts or e-commerce product descriptions into one or more nested classes. From what I've seen, rubrix still has no support for this type of labeling.
Describe the solution you'd like I'd like to request support for hierarchical multi-label text classification. One of my favorite implementations of this comes from label-studio. Ideally, the task model would allow for truncated classification and a limit of labels per document.
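As a minimal sketch of the two constraints requested here, the snippet below checks that each label is a path starting at the root of the taxonomy (stopping at an intermediate level is allowed, which is the "truncated" case) and that a document carries no more than a set number of labels. The `TAXONOMY` contents beyond "Natural sciences > Biology > Mammals", the function names, and the default `max_labels=2` are all assumptions for illustration, not part of rubrix or label-studio.

```python
# Assumed toy taxonomy; "Birds" and "Physics" are made-up siblings.
TAXONOMY = {
    "Natural sciences": {
        "Biology": {"Mammals": {}, "Birds": {}},
        "Physics": {},
    },
}


def is_valid_path(path, taxonomy=TAXONOMY):
    """True if `path` walks from the root through existing nodes.

    The path may stop at any level (truncated classification), but it
    must contain at least one level.
    """
    node = taxonomy
    for part in path:
        if part not in node:
            return False
        node = node[part]
    return len(path) > 0


def validate_labels(paths, max_labels=2, taxonomy=TAXONOMY):
    """True if every path is valid and the per-document limit is met."""
    if len(paths) > max_labels:
        return False
    return all(is_valid_path(p, taxonomy) for p in paths)
```

For example, `("Natural sciences", "Biology")` passes as a truncated label, while `("Natural sciences", "Mammals")` is rejected because it skips a level.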
Describe alternatives you've considered One workaround would be "flattening" the available labels by concatenating label names and treating each combination independently, but this approach gets out of hand fast as the number of classes increases. It also loses information, since the parent-class context of each child class is discarded. Another problem with flattening is related to "truncated" classes: for example, if both
Natural sciences > Biology > Mammals
and
Natural sciences > Biology
are valid classes, the number of flattened classes increases a lot, cluttering the interface.
Another workaround would be listing all level 1, 2 ... n classes available and trusting the annotator to input only valid combinations, but that again is a less-than-optimal workflow and might result in ambiguous or invalid labels.
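To make the growth of the flattened label set concrete, here is a small sketch that enumerates every flattened label when truncated classes are allowed. The toy taxonomy (apart from "Natural sciences > Biology > Mammals") and the helper name `flatten` are assumptions for illustration.

```python
def flatten(taxonomy, prefix=(), sep=" > "):
    """List one flattened label per node, intermediate levels included."""
    labels = []
    for name, children in taxonomy.items():
        path = prefix + (name,)
        # Every node is a selectable label when truncation is allowed
        labels.append(sep.join(path))
        labels.extend(flatten(children, path, sep))
    return labels


# Assumed toy taxonomy; "Birds" and "Physics" are made-up siblings.
toy_taxonomy = {
    "Natural sciences": {
        "Biology": {"Mammals": {}, "Birds": {}},
        "Physics": {},
    },
}
flat_labels = flatten(toy_taxonomy)
```

Here three leaf classes already become five flat labels, because each intermediate node has to be listed as a separate option.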
Additional context Relevant references using this labeling task model: