argilla-io / argilla

Argilla is a collaboration tool for AI engineers and domain experts to build high-quality datasets
https://docs.argilla.io
Apache License 2.0

Support for hierarchical multilabel text classification (taxonomy) #1607

Open guilhermeschmitd-hotmart opened 2 years ago

guilhermeschmitd-hotmart commented 2 years ago

Is your feature request related to a problem? Please describe. A relatively common text classification task I encounter is hierarchical multi-label classification, such as classifying scientific paper abstracts or e-commerce product descriptions into one or more nested classes. From what I've seen, Rubrix still has no support for this type of labeling.

Describe the solution you'd like I'd like to request support for hierarchical multi-label text classification. One of my favorite implementations of this comes from Label Studio. Ideally, the task model would allow for truncated classification (a label that stops at an intermediate level of the hierarchy) and a limit on the number of labels per document.

Describe alternatives you've considered One workaround would be "flattening" the available labels by concatenating label names and treating each combination independently, but this approach gets out of hand fast as the number of classes increases. It also loses information, because the parent-class context of each child class is discarded. Another problem with flattening is related to "truncated" classes: for example, if both Natural sciences > Biology > Mammals and Natural sciences > Biology are valid classes, the number of flattened classes increases a lot, cluttering the interface.
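To make the growth concrete, here is a minimal sketch that enumerates the flattened labels for a small, made-up taxonomy, with and without truncated paths. The taxonomy contents and the `flatten` helper are hypothetical and only for illustration; they are not part of Rubrix/Argilla.

```python
# Hypothetical taxonomy: nested dicts, where a leaf is an empty dict.
TAXONOMY = {
    "Natural sciences": {
        "Biology": {"Mammals": {}, "Birds": {}},
        "Physics": {"Optics": {}},
    },
    "Social sciences": {"Economics": {}},
}


def flatten(tree, prefix=(), leaves_only=False):
    """Yield every path in the tree as a tuple of label names.

    With leaves_only=True, only full root-to-leaf paths are yielded;
    otherwise truncated paths (intermediate nodes) are included too.
    """
    for name, children in tree.items():
        path = prefix + (name,)
        if not leaves_only or not children:
            yield path
        yield from flatten(children, path, leaves_only)


# Flattened label names, joined the same way as in the example above.
leaf_labels = [" > ".join(p) for p in flatten(TAXONOMY, leaves_only=True)]
all_labels = [" > ".join(p) for p in flatten(TAXONOMY)]

print(len(leaf_labels), "labels without truncated classes")
print(len(all_labels), "labels once truncated classes are allowed")
```

Even in this tiny tree, allowing truncated classes adds one flattened label per intermediate node, and every one of them shows up as a separate option in a flat label list.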

Another workaround would be listing all level 1, 2 ... n classes available and trusting the annotator to input only valid combinations, but that again is a less-than-optimal workflow and might result in ambiguous or invalid labels.
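Instead of trusting the annotator, the combinations could be checked against the taxonomy. The following is only a sketch under assumptions: the taxonomy is a nested dict, and `is_valid_path`, `validate_annotation`, and `max_labels` are hypothetical names, not an existing Rubrix/Argilla API.

```python
# Hypothetical taxonomy: nested dicts, where a leaf is an empty dict.
TAXONOMY = {
    "Natural sciences": {"Biology": {"Mammals": {}}, "Physics": {}},
    "Social sciences": {"Economics": {}},
}


def is_valid_path(path, tree=TAXONOMY):
    """Return True if path follows parent-to-child edges starting at the root.

    Truncated paths (stopping before a leaf) are accepted; empty paths are not.
    """
    node = tree
    for name in path:
        if name not in node:
            return False
        node = node[name]
    return len(path) > 0


def validate_annotation(paths, max_labels=3):
    """Return a list of error messages; an empty list means the annotation is valid."""
    errors = []
    if len(paths) > max_labels:
        errors.append(f"too many labels: {len(paths)} > {max_labels}")
    for path in paths:
        if not is_valid_path(path):
            errors.append(f"invalid path: {' > '.join(path)}")
    return errors


# A truncated but valid path passes; a path that skips its parent does not.
print(validate_annotation([("Natural sciences", "Biology")]))
print(validate_annotation([("Biology", "Mammals")]))
```

A check like this catches invalid combinations after the fact, but it does not remove the extra effort for the annotator, which is why a dedicated hierarchical widget would still be preferable.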

Additional context Relevant references using this labeling task model:

dvsrepo commented 2 years ago

Thanks so much for the very detailed proposal @guilhermeschmitd-hotmart . We will add this to the items to prioritize in the roadmap.

guilhermeschmitd-hotmart commented 2 years ago

Thanks! I'll be on the lookout for updates to the repo, @ me anytime to further discuss this feature and possible implementations ^^

hanshupe commented 2 years ago

Support for taxonomies would indeed be an urgent feature, thanks for bringing it up.

github-actions[bot] commented 2 years ago

This issue is stale because it has been open for 30 days with no activity.

louisguichard commented 1 year ago

Hello, any update on this? On some tasks it can really be a blocking point. Thanks a lot!

nataliaElv commented 1 year ago

Hello @louisguichard ! I'm currently looking at the issue and the limitations of the Feedback Datasets to do a taxonomy classification. Do you happen to know of any public datasets with multi-level taxonomies?

guischmitd commented 1 year ago

@nataliaElv Here's a pretty good standard to follow https://webdatacommons.org/structureddata/2014-12/products/gs.html

louisguichard commented 1 year ago

Hi @nataliaElv and thanks for your feedback!

The 20 Newsgroups dataset can be seen as a hierarchical classification problem with 7 main classes. Here's a snippet of code if you'd like to take a closer look:

import pandas as pd
from sklearn.datasets import fetch_20newsgroups

# Fetch dataset
newsgroups = fetch_20newsgroups(subset='all')

# Create a DataFrame to hold the dataset
df = pd.DataFrame({'text': newsgroups.data, 'target': newsgroups.target})

# Map target to raw categories
df['raw_category'] = df['target'].apply(lambda x: newsgroups.target_names[x])

# Split raw categories into main and sub categories
df["main_category"] = df["raw_category"].apply(lambda cat: cat.split(".")[0])
df["sub_category"] = df["raw_category"].apply(lambda cat: "-".join(cat.split(".")[1:]))

Note that we could add several other levels of classes as well.
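As a rough sketch of what "several other levels" could look like, the dotted category names can be split into a nested structure with one level per segment. The list below is only a subset of the 20 Newsgroups category names, hard-coded so the example runs without downloading the dataset; `build_taxonomy` is a hypothetical helper.

```python
# A few of the 20 Newsgroups category names (a subset, for brevity).
names = [
    "comp.graphics",
    "comp.sys.ibm.pc.hardware",
    "comp.sys.mac.hardware",
    "rec.sport.baseball",
    "rec.sport.hockey",
    "sci.space",
]


def build_taxonomy(dotted_names):
    """Turn dotted category names into a nested dict, one level per segment."""
    root = {}
    for dotted in dotted_names:
        node = root
        for part in dotted.split("."):
            # Reuse the child node if it exists, otherwise create it.
            node = node.setdefault(part, {})
    return root


taxonomy = build_taxonomy(names)
print(sorted(taxonomy))                 # top-level classes
print(sorted(taxonomy["comp"]["sys"]))  # third-level classes under comp > sys
```

With the real dataset, `newsgroups.target_names` from the snippet above could be passed to `build_taxonomy` instead of the hard-coded list.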

I feel that providing an option for hierarchical classification might not only help address this kind of problem, but also simplify the UI when the number of classes is large (you could first choose the main field and then specify).

Hope it helps!

nataliaElv commented 1 year ago

Thanks for the links @guischmitd @louisguichard ! From looking at your references, it seems that in some cases the taxonomy is applied to text classification and in others to entity classification. Would you say that's right? I'll investigate more on this matter. Thanks!

louisguichard commented 11 months ago

Hi @nataliaElv!

I guess that hierarchical labels are mostly used for text classification problems, but we could indeed imagine a NER with hierarchical entities.

I think this problem could also be seen from a UI point of view: the idea would be to be able to label data on several questions at once, with the labels of the second question depending on the answer to the first question.

nataliaElv commented 11 months ago

Something like this then? https://github.com/argilla-io/argilla/issues/3112

louisguichard commented 11 months ago

Yes, I think this could be a good solution!

As long as you can choose a different question for each of the options in the previous question, with each question having its own labels.
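To sketch the idea of dependent questions: each option of a question can point to its own follow-up question, which has its own labels. This is only an illustration under assumptions; the `Question` class and `questions_for` function are hypothetical and are not Argilla's API, and the label names are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Question:
    name: str
    labels: list
    # Maps a label of this question to the follow-up question it unlocks.
    follow_ups: dict = field(default_factory=dict)


def questions_for(root, answers):
    """Return the chain of questions to show, given the answers so far.

    `answers` maps a question name to the label chosen for it.
    """
    chain = [root]
    current = root
    while current.name in answers:
        nxt = current.follow_ups.get(answers[current.name])
        if nxt is None:
            break  # the chosen option has no follow-up question
        chain.append(nxt)
        current = nxt
    return chain


sci = Question("sci_topic", ["space", "med", "crypt"])
rec = Question("rec_topic", ["autos", "sport"])
main = Question("main_category", ["sci", "rec", "misc"], {"sci": sci, "rec": rec})

# Choosing "sci" in the first question unlocks the "sci_topic" question.
for question in questions_for(main, {"main_category": "sci"}):
    print(question.name, question.labels)
```

Options without a follow-up (like "misc" here) simply end the chain, which also covers truncated labels.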

davidberenstein1957 commented 10 months ago

It might also be relevant to look into merging labels from child to parent; however, this might be covered by the bulk-labelling feature. A visible warning for this would be welcome, though.

nataliaElv commented 8 months ago

@guilhermeschmitd-hotmart @louisguichard @guischmitd We've shared an early prototype of this question with the community here: https://rubrixworkspace.slack.com/archives/C05DHJ3LGQM/p1710932414340039 Feel free to take a look and leave any comments/thoughts you may have 😄 Thanks!