argilla-io / argilla

Argilla is a collaboration tool for AI engineers and domain experts to build high-quality datasets
https://docs.argilla.io
Apache License 2.0

Rubrix Cheatsheet/Cookbook #69

Closed ignacioct closed 3 years ago

ignacioct commented 3 years ago

The idea is to cover the interaction between the main NLP libraries. So far, I've found these to include:

If we find any others, they should be quick to add.

Regarding the name, I've found several examples of "cheatsheet" but none of "cookbook". "Cookbook" has more personality and makes more sense to me (explaining how to bring other services into our app is a bit like mixing different ingredients), but "cheatsheet" is by far the standard.

One very cool example is the Streamlit docs. I was wondering whether that was an ipynb, but it is a standalone Streamlit app that they run from a separate repo and host on their premium hosting service. I don't know if something similar can be done from our starting point (a notebook), but it would be cool. Other approaches, like readthedocs, use PNGs. That may not be difficult to replicate with our current style, but it is not very maintainable. So for now, a Jupyter notebook is a cool start, and maybe a destination (?).

ignacioct commented 3 years ago

Work in progress:

ignacioct commented 3 years ago

@dvsrepo I'm seeing that spaCy Transformers is really a wrapper around Hugging Face's, so does it make sense to do TextClassification with spaCy? I'm not finding anything like a zero-shot classifier; everything is focused on extracting lemmas, linguistic info, and so on.

We can do TokenClassification as in tutorial 2; that's a clear road.

dvsrepo commented 3 years ago

Hi @ignacioct! spaCy provides much more than a wrapper around Hugging Face transformers (a clear Doc data model, syntactic features, etc.) and many people use it for TextClassification. It's true, though, that there are not many "pre-trained" spaCy classifiers available, nor a "spaCy Hub", so for the guide I agree with you that we can leave them out, as we want self-contained snippets.

Actually, one cool thing would be to use https://allenai.github.io/scispacy/ for the spaCy textcat example. It is an extension and I think it comes with pretrained classifiers! Check it out and see if it's possible.
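A textcat snippet in the cheatsheet would mostly come down to reshaping scores. Here is a minimal sketch, assuming the classifier exposes its scores as a label-to-score mapping (as spaCy's `doc.cats` does) and that the record wants a list of `(label, score)` pairs. The `cats_to_prediction` helper, the sample scores, and the agent name are hypothetical, and the plain dict only stands in for whatever record class the cheatsheet ends up using.

```python
from typing import Dict, List, Tuple


def cats_to_prediction(cats: Dict[str, float]) -> List[Tuple[str, float]]:
    """Turn a label -> score mapping into (label, score) pairs, best score first."""
    return sorted(cats.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical scores, shaped like spaCy's `doc.cats` after a textcat component runs.
example_cats = {"NEGATIVE": 0.09, "POSITIVE": 0.91}

# Plain dict standing in for a text classification record
# (an assumption for illustration, not the actual Rubrix API).
record = {
    "inputs": {"text": "The new release works great."},
    "prediction": cats_to_prediction(example_cats),
    "prediction_agent": "hypothetical-textcat-model",
}

print(record["prediction"])
```

With a real pipeline, `example_cats` would simply be replaced by `doc.cats` from the processed document.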

ceteri commented 3 years ago

Textcat sounds good.

How about using KeyVi to build a high performance string encoder at scale?

Also, ScatterText is a way of exploring a binary text-based classifier visually.

ignacioct commented 3 years ago

What I thought of, in the case of spaCy, is to make one example for NER and one for POS tagging, which I think are two of the main uses of Token Classification.
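As a rough idea of what the POS example could look like, here is a minimal sketch that turns per-token tags into character-level spans. It assumes the record wants `(label, start_char, end_char)` tuples; the `tokens_to_spans` helper and the hard-coded tagger output are hypothetical, and the plain dict is only a stand-in for the real record class.

```python
from typing import List, Tuple


def tokens_to_spans(tagged: List[Tuple[str, int, str]]) -> List[Tuple[str, int, int]]:
    """Convert (token_text, start_char, tag) triples into (tag, start, end) spans."""
    return [(tag, start, start + len(token)) for token, start, tag in tagged]


text = "Rubrix logs predictions"

# Hypothetical tagger output. With spaCy, each triple would be
# (token.text, token.idx, token.pos_) for the tokens of a processed doc.
tagged = [("Rubrix", 0, "PROPN"), ("logs", 7, "VERB"), ("predictions", 12, "NOUN")]

# Plain dict standing in for a token classification record
# (an assumption for illustration, not the actual Rubrix API).
record = {
    "text": text,
    "tokens": [token for token, _, _ in tagged],
    "prediction": tokens_to_spans(tagged),
}

for label, start, end in record["prediction"]:
    print(label, text[start:end])
```

The NER case would be even shorter, since spaCy entities already carry `ent.label_`, `ent.start_char`, and `ent.end_char`.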

SciSpacy is a cool extension, but I would include it as a quick guide, not in the cheatsheet. Its syntax comes from spaCy, so that information would be covered here too.

We could surely include either of these two here or in separate guides. What do you think, @dvsrepo?

dvsrepo commented 3 years ago

Thanks so much for the suggestions @ceteri ! We'll take them into account for upcoming guides!