Add DataCache subclass for loading HuggingFace datasets

endymion commented 2 months ago

We need a HuggingFaceDataCache class (in HuggingFaceDataCache.py) here: https://github.com/AnthusAI/Plexus/tree/main/plexus/data

It's easy: https://colab.research.google.com/drive/1jffzDNeDNSAIO37YUDT7YSQehSNfE1Uu?usp=sharing

Something like:

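from datasets import load_dataset

# Import path assumed from the plexus/data directory layout.
from plexus.data.DataCache import DataCache

class HuggingFaceDataCache(DataCache):
    """
    A class to load HuggingFace datasets by name.
    """

    def load_dataframe(self, *args, **kwargs):
        # load_dataset() without a split returns a DatasetDict, not a DataFrame,
        # so pick one split (assumed here: "train") and convert it to pandas.
        dataset = load_dataset(self.parameters.name, split="train")
        return dataset.to_pandas()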

The challenge here is not loading the dataset, but rather: How do we know it's working? How can we use it after we load it to confirm that we did it correctly?

Plexus needs standard tools for analyzing datasets that we load, right?

There's an emerging tool for that: https://github.com/AnthusAI/Plexus/blob/main/plexus/cli/DataCommands.py#L78

Our goal here would be to get that tool to display useful information about that dataset.
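
As a rough idea of what "useful information" might mean, here's a minimal sketch of a sanity check over loaded rows. This is not the code in DataCommands.py; summarize_rows is a hypothetical helper, and it assumes the rows arrive as plain dicts (for a pandas DataFrame, df.to_dict("records") produces that shape). The column names default to the ones used for ag_news in this issue.

```python
from collections import Counter


def summarize_rows(rows, content_column="text", label_column="label"):
    """Hypothetical helper: basic sanity stats for loaded rows (dicts)."""
    rows = list(rows)
    # How many rows carry each label value (None if the label is missing).
    label_counts = Counter(row.get(label_column) for row in rows)
    # Rows whose content is missing or only whitespace.
    empty_content = sum(
        1 for row in rows
        if not str(row.get(content_column) or "").strip()
    )
    return {
        "row_count": len(rows),
        "columns": sorted({key for row in rows for key in row}),
        "label_counts": dict(label_counts),
        "empty_content": empty_content,
    }


# Tiny made-up sample with a text column and an integer label column.
sample = [
    {"text": "Stocks rally on earnings", "label": 2},
    {"text": "Team wins the final", "label": 1},
    {"text": "", "label": 1},
]
summary = summarize_rows(sample)
print(summary)
```

If the row count, the column list, and the label distribution look plausible, that's at least a first signal that the dataset loaded correctly.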

To do that, we'll need a scorecard configuration with at least one score that uses that data cache class to load a dataset.

Something like:

scores:

  "News Category":
    class: SomeClassifier
    data:
      class: HuggingFaceDataCache
      name: "fancyzhx/ag_news"
      content_column: text
      label_column: label
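
To make the idea concrete, here's a minimal sketch of how a `data` block like that could turn into a data cache instance. Everything in it is an assumption for illustration, not how Plexus actually resolves configuration: StubDataCache stands in for the real DataCache subclass, DATA_CACHE_CLASSES and build_data_cache are hypothetical names, and the config is written as a Python dict instead of being parsed from YAML.

```python
from types import SimpleNamespace

# Python equivalent of the YAML above (parsed config assumed to be nested dicts).
config = {
    "scores": {
        "News Category": {
            "class": "SomeClassifier",
            "data": {
                "class": "HuggingFaceDataCache",
                "name": "fancyzhx/ag_news",
                "content_column": "text",
                "label_column": "label",
            },
        }
    }
}


class StubDataCache:
    """Stand-in for a DataCache subclass; it only stores its parameters."""

    def __init__(self, **parameters):
        # Mirrors the self.parameters.name access used in the class sketch.
        self.parameters = SimpleNamespace(**parameters)


# Hypothetical registry mapping the `class` string to a Python class.
DATA_CACHE_CLASSES = {"HuggingFaceDataCache": StubDataCache}


def build_data_cache(score_config):
    """Instantiate the data cache named in a score's `data` block."""
    data_config = dict(score_config["data"])  # copy so the config is untouched
    cache_class = DATA_CACHE_CLASSES[data_config.pop("class")]
    # The remaining keys (name, content_column, label_column) become parameters.
    return cache_class(**data_config)


cache = build_data_cache(config["scores"]["News Category"])
print(cache.parameters.name)
```

The point is only that the `class` key selects the loader and the remaining keys become its parameters, which is what self.parameters.name relies on.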

endymion commented 2 months ago

https://colab.research.google.com/drive/1jffzDNeDNSAIO37YUDT7YSQehSNfE1Uu?usp=sharing

uokesita commented 2 months ago

@endymion I think I have the class ready, but I have no idea how to use it for a simple score. Is there a small score I can have a look at?

AnthusAI / Plexus

Add DataCache subclass for loading HuggingFace datasets #9