We need a HuggingFaceDataCache.py class here: https://github.com/AnthusAI/Plexus/tree/main/plexus/data
It's easy: https://colab.research.google.com/drive/1jffzDNeDNSAIO37YUDT7YSQehSNfE1Uu?usp=sharing
Something like:

    from datasets import load_dataset

    class HuggingFaceDataCache(DataCache):
        """
        A class to load HuggingFace datasets by name.
        """
        def load_dataframe(self, *args, **kwargs):
            return load_dataset(self.parameters.name)
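
One thing to watch: `load_dataset` returns a `Dataset` or `DatasetDict`, not a pandas DataFrame. If `load_dataframe` is expected to return a DataFrame, as its name suggests, a version along these lines might be closer. This is only a sketch: the `DataCache` stand-in below is not the real Plexus base class, and the `split` parameter and its `"train"` default are assumptions.

```python
from types import SimpleNamespace

import pandas as pd


class DataCache:
    """Stand-in for the Plexus DataCache base class (assumption, so the sketch is self-contained)."""

    def __init__(self, **parameters):
        # The snippet above reads `self.parameters.name`, so mimic attribute access.
        self.parameters = SimpleNamespace(**parameters)


class HuggingFaceDataCache(DataCache):
    """
    A class to load HuggingFace datasets by name and return them as a DataFrame.
    """

    def load_dataframe(self, *args, **kwargs) -> pd.DataFrame:
        # Imported lazily so `datasets` is only required when this cache is used.
        from datasets import load_dataset

        # `split` is a hypothetical parameter; passing one makes load_dataset
        # return a single Dataset instead of a DatasetDict.
        split = getattr(self.parameters, "split", "train")
        dataset = load_dataset(self.parameters.name, split=split)
        return dataset.to_pandas()
```

Usage would be something like `HuggingFaceDataCache(name="imdb").load_dataframe()`, where the dataset name is just an example.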
The challenge here is not loading the dataset, but rather: How do we know it's working? How can we use it after we load it to confirm that we did it correctly?
Plexus needs standard tools for analyzing datasets that we load, right?
There's an emerging tool for that: https://github.com/AnthusAI/Plexus/blob/main/plexus/cli/DataCommands.py#L78
Our goal here would be to get that tool to display useful information about that dataset.
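
As a rough idea of what "useful information" could mean, here is a minimal sketch of a summary helper. It is independent of whatever DataCommands.py actually does; the function name `summarize_dataframe` and the `label_column` parameter are hypothetical.

```python
import pandas as pd


def summarize_dataframe(df, label_column=None):
    """Return basic sanity-check facts about a loaded DataFrame."""
    summary = {
        "rows": len(df),
        "columns": list(df.columns),
        # Number of missing values per column.
        "null_counts": {col: int(count) for col, count in df.isna().sum().items()},
    }
    # Optionally report the class balance of a label column, if it exists.
    if label_column is not None and label_column in df.columns:
        counts = df[label_column].value_counts()
        summary["label_counts"] = {label: int(count) for label, count in counts.items()}
    return summary
```

If the row count, column names, and label distribution match what the dataset card on HuggingFace says, that is decent evidence the data cache loaded the right thing.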
To do that, we'll need a scorecard configuration with at least one score that uses that data cache class to load a dataset.
Something like: