farach / huggingfaceR

Hugging Face state-of-the-art models in R

hf_load_dataset split options are limited #43

Open samterfa opened 1 year ago

samterfa commented 1 year ago

There are some slick ways to pull only subsets of splits, which you can read up on here. The current implementation doesn't allow for these. They used to work when passed in via the ... (dots) argument. An example is hf_load_dataset("the_dataset", split = "train[:10%]").
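To make the split-slicing strings concrete, here is a minimal sketch in base R of a helper that assembles them. The name split_slice() and its arguments are hypothetical and not part of huggingfaceR; the only thing taken from the Python datasets library is the string format itself (for example "train[:10%]" or "train[10:20]").

```r
# Hypothetical helper (not part of huggingfaceR): builds a split string
# in the slicing format the Python `datasets` library accepts.
split_slice <- function(split, from = NULL, to = NULL, percent = FALSE) {
  stopifnot(is.character(split), length(split) == 1)
  # No bounds given: return the plain split name, e.g. "train".
  if (is.null(from) && is.null(to)) return(split)
  unit <- if (percent) "%" else ""
  # Format one bound; a missing bound becomes an empty string.
  fmt <- function(x) {
    if (is.null(x)) "" else sprintf("%d%s", as.integer(x), unit)
  }
  paste0(split, "[", fmt(from), ":", fmt(to), "]")
}

first_tenth <- split_slice("train", to = 10, percent = TRUE)
rows_10_20  <- split_slice("train", from = 10, to = 20)

# Slices can be concatenated with "+".
combined <- paste(
  c(first_tenth, split_slice("train", from = -80, percent = TRUE)),
  collapse = "+"
)
```

The resulting string would then be handed to whatever function accepts a raw split argument.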

jpcompartir commented 1 year ago

If we decide to implement this, I think we should separate it from the current function. Name it something like hf_load_dataset_slice(), thoughts?

I do wonder whether redirecting users to sample_n() or one of the slice functions would be preferable, though?

samterfa commented 1 year ago

I'd like to be consistent with other functions we've created. The non-ez functions we've implemented generally allow a user who knows what they're doing to get most or all of the functionality of the Python libraries we are emulating, while also abstracting a little of the hard stuff away. In that spirit, it seems like we should provide a function that accommodates the raw split argument, and perhaps also have a version that abstracts away the hard stuff, or incorporate both into the same function.

The nice thing about split = 'train[:10]' is that you can preview a dataset without downloading the whole thing. I think we should allow the user to pull a portion of any split, and I think we can do this in a single function. Perhaps we should create an hf_dataset_info() function which pulls the available splits and other useful info. If we wanted to be fancy, we could make hf_load_dataset() act like a lazy query, so you could do things like hf_load_dataset(split = 'train') %>% slice_sample(n = 10).
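A rough sketch of the lazy-query idea, in base R only. Everything here (lazy_dataset(), take_head(), resolve_split()) is a hypothetical name used for illustration, not an existing huggingfaceR function. The handle just records what was asked for and turns it into a split string when the data would actually be collected. Note that it only covers "first n rows"; a random sample like slice_sample() cannot be expressed as a split string, so that would still need the rows loaded first.

```r
# Hypothetical lazy handle: stores the request, loads nothing.
lazy_dataset <- function(name, split = "train") {
  structure(
    list(name = name, split = split, n = NULL),
    class = "lazy_hf_dataset"
  )
}

# Record "first n rows" on the handle without touching any data.
take_head <- function(x, n) {
  stopifnot(inherits(x, "lazy_hf_dataset"), n >= 0)
  x$n <- as.integer(n)
  x
}

# Turn the recorded request into a split string at collection time.
resolve_split <- function(x) {
  stopifnot(inherits(x, "lazy_hf_dataset"))
  if (is.null(x$n)) x$split else sprintf("%s[:%d]", x$split, x$n)
}

query <- take_head(lazy_dataset("the_dataset"), 10)
resolved <- resolve_split(query)
```

A real version would presumably hook into dplyr verbs, but the deferred-resolution step would look much the same.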

jpcompartir commented 1 year ago

Yeah nice, allowing that functionality would be great. Previewing the large datasets would be a real boon. I'm wondering if an hf_preview_dataset() function would do the trick. I'm very wary of trying to do too much with hf_load_dataset(); it already does too much, IMO.
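One way a separate preview function could stay thin is to build the split string and delegate everything else. The sketch below is an assumption about how that might look, not an agreed design: the signature is made up, and the loader argument stands in for any function that accepts a raw split string. A stub loader is used so the example runs without Python or a network connection.

```r
# Hypothetical sketch: preview the first n rows of a split by delegating
# to a loader that accepts a raw split string.
hf_preview_dataset <- function(dataset, split = "train", n = 10, loader) {
  stopifnot(is.function(loader), n >= 0)
  loader(dataset, split = sprintf("%s[:%d]", split, as.integer(n)))
}

# Stub loader for illustration only: echoes back what it was asked to load.
stub_loader <- function(dataset, split) {
  list(dataset = dataset, split = split)
}

preview <- hf_preview_dataset("the_dataset", n = 5, loader = stub_loader)
```

Keeping the slicing logic in the wrapper would leave hf_load_dataset() itself unchanged.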