samterfa opened this issue 1 year ago
If we decide to implement this, I think we should separate it from the current function. Name it something like hf_load_dataset_slice(), thoughts?
I do wonder whether redirecting users to sample_n() or one of the slice functions might be preferable, though?
I'd like to be consistent with other functions we've created. The non-ez functions we've implemented generally allow a user who knows what they're doing to get most/all the functionality of the python libraries we are emulating, but also abstract a little of the hard stuff away. In that spirit it seems like we should provide a function that accommodates the raw split argument, and perhaps also have a version that abstracts away the hard stuff, or incorporate them into the same function.
The nice thing about split = 'train[:10]' is that you can preview a dataset without downloading the whole thing. I think we should allow the user to pull a portion of any split, and I think we can do this in a single function. Perhaps we should create a hf_dataset_info() function which pulls the available splits and other useful info. If we wanted to be fancy we could make hf_load_dataset() act like a lazy query so you could do things like hf_load_dataset(split = 'train') %>% slice_sample(n = 10).
Yeah nice, allowing that functionality would be great - previewing the large datasets would be a real boon. I'm wondering if a hf_preview_dataset() function would do the trick. I'm very wary of trying to do too much with hf_load_dataset() - it already does too much, IMO.
There are some slick ways to pull only subsets of splits, which you can read up on here. The current implementation doesn't allow for these; they used to work by being passed through via `...`. An example is:
```r
hf_load_dataset("the_dataset", split = "train[:10%]")
```
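For anyone unfamiliar with the notation, here's a rough sketch of what the split-slicing strings mean. This is a hypothetical stand-alone parser for illustration only, not code from this package or from the `datasets` library; it covers just the basic `split[from:to]` forms (absolute row indices or percentages) and ignores the fancier parts of the syntax such as combining splits with `+` or the rounding options.

```python
import re

# Hypothetical illustration: matches specs like "train", "test[20:50]",
# or "train[:10%]" - a subset of the Hugging Face split-slicing syntax.
_SPEC = re.compile(r"^(?P<split>\w+)(?:\[(?P<from>-?\d+%?)?:(?P<to>-?\d+%?)?\])?$")


def parse_split_spec(spec):
    """Split a spec string into (split_name, from_bound, to_bound).

    Bounds are returned as raw strings (e.g. "10%" or "20"), or None
    when absent, mirroring Python slice semantics.
    """
    m = _SPEC.match(spec)
    if m is None:
        raise ValueError(f"unrecognised split spec: {spec!r}")
    return m.group("split"), m.group("from"), m.group("to")


def resolve_bound(bound, n_rows):
    """Turn a bound like '10%' or '25' into an absolute row index.

    n_rows is the total number of rows in the split, which in practice
    you'd get from dataset metadata (e.g. the proposed hf_dataset_info()).
    """
    if bound is None:
        return None
    if bound.endswith("%"):
        return round(int(bound[:-1]) / 100 * n_rows)
    return int(bound)


# Example: "train[:10%]" against a 1000-row train split selects rows [0, 100).
split, lo, hi = parse_split_spec("train[:10%]")
print(split, resolve_bound(lo, 1000), resolve_bound(hi, 1000))
```

The point for the R side is that a raw `split` string like this only needs to be passed through verbatim to the underlying Python call; the slicing itself happens downstream, which is what makes "download only a portion of a split" possible.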