Closed krstopro closed 7 months ago
It really depends on what you want. Nx.to_batched/3 is a starting point and Explorer can give you parallel capability. Enum, Stream, and Task.async_stream will encapsulate many conveniences as well (and there is also Flow). I think we need to step back from the answer (DataLoader) and think about the question and which problems we want to solve. Several examples will help. Then if there is a shared need, a new abstraction may arise.
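As a sketch of how those pieces compose directly without a new abstraction, here is a minimal plain-Elixir pipeline: parallel per-example preprocessing with `Task.async_stream/3`, then batching with `Stream.chunk_every/2` (the `preprocess` function and the sizes are made up for illustration; with tensors you would reach for `Nx.to_batched/3` instead):

```elixir
# Hypothetical per-example transform, run in parallel worker processes.
preprocess = fn x -> x * 2 end

result =
  1..8
  |> Task.async_stream(preprocess, max_concurrency: 4, ordered: true)
  |> Enum.map(fn {:ok, x} -> x end)
  |> Stream.chunk_every(4)
  |> Enum.to_list()

IO.inspect(result)
# => [[2, 4, 6, 8], [10, 12, 14, 16]]
```

`max_concurrency` plays the role of a "number of workers" option, and `ordered: true` (the default) keeps batches deterministic.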
I think I saw the same patterns being used several times: data shuffling with `Enum.shuffle/1`, batching with `Nx.to_batched/3` or `Stream.chunk_every/4`, augmentation with `Stream.map/2`, etc. Something like `data_stream` in this notebook.
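That shuffle/augment/batch pattern can be sketched as a small plain-Elixir function (the `augment` step is a hypothetical stand-in, and `Stream.chunk_every/2` stands in for `Nx.to_batched/3` so the example runs without Nx):

```elixir
# Hypothetical augmentation applied lazily to each example.
augment = fn x -> x + 100 end

# Shuffle eagerly, then augment and batch lazily.
data_stream = fn dataset, batch_size ->
  dataset
  |> Enum.shuffle()
  |> Stream.map(augment)
  |> Stream.chunk_every(batch_size)
end

data_stream.(Enum.to_list(1..6), 2) |> Enum.to_list() |> IO.inspect()
```

The batch contents vary per run because of the shuffle, but six examples with batch size 2 always yield three batches.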
Feel free to close the issue if you feel like it's impossible to wrap this in a module.
I think the biggest question is: should we wrap it? Why not use functions that compose neatly? Isn't that better than a large interface with several options that hide how they relate to each other?
I am not sure myself, which is another reason I asked. There aren't that many options a loader would need to support: batch size, whether to shuffle the dataset, number of workers, whether to drop the incomplete last batch, a collate function, etc. Perhaps releasing a notebook showing a conventional way to do it would be a better idea (assuming there isn't one already that I missed).
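To make the size of that option surface concrete, here is a hypothetical `batches/2` sketch (not an existing Nx or Axon API) showing how each of those options maps onto a one-line `Enum`/`Stream` step; a worker count would be handled by `Task.async_stream/3` in the collate or augment step:

```elixir
# Hypothetical loader sketch: every option is one pipeline stage.
batches = fn dataset, opts ->
  batch_size = Keyword.get(opts, :batch_size, 32)
  shuffle? = Keyword.get(opts, :shuffle, false)
  drop_last? = Keyword.get(opts, :drop_last, false)
  collate = Keyword.get(opts, :collate, &Function.identity/1)

  dataset
  |> then(fn d -> if shuffle?, do: Enum.shuffle(d), else: d end)
  |> Stream.chunk_every(batch_size)
  |> then(fn s ->
    if drop_last?,
      do: Stream.reject(s, &(length(&1) < batch_size)),
      else: s
  end)
  |> Stream.map(collate)
end

1..10
|> Enum.to_list()
|> batches.(batch_size: 4, drop_last: true)
|> Enum.to_list()
|> IO.inspect()
# => [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Whether wrapping this is worth it is exactly the question: the function above is barely shorter than writing the pipeline inline.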
Yeah, Axon already has a guide and maybe we can have one in Scholar that starts with a dataset and manipulates it, but for now I don't think it is a Nx specific concern. :) Thanks!
Agreed.
From what I have seen so far, the conventional way to iterate over the data in batches is to use functions from `Enum` and `Stream`. Are there already abstractions for doing this in Nx, e.g. like DataLoader in PyTorch?
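For illustration, that conventional `Enum`/`Stream` approach looks something like the following (a minimal sketch with a made-up in-memory dataset and batch size):

```elixir
# Hypothetical in-memory dataset of 10 examples.
dataset = Enum.to_list(1..10)

batches =
  dataset
  |> Enum.shuffle()
  |> Stream.chunk_every(4)
  |> Enum.to_list()

# Three batches: two of size 4 and a final incomplete one of size 2.
IO.inspect(length(batches))
# => 3
```

PyTorch's DataLoader bundles these steps (plus workers and collation) behind one interface, which is what the question is about.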