Data Loaders in Nx? - Githubissues

elixir-nx / nx

Multi-dimensional arrays (tensors) and numerical definitions for Elixir

Data Loaders in Nx? #1468

Closed: krstopro closed this issue 7 months ago

krstopro commented 7 months ago

From what I have seen so far, the conventional way to iterate over the data in batches is to use functions from Enum and Stream. Are there already abstractions for doing this in Nx, e.g. something like DataLoader in PyTorch?

josevalim commented 7 months ago

It really depends on what you want. Nx.to_batched/3 is a starting point and Explorer can give you parallel capability. Enum, Stream, and Task.async_stream will encapsulate many conveniences as well (and there is also Flow). I think we need to step back from the answer (DataLoader) and think about the question and which problems we want to solve. Several examples will help. Then if there is a shared need, a new abstraction may arise.

krstopro commented 7 months ago

I think I saw the same patterns being used several times. Basically, data shuffling using Enum.shuffle/1, batching with Nx.to_batched/3 or Stream.chunk_every/4, augmentation with Stream.map/2, etc. Something like data_stream in this notebook.
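For illustration, the pattern described above could be sketched roughly as follows. This is a minimal sketch using only Enum and Stream on plain lists; with real data the samples would likely be Nx tensors, and Nx.to_batched/3 could take over the batching step. The toy dataset and the `augment` function are made-up placeholders, not code from the notebook.

```elixir
# Hypothetical toy dataset: {features, label} tuples stand in for real samples.
dataset = for i <- 1..10, do: {[i * 1.0, i * 2.0], rem(i, 2)}

# Hypothetical augmentation: scale every feature by a constant factor.
augment = fn {features, label} -> {Enum.map(features, &(&1 * 0.5)), label} end

batches =
  dataset
  # Shuffle eagerly, once per epoch.
  |> Enum.shuffle()
  # Apply the augmentation lazily, one sample at a time.
  |> Stream.map(augment)
  # Group into batches of 4; the last batch may be smaller.
  |> Stream.chunk_every(4)
  |> Enum.to_list()
```

With 10 samples and a batch size of 4, the final batch holds only the 2 leftover samples.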

Feel free to close the issue if you feel like it's impossible to wrap this in a module.

josevalim commented 7 months ago

I think the biggest question is: should we wrap it? Why not use functions that compose neatly? Isn't that better than a large interface with several options that hide how they relate to each other?

krstopro commented 7 months ago

I am not sure myself, which is another reason I asked. There aren't that many options that a loader should support: batch size, whether to shuffle the dataset, number of workers, whether to drop the incomplete batch, collate function, etc. Perhaps releasing a notebook with a conventional way to do it might be a better idea (assuming there isn't one already that I missed).
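As a rough sketch of how those options might map onto existing functions without a dedicated module: shuffling via Enum.shuffle/1, batch size and dropping the incomplete batch via Stream.chunk_every/4 with `:discard`, and workers via Task.async_stream/3 with `:max_concurrency`. The `collate` function and the sample data here are hypothetical placeholders, and plain integers stand in for real samples.

```elixir
# Hypothetical samples; in practice these would be tensors or records.
samples = Enum.to_list(1..10)
batch_size = 4

# Hypothetical collate function: turn a list of samples into one batch map.
collate = fn batch -> %{inputs: batch, size: length(batch)} end

batches =
  samples
  # Option: shuffle the dataset.
  |> Enum.shuffle()
  # Options: batch size, and drop the incomplete trailing batch (:discard).
  |> Stream.chunk_every(batch_size, batch_size, :discard)
  # Option: number of workers; each batch is collated in its own task.
  |> Task.async_stream(collate, max_concurrency: 2)
  # Task.async_stream/3 wraps each result in an {:ok, value} tuple.
  |> Enum.map(fn {:ok, batch} -> batch end)
```

Each option stays a separate pipeline step, so it can be removed or swapped independently.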

josevalim commented 7 months ago

Yeah, Axon already has a guide, and maybe we can have one in Scholar that starts with a dataset and manipulates it, but for now I don't think it is an Nx-specific concern. :) Thanks!

krstopro commented 7 months ago

Agreed.