FluxML / Flux.jl

Relax! Flux is the ML library that doesn't make you tensor
https://fluxml.ai/

WebDataset.jl, a linearly scalable data loader based on iterable datasets #2037

Open tmbdev opened 3 years ago

tmbdev commented 3 years ago

I'm the developer of WebDataset for PyTorch: a linearly scalable data format, with accompanying libraries and a server. WebDataset represents datasets as .tar archives of files on disk and allows access to them from any web server, object store, or cloud storage system. It's all open source, and we have demonstrated I/O speeds of 1 GB/s per GPU.
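To make the ".tar archives of files" idea concrete, here is a minimal sketch in Python using only the standard library. It is not the WebDataset or WebDataset.jl API. It assumes the common convention that consecutive files sharing a basename (for example `a.jpg` and `a.cls`) belong to one sample; the helper names `make_shard` and `iter_samples` and the `__key__` field are illustrative choices for this sketch.

```python
import io
import tarfile


def make_shard(files):
    """Build an in-memory .tar archive from a {name: bytes} mapping (hypothetical helper)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


def iter_samples(fileobj):
    """Yield one dict per sample, grouping consecutive members that share a key.

    Assumption: the key is the member path up to the first dot of the file
    name, and everything after that dot is the field name.
    """
    current_key, sample = None, {}
    # "r|" reads the archive as a forward-only stream, so this also works
    # on sources that cannot seek, such as an HTTP response body.
    with tarfile.open(fileobj=fileobj, mode="r|") as tar:
        for member in tar:
            if not member.isfile():
                continue
            dirname, _, base = member.name.rpartition("/")
            stem, _, ext = base.partition(".")
            key = f"{dirname}/{stem}" if dirname else stem
            if key != current_key and sample:
                yield sample
                sample = {}
            current_key = key
            sample["__key__"] = key
            sample[ext] = tar.extractfile(member).read()
    if sample:
        yield sample


shard = make_shard({"a.jpg": b"img0", "a.cls": b"0", "b.jpg": b"img1", "b.cls": b"1"})
for sample in iter_samples(shard):
    print(sample["__key__"], sorted(k for k in sample if k != "__key__"))
```

Because the reader only ever moves forward through the archive, the same loop could in principle consume a shard from a local file or from a streamed download without any random access.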

The PyTorch implementation is at github.com/tmbdev/webdataset; the server implementation is at github.com/nvidia/aistore.

I have recently implemented a multithreaded loader for Julia that can read the same format. You can find it at github.com/tmbdev/WebDataset.jl.

You might want to add this to the resources, as well as take it into account for DataLoaders.jl and FastAI.jl.

(I work on very large scale machine learning problems, so my next step is to see how I can get multi-GPU and multinode training to work in Julia.)

AriMKatz commented 3 years ago

Cool!

Both @jpsamaroo and @vchuravy work on multinode/multi-GPU computing, so you might be interested in reaching out to or working with them. Also check out Dagger.jl, https://github.com/JuliaComputing/DataSets.jl, FileTrees.jl, and the JuliaFolds ecosystem.

darsnack commented 2 years ago

We might want to add this to the ecosystem page when the package is ready?

tmbdev commented 2 years ago

I'm starting to use Flux.jl more heavily, so I'll be adding more examples over the next few weeks.