Dataset, DataLoaders, transforms for julia

FluxML / model-zoo

Please do not feed the models

https://fluxml.ai/

Dataset, DataLoaders, transforms for julia #112

Closed udion closed 2 years ago

udion commented 5 years ago

Hi, I am a GSoC-19 aspirant. While going through some of the code in model-zoo, I realized that data is currently loaded in an ad-hoc manner. I see that MLDatasets tries to improve on this, but I was wondering if we could have something like PyTorch's Datasets, DataLoaders and Transforms. I can work on that, and I can also contribute to the zoo by adding examples from the zero/few-shot learning domain and by creating wrappers around some common generative models such as mixture models and VAEs, plus a general wrapper for GAN/WGAN/WGAN-GP that one could call with custom Generators and Discriminators.

I see this coming together as a single tutorial that demonstrates the use of the loaders and the train/test APIs (with some interesting few-shot learning examples).
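To make the request concrete, here is a minimal sketch of the dataset / transform / loader pattern being referred to. It is written in plain Python with no dependencies, purely as an illustration of the shape of the interface. All names (`ListDataset`, `Compose`, `DataLoader`, `scale`, `add_bias`) are hypothetical and defined here; this is not PyTorch's implementation and not the API of any Julia package.

```python
import random
from typing import Callable, Iterator, List, Optional, Sequence, Tuple

Vector = List[float]
Transform = Callable[[Vector], Vector]


class Compose:
    """Chain several transforms; each one maps a feature vector to a feature vector."""

    def __init__(self, transforms: Sequence[Transform]) -> None:
        self.transforms = list(transforms)

    def __call__(self, x: Vector) -> Vector:
        for t in self.transforms:
            x = t(x)
        return x


class ListDataset:
    """Indexable dataset: the transform runs lazily, only when a sample is requested."""

    def __init__(self, features: Sequence[Vector], labels: Sequence[int],
                 transform: Optional[Transform] = None) -> None:
        if len(features) != len(labels):
            raise ValueError("features and labels must have the same length")
        self.features = list(features)
        self.labels = list(labels)
        self.transform = transform

    def __len__(self) -> int:
        return len(self.features)

    def __getitem__(self, i: int) -> Tuple[Vector, int]:
        x = self.features[i]
        if self.transform is not None:
            x = self.transform(x)
        return x, self.labels[i]


class DataLoader:
    """Iterate over a dataset in mini-batches, optionally in shuffled order."""

    def __init__(self, dataset: ListDataset, batch_size: int = 1,
                 shuffle: bool = False, seed: Optional[int] = None) -> None:
        self.dataset = dataset
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.rng = random.Random(seed)

    def __iter__(self) -> Iterator[Tuple[List[Vector], List[int]]]:
        indices = list(range(len(self.dataset)))
        if self.shuffle:
            self.rng.shuffle(indices)
        for start in range(0, len(indices), self.batch_size):
            batch = [self.dataset[i] for i in indices[start:start + self.batch_size]]
            yield [x for x, _ in batch], [y for _, y in batch]


def scale(factor: float) -> Transform:
    return lambda x: [v * factor for v in x]


def add_bias(bias: float) -> Transform:
    return lambda x: [v + bias for v in x]


dataset = ListDataset(
    features=[[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]],
    labels=[0, 1, 0],
    transform=Compose([scale(2.0), add_bias(1.0)]),
)
loader = DataLoader(dataset, batch_size=2)
batches = list(loader)
for xs, ys in batches:
    print(xs, ys)
```

The point of the split is that the dataset only knows how to fetch one sample, the transform only knows how to modify one sample, and the loader only knows about ordering and batching, so each piece can be swapped independently.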

Thoughts?

@ViralBShah @staticfloat @avik-pal @dhairyagandhi96

DhairyaLGandhi commented 5 years ago

First off, let me welcome you to the Julia and Flux world!

There has been recent discussion about data loaders on Slack, which pointed at efficient implementations already present in the language and in external packages. Ideally, we don't want to reinvent the wheel with platform-specific code of limited reusability. I suggest having implementations for common use cases, like loading and batching images or text, that are general enough to be used in other areas (not specifically ML) in a frictionless manner.

I'd like to hear more about the VAE/GANs idea. @avik-pal has some work in that area already.

avik-pal commented 5 years ago

Hey there, welcome to Julia.

If you are looking for direction on the GAN idea, have a look at torchgan. I am more than happy to help build something similar for Flux (in fact, I have already ported the losses to Julia).

In general, it would be interesting to have a more diverse set of generative models (like the ones you mention) instead of focussing only on GANs.

MikeInnes commented 5 years ago

Pinging @oxinabox

oxinabox commented 5 years ago

I think that when it comes to data loaders, the best project would be one that benchmarks the natural existing Julia solutions, like MLDatasets, CSV.jl / CSVFiles.jl, whatever Images.jl uses, etc., against the custom PyTorch and perhaps TensorFlow loaders.

Then, with those benchmarks in hand, profile the things that seem slower and make sensible optimisations that help all use cases, not just ML. For example, it might turn out that async processing is useful. (In my experience async was a massive letdown, but distributed rocked: https://white.ucc.asn.au/2018/07/14/Asynchronous-and-Distributed-File-Loading.html)
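As a rough illustration of the benchmark-first approach, the sketch below times two ways of loading the same set of files and checks that they return identical data before comparing timings. It is plain Python using only the standard library, with a thread pool standing in for a concurrent loader; it is not Julia's async or distributed machinery, and the file names, file sizes, worker count and helper names (`load_file`, `load_sequential`, `load_threaded`, `timed`) are all assumptions made for the example.

```python
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def load_file(path: str) -> bytes:
    """Read one file fully into memory."""
    with open(path, "rb") as f:
        return f.read()


def load_sequential(paths: List[str]) -> List[bytes]:
    """Baseline: load files one after another."""
    return [load_file(p) for p in paths]


def load_threaded(paths: List[str], workers: int = 4) -> List[bytes]:
    """Concurrent variant: a thread pool; map() preserves input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(load_file, paths))


def timed(fn: Callable[..., List[bytes]], *args) -> Tuple[List[bytes], float]:
    """Run fn(*args) and return its result with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


with tempfile.TemporaryDirectory() as tmp:
    # Synthetic data set: 32 small files with known contents.
    paths = []
    for i in range(32):
        path = os.path.join(tmp, f"sample_{i}.bin")
        with open(path, "wb") as f:
            f.write(bytes([i]) * 1024)
        paths.append(path)

    seq_result, seq_time = timed(load_sequential, paths)
    thr_result, thr_time = timed(load_threaded, paths)

# Correctness first: a faster loader that returns different data is not a win.
same_output = seq_result == thr_result
print(f"sequential: {seq_time:.6f}s  threaded: {thr_time:.6f}s  same output: {same_output}")
```

On files this small the two timings are dominated by overhead and say little; a real comparison would use realistic data sizes, repeated runs, and a cold versus warm file cache.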

ToucheSir commented 2 years ago

We now have libraries for everything in the title in the broader ML ecosystem. Flux uses some, and the others are used in higher-level libraries like FastAI.jl. The remaining discussion is either out of date or probably better handled by separate issues.

rkube commented 1 year ago

For the record: the Augmentor.jl documentation gives an example that illustrates how to apply image augmentations in a lazy-evaluation style, similar to PyTorch datasets.