Refresh Model Zoo - Githubissues

ToucheSir commented 3 years ago

This issue will track long term progress for the model-zoo. Here are the steps to get there (in order):

[x] Test all models: run through all the tutorials to check whether they work on the latest release and document problems
[x] Address outstanding external PRs: take care of PRs such as
- [x] Metalhead.jl
- [x] Flux.jl outdims
- ~~LearnBase.jl~~
- ~~yet to be filed MLDataPattern.jl~~
[x] Issue/PR clean-up: close stale issues and PRs
[ ] Update models: get models working with latest release (no new tutorials!)
[ ] Add new tutorials: add new tutorials (especially ones that make use of the new ecosystem packages)
[ ] Trim old tutorials: remove tutorials that are superseded by new tutorials
[ ] Set up benchmarking CI: convert tutorials to Weave.jl like SciML
[ ] Translate to benchmarks: convert as many tutorials as possible to use the benchmark model

Original Text

Collecting these issues and PRs here for now. Eventually we may want to split some out for triage.

[ ] https://github.com/FluxML/model-zoo/pull/241 update to Flux 0.11 (note: this is already done for the tutorial)
[ ] https://github.com/FluxML/model-zoo/issues/235 incorrect dimension order for some CNNs
[ ] https://github.com/FluxML/model-zoo/pull/199 use the non-deprecated TagBot Action
[ ] https://github.com/FluxML/model-zoo/issues/154 rm bitrotted diffeq examples since they appear to be covered under SciML
[ ] https://github.com/FluxML/model-zoo/issues/244 fix model non-convergence
[ ] https://github.com/FluxML/model-zoo/pull/203 clean up unnecessary use of global scope
[ ] https://github.com/FluxML/model-zoo/issues/218, https://github.com/FluxML/model-zoo/pull/192 many models are missing READMEs
[ ] [TODO issue] MNIST batch size is too aggressive and causes frequent OOMs
[ ] [TODO issue] add CI (and possibly convergence testing?)

adinhobl commented 3 years ago

What is the overall vision for the model zoo? Like what attributes would make a good zoo model? I'm going to throw out some things and let me know if they are what you have in mind:

single .jl file + Project.toml: self-contained, no need to have a bunch of files, easier setup (just ]instantiate, run Julia file). Not easy to document what different pieces are doing without long comments
jupyter notebook: example like this which gives a full tutorial to the ecosystem (I'm working on this right now to learn myself, but it could be useful to know the whole pipeline and options for other newcomers). But this is harder to get up and running and hard to build a benchmark pipeline around
optimized to be extremely fast: CuIterators and carefully thought out optimizations, no global scope. Harder for a Flux/Julia newbie to do
Everything up to date: keep the examples relevant, identify bugs and holes before other people try them, actively prevent bitrot
What to do about the data? Does every model need to implement its own method for downloading data? is there a best practice?

I read someone somewhere else said that they thought getting a working model zoo up to best practices would set the stage for growing the ML/DL/Flux ecosystem because then it's easier to benchmark and progressively improve.

adinhobl commented 3 years ago

As someone newer to the ecosystem and Julia, it would probably be good to have a list of best practices when implementing a model - a checklist with explanations. It would promote consistency for the model zoo models and be a good jumping point for new people in the community. There's a PR for something about putting things in the global scope, but someone new from python won't necessarily know not to do that for their model.

I think there is also likely an opportunity to improve these pages as well, specifically the performance one: https://fluxml.ai/Flux.jl/stable/ecosystem/ https://fluxml.ai/Flux.jl/stable/performance/

adinhobl commented 3 years ago

As for my first post above, I just found Literate.jl, which I think could be a solution to having to maintain 1 .jl file but still being able to take advantage of the richness of notebooks. If they were all done in that style, there could be unit tests and CI but it could be easily transferable to notebooks. Just a thought.

darsnack commented 3 years ago

These are great comments. I'll try and give some insight on some of them.

Julia files vs. Jupyter notebooks: I believe Chris has given us some references on how SciML uses Literate.jl (or maybe it was Weave.jl) to generate Julia + notebooks as well as benchmark with CI. (link)
Staying up to date: This is absolutely crucial, and I think that using the zoo for benchmarking will help ensure this.
Data management: All zoo models should use MLDatasets.jl, or implement the custom dataset interface.

ChrisRackauckas commented 3 years ago

It's all done by https://github.com/SciML/RebuildAction

darsnack commented 3 years ago

As someone newer to the ecosystem and Julia, it would probably be good to have a list of best practices when implementing a model - a checklist with explanations.

Totally agree with this. Once the Metalhead.jl PR is resolved, I can write up a section on this in the docs for Flux.jl.

ToucheSir commented 3 years ago

RE documenting best practices, there's an old list here that might be a good jumping-off point. I think it's also worth a look at the GSOD PRs for overlap and/or opportunities for collaboration.

adinhobl commented 3 years ago

Thanks for all the responses. It sounds like there are a handful of lists that could be put together generally on the topics of performance and best practices, and I think putting them in the Flux Docs probably makes sense.

@darsnack , when you said "the custom dataset interface" it sounds like you were referring to something in particular?

It also sounds like I need to familiarize myself with RebuildAction, CI, and the SciML Benchmarks, likely after this semester ends.

ChrisRackauckas commented 3 years ago

Feel free to ask for help, and feel free to ping me to join the next ML Fast AI coordination call. I am getting Dhariya involved as well.

darsnack commented 3 years ago

@darsnack , when you said "the custom dataset interface" it sounds like you were referring to something in particular?

Yes, since we've adopted MLDataPattern.jl for iterating datasets, any custom dataset needs to implement the getobs interface. Here is a good overview of what that entails.

johnnychen94 commented 3 years ago

I wrote a wonderful Documenter plugin DemoCards.jl for JuliaImages's demo page. It uses Literate.jl so you can write in plain julia files. It maps the folder structure into page structure so making changes are very flexible.

ghost commented 3 years ago

I'm willing to put some effort in updating the model zoo. Already started testing the models, I'm keeping a list of issues here.

DhairyaLGandhi commented 3 years ago

So we already have the scripts folder which does the conversion, which is for this exact use case, also FluxBot.jl

We need only go through the models to add literature. This is literate based as well, so it should be easy to test things out.

cc @sophb

DhairyaLGandhi commented 3 years ago

Another thing that makes the model zoo less useful is that recently many models have this argument handling as part of every model which distracts from what the scripts are meant to be there for. Removing that and using simpler constructs that point to exactly what aspect is being talked about would make a whole lot of difference

ghost commented 3 years ago

Yes, I noticed this while testing the models, some of them have taken a bit too much of a kitchen sink approach.

IMHO the model zoo should find the right balance between three things:

Impress with what can be done with Flux
Guide people trying to learn Flux
Benchmark the Flux package

My guess is that highlighting just one or maybe two features or ecosystem packages per model and using each of these features or ecosystem packages in only one or two models will result in a good balance. Overdoing it will only result in distraction from the great things that can be done with Flux, harder to grasp tutorials and longer running times for the benchmarks.

Some examples:

Use the dcgan model to highlight a more complex custom training loop and try to use a vanilla training loop in all the other models.
The language detection model can be used to highlight a custom dataset and let all the others use MLDatasets
The VGG CIFAR10 model could be the only one using custom logging with TensorBoardLogger.jl

darsnack commented 3 years ago

IMHO the model zoo should find the right balance between three things:

Impress with what can be done with Flux Guide people trying to learn Flux Benchmark the Flux package

I think if we use @ChrisRackauckas approach with SciML, we should be able to write Literate.jl scripts for all our model zoo examples. And I agree that some examples should throw out some of the kitchen sink and try to focus on the essence of what that example is trying to teach.

Right now, the problem is that all the examples exist to be consumed in script form. Instead, if they were written in Literate.jl, we could have a Publish.jl website for the entire zoo. The pages of that site being the tutorials. I think just that change of writing for a different audience makes a difference in the produced result.

darsnack commented 3 years ago

We are working on establishing a "Flux" Publish theme for all the ecosystem packages. I can put together a sample PR where I translate a couple of the zoo examples and showcase the website.

DhairyaLGandhi commented 3 years ago

I have been saying for a while that if we can get the literature part in the model scripts, the conversion to be pushed to the site is trivial.

We are working on establishing a "Flux" Publish theme for all the ecosystem packages. I can put together a sample PR where I translate a couple of the zoo examples and showcase the website.

A PR would be welcome, have you seen the tutorial pages on the site already?

DhairyaLGandhi commented 3 years ago

we should be able to write Literate.jl scripts for all our model zoo examples.

They already are, check the scripts directory on the model zoo.

Also check the dg/zygote branch for how the zoo used to look like, without the kitchen sink approach.

darsnack commented 3 years ago

~~Ah I didn't realize that the model-zoo was already capable of feeding the tutorials on the website. Does this happening automatically on release?~~

I'm thinking of something like the tutorials page on the website, but it is automatically updated by CI on every "release" of the model zoo. That way every thing in the zoo is consumed as either a tutorial on the website, a runnable script you can download, or a script that the benchmarking CI can run.

In the set up I was describing, the model zoo would have its own website that hosted all these tutorials. Is there a way to tie that to the Flux website? I don't think GH actions can trigger events on other repos?

DhairyaLGandhi commented 3 years ago

I'm thinking of something like the tutorials page on the website, but it is automatically updated by CI on every "release" of the model zoo

So the idea is exactly that, and we have a working example of that tied with FluxBot.jl, plus RebuildAction would allow for benchmarking. We are already setting up a benchmarking suite for GPU performance.

the model zoo would have its own website that hosted all these tutorials. Is there a way to tie that to the Flux website

We should have that happen as part of the flux website, for sure.

DhairyaLGandhi commented 3 years ago

Could we add a bit in the tracker to move the site off Jekyll? Maybe publish.jl or pkgpage.jl or franklin.jl?

darsnack commented 3 years ago

Yeah moving off Jekyll to a Julia-based static website generator would make this all easier for sure.

CarloLucibello commented 3 years ago

Can we replace any occurrence of train! in the model-zoo with a custom loop? Discussion in https://github.com/FluxML/Flux.jl/issues/1461 is not converging, and maybe this is something we could all agree on

DhairyaLGandhi commented 3 years ago

replace any occurrence of train! in the model-zoo with a custom loop?

I don't think that's a great idea, maybe for a model which specifically intends to show the loop or something that benefits from it.

Discussion in FluxML/Flux.jl#1461 is not converging, and maybe this is something we could all agree on

I'm not sure what you mean here? There is ongoing and active discussion and best to have it properly

CarloLucibello commented 3 years ago

I don't think that's a great idea, maybe for a model which specifically intends to show the loop or something that benefits from it.

I suggest the other way around, we give a single example of train!

I'm not sure what you mean here? There is ongoing and active discussion and best to have it properly

It surely good to have the discussion, I'm just saying it is not converging and it could go on for months. My suggestion is to reverse things as currently stand, and primarily point users to the pattern that is more informative and more flexible, and only in the second instance to train!

DhairyaLGandhi commented 3 years ago

Let me rephrase this. The pattern that we have established with Flux.jl is what is represented in the examples, and we shouldn't change things only to change them back if we don't have a clear answer yet which will come with the convergence.

CarloLucibello commented 3 years ago

there is no need to change anything back once you have custom loops

darsnack commented 3 years ago

If most of the examples use train!, and we converge on removing it or downplaying it, then we taught a bunch of users to use a non-preferred API.

The for-loop will always be a first class API, so you can't go wrong by using custom loops everywhere.

DhairyaLGandhi commented 3 years ago

Not really, the for loop is under the same considerations and open to the same changes that any other api flux has had or will have. I prefer the for loop to train personally, but it isn't cogent to push it when it's not

If the examples are better served to not need a for loop, then we did the right thing by teaching users to look for similar constructs in other packages that provide loops for more complex cases, and also educate them about how the loops work by having examples and docs that are meant to teach the for loop api.

API design shouldn't prefer one set of assumptions for every case, but offer options.

I don't want everyone copying the same for loop everywhere, the average case does not need any more complication. If there are average cases not well served, improve the api to catch that case.

CarloLucibello commented 3 years ago

I don't want everyone copying the same for loop everywhere, the average case does not need any more complication. If there are average cases not well served, improve the api to catch that case.

what I exactly want is for people to copy everywhere the for loop, they don't need the complication of train!. We have a very good api, it's called for loop, everyone is already familiar with it, it serves the average case very well, we don't need to improve anything.

darsnack commented 3 years ago

I updated the main post to reflect @joostveenema's progress (correct if I got wrong).

I don't think we need to halt on the JuliaML PRs, since the current available solutions w.r.t. data loading work well. There is no reason to halt meaningful progress on an important model-zoo refresh for a future data loading update (I don't think any of the API changes to JuliaML will surface in the model-zoo in a meaningful way anyways).

I kept the Metalhead.jl PR in there, since I do think any pre-trained models in the zoo should use the future-facing version of Metalhead.jl. I plan on effectively automating the process of training and committing the pre-trained models today. I will update the Metalhead.jl PR when I am ready.

ChrisRackauckas commented 3 years ago

what I exactly want is for people to copy everywhere the for loop, they don't need the complication of train!.

If everyone has to copy the same piece of code around then the abstraction is wrong and we should change it. Maybe callbacks need to support more things, or there can be a few more switches. But telling everyone to roll it out by hand is only useful to exactly the same devs building the library.

The for-loop will always be a first class API, so you can't go wrong by using custom loops everywhere.

Indeed, but flexibility will always be limiting. There will always be some optimizers that require less flexibility (BFGS, KrylovTrustRegion), and so fully promoting the most open choice in a function sense is also limiting in another sense. The path forward is to try and tame what can be done and capture what users do into a simplified API to then specialize and help the code perform better, not let it run too loose. Isn't that the point of Flux in the first place since you can just define the layers by hand?

ghost commented 3 years ago

@darsnack I did start marking issues as stale but wasn't done yet. I'll spend some more time on it this week.

aditkumar72 commented 3 years ago

I have added and updated README of all the vision model in https://github.com/FluxML/model-zoo/pull/305 which fixes #218

aditkumar72 commented 3 years ago

I have made all the changes in https://github.com/FluxML/model-zoo/pull/305 can anyone please review.

FluxML / FluxML-Community-Call-Minutes

Refresh Model Zoo #9