FluxML / FluxML-Community-Call-Minutes

The FluxML Community Team repo

ONNX import/export #10

Closed DrChainsaw closed 2 years ago

DrChainsaw commented 3 years ago

I'm willing to put some effort into the ONNX story if there is some interest.

TBH I don’t really know what fastAI is about so if there is some special meaning to ONNX + fastAI then please let me know.

A non-exhaustive rundown of the current status in Flux: @opus111 has created BaseOnnx and I have made a branch of ONNXmutable which makes use of it (replacing ONNX.jl as the source of protos).

Afaik, ONNXmutable is fully functional and verified, and the main gripe I have with it is that it is a bit of a monolith. It depends on NaiveNASflux for the model DAG, which in turn has the somewhat big dependencies Flux (arguably not a big deal), LightGraphs and MetaGraphs (both of which could be removed with a little effort), JuMP and Cbc. ONNXmutable also has onnx and onnxruntime as test-only dependencies, as well as PyCall and Conda to be able to use and depend on them.

As such, I'm mainly focused on breaking down that monolith into smaller and more reusable components which don't force all those dependencies onto people. Please let me know if this is not what you think is needed here.

Here are some of my thoughts:

Import

In addition to the primitives discussed below, one needs some kind of runnable representation of the computation graph.

Julia's fantastic autodiff capabilities have made a special DAG format for models (as in e.g. Tensorflow) obsolete, since you can just write the DAG as a normal Julia function. While it is indeed possible to translate an ONNX model into a Julia expression which evaluates to a function representing the model (this is the approach taken by ONNX.jl), it has the limitation that the expression (at least in practice) has to be evaluated at the top level.
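
To make the top-level limitation concrete, here is a toy sketch of the generate-an-expression approach (`toynodes` and `modelexpr` are made up for illustration; this is not ONNX.jl's actual code):

```julia
# Toy stand-in for a decoded ONNX graph: each entry is (output name, operation, input names).
toynodes = [
    (:a, :(x -> 2 .* x), [:x]),
    (:b, :(a -> a .+ 1), [:a]),
]

# Build an anonymous-function expression which evaluates the nodes in order.
function modelexpr(nodes, input::Symbol, output::Symbol)
    assignments = map(nodes) do (out, f, ins)
        :($out = ($f)($(ins...)))
    end
    return :($input -> begin
        $(assignments...)
        $output
    end)
end

# eval at the top level works fine; eval'ing this inside a function and calling the result
# right away is what runs into the world-age limitation discussed here.
toymodel = eval(modelexpr(toynodes, :x, :b))
toymodel([1.0, 2.0])                     # == [3.0, 5.0]
Base.invokelatest(toymodel, [1.0, 2.0])  # one way to sidestep world age from inside a function
```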

Flux today does not have a typical DAG. Flux's built-in Chain can represent many DAGs through the use of SkipConnection, but I don't think it can represent every DAG (without user-written functions).

I think that the method used in ONNXmutable can be generalized without too much effort to work for any typical DAG, and perhaps this is something which would be useful to put in BaseOnnx. I opened an issue about it here.

If there is interest, I think I can make a functor-compatible import-onnxmodel-as-a-function macro in BaseOnnx. The drawback is perhaps that it might give people a janky experience due to it only working from the top level, but perhaps this can be solved by generating a better error message than "incorrect world age".

To reach the end goal of having a usable package, there still needs to be a DAG format to use. One option is that I extract the non-mutation stuff from NaiveNASlib into some AbstractMlDag package. It is already separated from the mutation stuff, so it is trivial to move to a separate package.

Export

The method in ONNXmutable uses dispatch for tracing and I think this is good enough for most typical/traditional DL models. It will however fail if it encounters

  1. (Non-primitive) functions with type annotations (i.e. function model(x::AbstractArray))
  2. Non-function expressions, e.g. if/else and for loops

I think IRTools can be used without too much effort to circumvent 1, but I dread to think about how to make use of it for 2. Mjolnir seems to have the perfect abstraction, but I don't know if it is effectively maintained and ready for use.
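
For concreteness, here is a toy sketch of what dispatch-based tracing means (hypothetical TraceProbe type and plain strings instead of real NodeProtos; this is not ONNXmutable's actual implementation):

```julia
using Flux

# A probe value is sent through the model; every primitive we have overloaded for the probe
# type records a "node" instead of computing anything.
struct TraceProbe
    nodes::Vector{String}  # stand-in for a growing list of ONNX NodeProtos
    name::String           # name of the tensor this probe represents
end

# One overload per known primitive: record a node and pass a new probe along.
function (l::Flux.Dense)(p::TraceProbe)
    outname = p.name * "_dense"
    push!(p.nodes, "Gemm: $(p.name) -> $outname")
    return TraceProbe(p.nodes, outname)
end

model = Chain(Dense(2, 3), Dense(3, 1))
probe = TraceProbe(String[], "input")
model(probe).nodes  # ["Gemm: input -> input_dense", "Gemm: input_dense -> input_dense_dense"]

# A method like `f(x::AbstractArray) = ...` would never accept a TraceProbe (limitation 1),
# and `if`/`for` in the model body would run on the probe instead of being recorded (limitation 2).
```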

For exporting, I'm not as certain that there exists a simple and universal enough solution which is worthwhile to put in BaseOnnx. Should we try to make a generic package or two for this, or just mash it together with the framework-specific stuff?

Primitives

Primitives are the functions which have a one-to-one correspondence with an operator defined in ONNX, for example Add, sin, Conv, RNN etc. In other words, this is the part which knows how to transform e.g. a Flux.Conv into an ONNX NodeProto and vice versa.

I don't think this can be done without manually typing out the mapping for each OP, so in any moderately well-designed ONNX package this will be by far the biggest effort to create and maintain, especially considering opset versions. To me this makes it quite important that adding OPs can be done easily, so that users of the package can contribute the OPs they need, or else it will be a thankless and soul-crushing effort to support the whole spec.
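
As a rough illustration of what "easy to add OPs" could mean, something like a per-operator registry might work; the names below (OnnxNode, ONNX_IMPORT, register_op!) are made up, and real NodeProtos carry attributes and opset versions which are glossed over here:

```julia
using Flux

struct OnnxNode                 # toy stand-in for an ONNX NodeProto plus its initializers
    op_type::String
    attributes::Dict{Symbol, Any}
    params::Vector{Array}       # weights, biases, ...
end

const ONNX_IMPORT = Dict{String, Function}()
register_op!(op_type, f) = ONNX_IMPORT[op_type] = f

# Users (or the package itself) register one mapping per ONNX operator they need:
register_op!("Relu", n -> relu)
register_op!("Gemm", n -> Dense(n.params[1], n.params[2]))  # ignoring transA/transB etc.

# Import then just dispatches on op_type:
toflux(n::OnnxNode) = ONNX_IMPORT[n.op_type](n)
```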

To me, this makes it pretty useful to have a package with only the primitives to attract contributions. Obviously one package per ML framework is needed, as the primitive package has to depend on the ML framework package (e.g. Flux, Knet, Lilith etc).

The modest set of primitives from ONNXmutable can serve as a starting point for the Flux package.

Furthermore, import and export have basically nothing to do with each other and I don't see a way to make use of import primitives for export and vice versa. This opens the door to having separate packages for import and export primitives, but it is also kind of nice to be able to test the import functionality against the export functionality and vice versa. Thoughts on this?

Another thought is whether it makes sense to have a package with OPs from Base (e.g. Add, tan, sin, max, reshape etc)?

Testing

Testing numerical computations is always annoying. In ONNXmutable I used 1) test vectors from onnx and 2) comparison with output from onnxruntime. Nothing says onnxruntime is the gold standard ofc, but it seems like a lot of work is being put into it, so I think it makes for a pretty good reference.

There are a couple of lines of code to set all of this up and perhaps this can also be made a package which others can make use of (e.g. primitive packages for other ML frameworks).
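
For reference, the kind of setup needed is roughly the sketch below (assuming PyCall is configured with a Python that has onnxruntime installed; importmodel is a hypothetical placeholder for whatever the Julia-side import function ends up being called):

```julia
using PyCall, Test

const ort = pyimport("onnxruntime")

function test_against_onnxruntime(modelfile, x; atol=1e-5)
    sess = ort.InferenceSession(modelfile)
    inname = sess.get_inputs()[1].name
    # onnxruntime wants NCHW while Flux-style arrays are WHCN, hence the permutes
    ref = sess.run(nothing, Dict(inname => permutedims(x, (4, 3, 2, 1))))[1]
    jlout = importmodel(modelfile)(x)   # importmodel is hypothetical
    @test jlout ≈ permutedims(ref, (4, 3, 2, 1)) atol=atol
end
```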

Package homes

The packages which are specific to a particular ML framework are best served sitting in the same org as their parent frameworks, right? What about the remainder?

Assuming the above, the remainder is basically BaseOnnx, the testing package, the Base OPs and maybe a generic export package or two if it makes sense to create them.

I guess that creating a new org on GitHub is no effort, but perhaps it is good to try to put it into some more well-known org like JuliaML.

Another option is to just give up on components which are not tied to any framework and perhaps just create another monolith. I think the advantage of this over what ONNXmutable offers today is basically that one does not need to depend on JuMP and Cbc. I can't imagine that this is the reason why people state that Julia has no functional ONNX import/export though.

DrChainsaw commented 3 years ago

Here is my takeaway from the meeting:

Splitting up ONNXmutable has lower priority than making sure it can load state-of-the-art models. Is this correct? The quickest way to proceed here is to just test loading models and file issues (and preferably PRs 😊 ) when they fail.

If so, I can probably take the functionality from BaseOnnx and add it to ONNXmutable instead, along with (protobuf 0.9 compliant) protos, in order to get rid of the warning-generating dependency on ONNX.jl, and release a 0.1 version very soon. If there is a need to split it up for reuse purposes this can always be done later of course.

I will add examples of loading a model and replacing the head as well as a pruning example here. Please let me know if there is anything else you’d like to see.

Load model and replace classification head

Here is how to load a model and a few of the ways NaiveNASflux lets you explore it:

```julia
(ImportOnnxExample) pkg> add NaiveNASflux, MLDatasets, https://github.com/DrChainsaw/ONNXmutable.jl

julia> using ONNXmutable, NaiveNASflux, MLDatasets # Lots of warnings from ONNX.jl :(

julia> resnetfile = download("https://github.com/onnx/models/raw/master/vision/classification/resnet/model/resnet18-v1-7.onnx");

julia> model = CompGraph(resnetfile);

julia> name.(vertices(model)) # Query names like this

julia> nout.(vertices(model)) # List output sizes like this
```

Ok, let's scrape off the classification head and replace it with a new dense layer.

```julia
julia> vs = vertices(model);

julia> remove!(vs[end], RemoveStrategy(NoSizeChange())); # Normally we want NaiveNASflux to align the sizes between the input to the removed layer and the output, but in this case we explicitly want to change the output size without touching anything else

julia> insert!(vs[end-1], v -> mutable("newhead", Dense(nout(v), 10), v));

julia> newmodel = CompGraph(vs[begin], outputs(vs[end-1]));

julia> vertices(newmodel) |> last |> nout
10

julia> vertices(newmodel) |> last |> name
"newhead"
```

Unfortunately, removing the very last layer is slightly more inconvenient than any other layer because 1) the CompGraph has a reference to the output layer and 2) NaiveNASflux will by default think that it needs to change the size of the previous layer to keep the same output size (which is a pretty reasonable default assumption to be fair). In normal cases one can just do `remove!` and `insert!` and it will change the model without having to create a new model.

Anyways, now one can just (re-)train the model as normal.

```julia
julia> Flux.adapt(T, x::Flux.Zeros) = x # Workaround for Flux issue #1332

julia> Flux.adapt(T, x::Base.ReinterpretArray) = T(x) # ONNX.jl turns TensorProtos into ReinterpretArrays and Zygote does not like that. BaseOnnx makes them normal Arrays, so this is a temporary nuisance

julia> newmodel = newmodel |> cpu; # Same as above (this should ofc be gpu if training on gpu)

julia> loss(x,y) = NaiveNASflux.Flux.Losses.logitcrossentropy(newmodel(x), y); # Forgot to add Flux, but it is available through NaiveNASflux

julia> x,y = MLDatasets.CIFAR10.traindata();

julia> xpadded = zeros(Float32, 224, 224, 3, 8); # Need to pad excessively to match the size for this constructed use case

julia> xpadded[97:128, 97:128, :, :] .= x[:,:,:,1:8];

julia> yhot = NaiveNASflux.Flux.onehotbatch(y[1:8], 0:9);

julia> pshead = NaiveNASflux.Flux.params(vertices(newmodel) |> last |> layer); # Only train the parameters of the last layer

julia> length(pshead)
2

julia> NaiveNASflux.Flux.train!(loss, pshead, [(xpadded, yhot)], ADAM()) # I don't have a GPU on this computer, so I'll only do one batch
```

DrChainsaw commented 3 years ago
Here is the pruning use case

```julia
julia> model = CompGraph(resnetfile);

julia> numneurons(m) = mapreduce(nout, +, vertices(m)); # Not needed, just to show that something happened

julia> numneurons(model)
15531

julia> function pruning_metric(v, offs)
           val = neuron_value(v) # neuron_value defaults to magnitude of parameters along the activation dimension
           ismissing(val) && return fill(offs, nout_org(v)) # Layers with no parameters return missing by default
           return val .- min(offs, 0.8*maximum(val)) # min is a crude safeguard to prevent layers from getting size 0
       end

julia> allvals = mapreduce(neuron_value, vcat, vertices(model)) |> skipmissing |> collect;

julia> cutoff = partialsort(allvals, round(Int, 0.3*length(allvals))) # Cutoff is bigger than (approx) 30% of all values

julia> for v in vertices(model) # It is currently a limitation that one first must reduce the size of each individual vertex. If you want to use NaiveNASflux for pruning I think I can fix this limitation
           metric = pruning_metric(v, cutoff)
           nprune = sum(<(0), metric) - length(metric) + nout(v)
           nprune <= 0 && continue
           Δnout(v, -nprune)
       end

julia> Δoutputs(model, v -> pruning_metric(v, cutoff)); # Given the new sizes for all layers, decide which neurons to keep

julia> apply_mutation(model);

julia> model(ones(Float32, 224, 224, 3, 2)) |> size # Model is still internally consistent
(1000, 2)

julia> numneurons(model) # 30% fewer neurons
10042
```

Now, this turned out to prune a lot more than 30% of all parameters, so chances are the model accuracy suffered a lot. It also seems like the layers with more parameters tend to have lower parameter magnitudes, so the strategy of having a global cutoff is probably not the best. It should however be straightforward to change the above example to do the comparison per layer instead (i.e. prune 30% of each layer instead of 30% in total), so I'll leave that as an exercise to the reader :). Note that to train this model one needs to do that annoying (and soon to be fixed) dance with overloading a few Flux methods and mapping to `cpu`.

Obviously all the above examples could (and perhaps should) be wrapped in much nicer APIs. NaiveNASflux tries to be more of a library than a user-facing package and I have put flexibility before ease of use.

I'm thinking this can be used as a start to figure out what a nice looking API for the above use cases could look like.

jeremiedb commented 3 years ago

Maybe a naive question, but would it be possible to convert the model imported by ONNXmutable back into a regular Flux chain?

For example, following:

julia> model = CompGraph(resnetfile);

If the model could be recomposed into a stack of regular Flux building blocks (Conv, Dense, SkipConnection), then I think it would make manipulating the model, such as for image transfer learning, very easy and intuitive, just like what is done in the model-zoo tutorial: https://github.com/FluxML/model-zoo/blob/master/tutorials/transfer_learning/transfer_learning.jl. The issue with the latter being that only VGG19 seems functional in Metalhead.

DrChainsaw commented 3 years ago

It certainly could, at least for the type of models which can be expressed as a chain. It is a bit of a mental exercise since the graph format in ONNX is very different from how (non-linear) graphs are expressed with chains. The current deserialization is just a very simple recursion through the graph and I don't think one can do the same to create a chain. Splitting ONNXmutable into multiple packages would however allow for creating a simpler importer which makes use of the same primitives, but it seems like the interest in this is somewhat low.

Another somewhat annoying but certainly surmountable issue is that one probably needs to resort to wrapping layers in closures, which requires some side mechanism for providing the parameters.

I also think that the example looks easy because the graph is very linear. The same mechanism would not be so nice in the non-linear case. I guess that the "scrape off the classification head" use case would be fine for almost all models though.

DrChainsaw commented 3 years ago

Btw, if the graph is linear then one can just do this:

julia> chain = Chain(layer.(vertices(compgraph)[2:end])...)
jeremiedb commented 3 years ago

Thanks a lot for the clarifications! My general impression was that the ability to easily access typical pre-trained building blocks such as the VGG/ResNet models was a quite common need, so I've been a little surprised that it didn't appear as a high priority to the Flux ecosystem. And the same applies on the NLP side.

The Chain(layer.(vertices(compgraph)[2:end])...) works smoothly with VGG16/19 models. For ResNet, it seems to struggle:

graph = CompGraph("data/resnet34-v1-7.onnx")
julia> m = Chain(layer.(vertices(graph)[2:end])...)
ERROR: MethodError: no method matching layer(::NaiveNASlib.var"#225#226"{typeof(+)})
Closest candidates are:
  layer(::ONNXmutable.Flatten) at C:\Users\jerem\.julia\packages\ONNXmutable\kxK8z\src\deserialize\constraints.jl:179
  layer(::CompVertex) at C:\Users\jerem\.julia\packages\NaiveNASflux\0lGnm\src\vertex.jl:114
  layer(::NaiveNASflux.InputShapeVertex) at C:\Users\jerem\.julia\packages\NaiveNASflux\0lGnm\src\vertex.jl:18

Also, with the ResNets v2, it fails at the import step:

julia> graph = CompGraph("data/resnet34-v2-7.onnx")
ERROR: MethodError: no method matching (::ONNXmutable.var"#144#145")(::Dict{Symbol,Any})
Closest candidates are:
  #144(::Any, ::Any) at C:\Users\jerem\.julia\packages\ONNXmutable\kxK8z\src\deserialize\ops.jl:214
Stacktrace:
 [1] wrapfrom(::ONNXmutable.OnnxNode, ::ONNXmutable.OnnxNode, ::ONNXmutable.CompGraphBuilder, ::Symbol, ::Dict{Symbol,Any}) at C:\Users\jerem\.julia\packages\ONNXmutable\kxK8z\src\deserialize\combine.jl:47
...

Are the residual connections the cause of the difficulties here? The ONNX models were taken from here: https://github.com/onnx/models/tree/master/vision/classification/resnet/model

(let me know if it'd be preferable that I open an issue on the ONNXmutable repo or elsewhere)

DrChainsaw commented 3 years ago

Thanks for showing interest @jeremiedb, sorry for wall-of-texting you as a response :)

My general impression was that the ability to easily access typical pre-trained building blocks such as the VGG/ResNet models was a quite common need, so I've been a little surprised that it didn't appear as a high priority to the Flux ecosystem.

I think this has been stated a few times, but outside of the group maintaining this tracker it does not seem like there is an overwhelming need, judging by the number of posts on Discourse and the influx of issues in ONNX.jl and ONNXmutable.jl. There could certainly be a chicken-and-egg problem here, as people might silently turn back to Python when they can't find a canonical ONNX package for any of the Julia ML frameworks.

I would love to discuss my proposal above some more. I'm however a bit hesitant to just go ahead and litter the general registry with ONNX packages without any wider support. Chances are that people will just go "hmmm, unknown single author and I don't understand the point of all of this so I'll just roll my own" if I proceed. I'm happy to do some work if we can work out a structure which we believe in. I will also be happy if the conclusion is that we should wait for some ONNX expert to catch interest instead :)

For ResNet, it seems to struggle:

Yes, sorry for using confusing home-made graph terminology here. A ResNet would not fall into the category of "linear DAGs" as the elementwise summations are nodes in the graph which take input from more than one other node.

The CompGraph basically uses the exact same graph representation as ONNX: each node is an operation, its input edges describe from which other nodes in the graph it takes input, and its output edges describe which other nodes consume the output. These are accessible through inputs(node) and outputs(node) respectively (I ended up using the name vertex instead of node in NaiveNASlib, but afaik node and vertex are direct synonyms). With a CompGraph g in hand, one can recurse through the nodes using these methods and either g.inputs or g.outputs.
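
A tiny sketch of what such a recursion can look like (only inputs, name and the CompGraph fields mentioned above are assumed; printtree is made up here):

```julia
using NaiveNASflux

# Walk backwards from a vertex towards the graph inputs, printing an indented tree.
function printtree(v, indent=0)
    println(" "^indent, name(v))
    foreach(iv -> printtree(iv, indent + 2), inputs(v))
end

# For a CompGraph g: foreach(printtree, g.outputs)
# Note that in a ResNet the vertex feeding a residual branch is reached via two paths,
# so a real graph-to-Chain conversion would need to memoize visited vertices.
```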

I can't say I have thought a lot about it, but creating an algorithm to transform an arbitrary graph using this representation into a chain does not seem like a trivial task. I'm sure the ResNet can be done if one assumes that the only things one needs to handle are things which fit the SkipConnection (e.g. by memoizing the inputs when there is more than one and wrapping in a SkipConnection when you hit one of the memoized inputs again), but how would one do e.g. a ResNeXt?

If you have a suggestion for the above I'd be happy to accept a PR (or even implement it if you tell me how) in ONNXmutable to have something like Chain(g::GraphProto) or just load(modelfile; as=Chain) while pending input on the structure proposal above.

Also, with the ResNets v2, it fails at the import step:

This is definitely an issue with ONNXmutable! Please file an issue and I'll look into it.

FYI, the code which fails is the heuristics to combine multiple ONNX nodes into a single CompGraph node. One example of when one might want this is activation functions, as ONNX always has them as separate nodes in the graph while Flux allows them to be inside the layers. This is by no means a necessary step, but I thought it was nice to allow for things like CUDA optimizations as well as generally trying to make sure that import -> export returns something which is as close to the same model as possible.

darsnack commented 3 years ago

There are a couple of related issues here too.

First, on the point of basic models/blocks like VGG/ResNet, we have a PR to Metalhead.jl that implements some of these natively in Flux. Ideally, I think we should skip ONNX completely and provide pre-trained models by just training those native Flux models (that was my intent anyways).

Second, there would still be a need beyond basic models for the ability to read in arbitrary ONNX models into Flux models. I agree that this would be very useful functionality, but like @DrChainsaw mentioned, it is a tricky problem. Any solution would probably need to be well-documented so someone can reference what to expect from the translation. That being said, I think this PR will be helpful for expressing more complex graphs. I've already indicated in one of the ML committers calls that the PR should get some attention and be merged.

DrChainsaw commented 3 years ago

Second, there would still be a need beyond basic models for the ability to read in arbitrary ONNX models into Flux models.

I'm not sure I would just blanket equate "Flux models" with "Chain". In my view, a CompGraph is also a Flux model, as is any Julia function which makes use of building blocks from Flux (i.e. what ONNX.jl tried to do).

That being said, I think this PR will be helpful for expressing more complex graphs.

I still don't fully understand the rationale behind that PR. Isn't the nice thing about the Chain that if your model is super simple, so that it only consists of unary operations strung (well, chained actually) together, you can get rid of a lot of complexity, as you only need to store the layers as a tuple/array since the structure is implicit from the struct itself?

Once you deviate from this simple assumption, wouldn't it be better to just use the same format as what is more or less tried and true when it comes to DAGs? I see many arguments that Flux should be more like the other frameworks and I guess that if model building deviates the same complaints will come for that too.

darsnack commented 3 years ago

I'm not sure I would just blanket equate "Flux models" with "Chain". In my view, a CompGraph is also a Flux model, as is any Julia function which makes use of building blocks from Flux (i.e. what ONNX.jl tried to do).

This is true, and it is certainly what makes Flux so powerful. But if I need to manipulate a model, then I'd much rather operate with the same layer building blocks in Flux. There are two reasons for this:

  • Flux layers are pretty well designed for manipulation with standard Julia syntax
  • Most people (myself included for my work) don't want to learn another library to interpret a model

Isn't the nice thing about the Chain that if your model is super simple, so that it only consists of unary operations strung (well, chained actually) together, you can get rid of a lot of complexity, as you only need to store the layers as a tuple/array since the structure is implicit from the struct itself?

I agree with this too. I don't think Flux should build graphs of the computation like other frameworks. I do think Chain should remain a simple sequentially executed list of functions. That PR doesn't necessarily invalidate that assumption. It would make representing models like Inception easier.

  • Parallel is just a generic form of SkipConnection with more branches. The input and output is still a unary piece of data. In the Metalhead.jl PR that I linked, a closure is used to implement the many branches of an Inception module, since SkipConnection is limited to two branches. If you mostly want to execute the branching module, then there is little difference here. But if you want to manipulate or access the module's parts, then having a struct/layer like SkipConnection is useful.
  • Join/Split are where it might appear like the unary structure is broken, but I would argue that the input and output are still "unary" because they are a single tuple. If you use a Split in a Chain, then the Chain only sees the one tuple as the output and not the individual pieces of the tuple. It is up to the person who used the Split in the Chain to make sure that the subsequent layer accepts the tuple as an input. We already do this when we x -> reshape(x, :) between convolution and fully-connected layers.

Generally speaking, I am not in favor of adding more layers to Flux. I think it is better to just use closures and arbitrary Julia functions whenever possible. But I think Parallel, Join, and Split have proven to be ubiquitous enough in ML that their addition is warranted. Given the assumption that the input and output to a DAG are truly unary, I think Chain + Parallel + Join + Split lets you include that DAG into a "native" Flux model. Like I mentioned, that translation is tricky and must be well-documented so people know exactly what gets interpreted as what, but the ability to make that translation would be useful for me (and I suspect many others).
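
For concreteness, a minimal sketch assuming a Parallel(connection, branches...) layer along the lines of that PR: each branch gets the same (unary) input and connection combines the branch outputs, so the block still composes inside a Chain.

```julia
using Flux

block = Chain(
    Conv((3, 3), 3 => 8, relu; pad=1),
    Parallel(+,                          # two branches summed back into one unary output
        Conv((3, 3), 8 => 8; pad=1),
        Conv((1, 1), 8 => 8)),
    Conv((3, 3), 8 => 4, relu; pad=1))

size(block(rand(Float32, 32, 32, 3, 1)))  # (32, 32, 4, 1)
```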

opus111 commented 3 years ago

Hi, I am very sorry to join this conversation so late.

I come to Julia from an industrial side, where we are trying to make products. From that point of view, FastAI is a repository of best training practices, and ONNX provides access to the best pretrained models. ONNX also provides the critical link to other non-Julia ecosystems. For example, one might want to suck in VGG from ONNX, fine-tune it in Flux with best practices from FastAI, save it to ONNX, and deliver it via some large ONNX Runtime system.

So, for my purposes, ONNX does not have to support all DAGs, just the ones used in popular vision and NLP pretrained models. I think if we get reading and writing VGG, ResNet, BERT, GPT-2/3 and maybe a few more working, that will make 90% of developers happy.

FastAI should contain working examples of transfer learning using ONNX. I would try to match some of the popular Python examples, such as those from Coursera's NLP Transformer class.

Finally, I also strongly agree that we should make it easy to extend ONNX with new primitives so that researchers can come up with new designs, and have them used outside of Julia.


DrChainsaw commented 3 years ago

Most people (myself included for my work) don't want to learn another library to interpret a model

Yeah, the essence of this argument is certainly a wrench in Julia's narrative about everything being first class so people can just design their own extensions to other packages. I guess it has a pretty strong gravitational effect on adding peripheral stuff to existing packages, as otherwise one needs to reach agreement on what the canonical package for it is, or forever suffer the Lisp curse (disclaimer: I'm not a Lisper, I saw someone mention this as being the same issue as in Julia's ecosystem and thought it was spot on).

I don't want to argue this too much because I'm really not that dug in as to what I believe is right, but:

But I think Parallel, Join, and Split have proven to be ubiquitous enough in ML that their addition is warranted.

Like I mentioned, that translation is tricky and must be well-documented

Those layers are not part of Flux now, so learning to make use of them in the context of a Chain might be comparable to learning another library. Other popular frameworks I know of make use of the "traditional DAG" formulation to support them (or just leave it up to the user to define the function). Why make things more difficult than they have to be?

Anyways, if you disagree but don't want to argue it further then no need to reply. As I stated below, I think it is a non-issue for this work if we go the eco-system route.

@opus111

I'm not so familiar with transformers. Given that BERT and GPT can be expressed as ONNX graphs, the graph can also be expressed with the CompGraph format. Do you think they can be expressed using Split, Join and Parallel from the Flux PR linked above, or does one also need to add something like MultiHeadAttention to Flux? Note that MultiHeadAttention is not an ONNX operation, so recognizing it from its graph of primitives might not be an easy task. I tried opening GPT-2 in Netron and I must admit that the result made me a bit dizzy (not that I expected anything else), and it was not clear at all how one would express it as a Chain (which may very well be due to the sheer size of the model ofc).

So, for my purposes, ONNX does not have to support all DAGs-- just the ones used in popular Vision and NLP pretrained models.

Yeah, this is pretty much how I went about it as well (the "for my purposes" part that is) and I think it is the only scalable approach unless some big corp decides to throw a lot of people at this.

I'm not sure if you mean anything special when you say "all DAGs"; I'd like to separate 'operations' from 'DAGs', where the latter just describes the sequence of 'operations' to apply. A missing operation is typically easier to fix the day you realize you need it compared to fixing a missing subset of DAGs. Conversely, supporting all 'DAGs', but not all 'operations', from day one is pretty straightforward if you just copy the DAG format from ONNX.

I'm happy to merge PRs for the OPs you need in your business, and probably even implement a few of them if you file issues (unless there is anything ethically wrong about doing so).

As I said above, I'm also happy to dismantle ONNXmutable into several packages so that someone can build the ONNX <-> Chain importer without having to either reimplement all the primitives or depend on the CompGraph packages.

I mean, it's not that hard to write them from scratch, but at least I managed to mess up a double-digit number of things, like not reversing/interleaving arrays along certain dimensions, in ways which did not even show up when running onnx's own test suite. That's why I ended up using onnxruntime as a test dependency, so that the test suite checks that all OPs (and a handful of models with more OPs strung together) produce the same output for the same input.

Again, the "most people don't want to learn another library" argument is pretty much why I don't go ahead and make those packages I suggested in the OP. I don't care what my own GitHub looks like ofc, but I prefer not to litter the general registry with packages for a single-person ONNX ecosystem and I don't want to deal with dependencies on non-registered packages as anything but an extremely temporary situation.

Although I'm happy there is some attention to this, I'm not sure if the comments in this issue mean that there is strong disagreement with the original proposal or if we just ended up accidentally bikeshedding the DAG format (which is a non-issue if we go the eco-system route imo). Would it be possible to zoom back in on that and see if it might satisfy all requirements, or is there some other proposal on how to proceed?

ToucheSir commented 3 years ago

Part of the difficulty with this discussion is that neither PyTorch nor TF handle importing from ONNX. That said, https://mxnet.apache.org/versions/1.7.0/api/python/docs/tutorials/packages/onnx/fine_tuning_gluon.html#Fine-Tuning-the-ONNX-model is a decent e2e transfer learning example, so perhaps we shouldn't be looking at those two frameworks at all!

The MXNet implementation looks pretty straightforward, but it's operating at more of the TF 1.0/NNlib level. This seems to be a good sweet spot—the JAX repo has a similar (toy) implementation. What if CompGraph (or whatever structure ends up being used for storing imported models) was simply implemented in terms of NNlib operations? Params gathering would just be a matter of adding Functor support, and (if it's not there already), "scraping off" layers would be as simple as a getindex override.
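
A speculative sketch of that idea (ConvNode and its fields are made up for illustration; only NNlib.conv and Functors.@functor are assumed):

```julia
using NNlib, Functors

struct ConvNode{W, B}
    weight::W
    bias::B
    pad::NTuple{2, Int}
    stride::NTuple{2, Int}
end
@functor ConvNode (weight, bias)  # only the arrays are collected/trained, not pad/stride

(n::ConvNode)(x) = NNlib.conv(x, n.weight; pad=n.pad, stride=n.stride) .+ n.bias

# With @functor in place, parameter collection (and hence optimisers) can find the weights
# of a whole graph of such nodes without the graph knowing anything about Flux layers.
node = ConvNode(randn(Float32, 3, 3, 3, 8), zeros(Float32, 1, 1, 8, 1), (1, 1), (1, 1))
size(node(rand(Float32, 32, 32, 3, 1)))  # (32, 32, 8, 1)
```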

Note that the above was all about importing. WRT exporting, I suppose the only options are some kind of tracing framework (e.g. what ONNXMutable uses now) or some extra machinery on top of Functors.jl. Assuming the former wins out, why not trace all the way down to NNlib and Base functions? Yes ONNX is conceptually a DAG, but it's also a full IR with control flow and everything. https://github.com/FluxML/XLA.jl was a good POC, so (given a less flaky Mjolnir equivalent) I think Flux and ONNX could be decoupled here as well.

Edit: I think this could also resolve the DAG format discussion, if only because there wouldn't need to be a shared DAG between Flux and ONNX(Mutable). Creating "a less flaky Mjolnir" may be the sticking point though.

DrChainsaw commented 3 years ago

@ToucheSir

I think that the actual way one is supposed to use ONNX is an excellent point which I did not address in the OP, partially because I don't have a satisfactory answer.

I have also gotten the impression over time that ONNX is in practice less of an exchange format between frameworks, and that something like onnxruntime is the end goal for ONNX models. I guess the major frameworks rely on being big enough so that they don't need to exchange models, or do they use some other format when exchanging between frameworks?

My current take on this is that doing an ONNX import/export is still more general than picking one framework and doing its export/import, or trying to cover all of them. I have made a throw-away limited import/export from Keras (basically me being the stubborn guy who refused to use Python) and I don't think that format was much easier to import into Flux compared to ONNX.

I made ONNXmutable mostly with long-term storage of models in mind, and for this use case it is ofc very valuable that a loaded model is as close as possible to what was saved. I'm not sure if this requirement is in direct conflict with the transfer learning or more general import case, but let's explore it a bit.

Right now, CompGraph is just an expression of the computational graph without any knowledge of its operations. It's literally just the description of where the output of an operation goes. As such, I don't make the connection between the getindex override and NNlib ops.

Using NNlib operations for some export cases is certainly worth exploring to see if it simplifies and generalizes some aspects. I don't think it is ideal in all cases, as ONNX defines "bigger" ops (e.g. Gemm vs MatMul) to facilitate optimization, but maybe good ONNX importers can do these optimizations themselves. Another issue is that I'm not certain that ONNX supports "free" parameters in the model, so things like the bias in conv layers which are not visible inside NNlib could cause issues. I also like to be able to use e.g. Netron and recognize the model I just exported, which may or may not be a requirement worth considering.

Just to be clear, ONNXmutable does trace down to Base primitives like cat, +, - etc, including broadcasted versions of those where applicable.

Starting with the most primitive of the primitives certainly makes sense if one starts from scratch and aims to cover the whole spec. The approach I have taken so far is to add the stuff I need and just keep doing that for as long as I need something, which I think is a pretty decent pragmatic approach all in all. I don't think either approach excludes the other though.

For the import case it is a bit less clear whether this is generally valuable. Firstly, you need something to hold the parameters, and I'm not sure it is the most user-friendly thing in a transfer learning context to return opaque closures. I guess one way to go about it is to have a generic Node (maybe <:AbstractVertex if using CompGraph) which holds the params and some function which takes the params and the input. For ONNX ops which have a clear 1-to-1 mapping in Flux, would there be any advantage of this over having native layer implementations? I'm thinking again about the "most people don't want to learn another library" argument, which is also why I tried to make use of existing Flux implementations to the largest extent possible.

I wanted to use Functors for the export rather than messing with tracing, but it just seemed to be designed for a completely different purpose. In particular, it does not tell you anything about the program flow. It also does not penetrate inside closures (e.g. the combine method in SkipConnection) or functions with default values, so I gave up on that idea. Do you have any idea how to make use of it?

Yes ONNX is conceptually a DAG, but it's also a full IR with control flow and everything.

I might not be enough of a computer scientist to speak accurately here, but the ONNX IR does not define control flow afaik. It defines node operations like Loop and If which are encapsulated within a single node in the computation graph. This can easily be done with CompGraph and Chain as well, since the vertices/layers can have any operation in them. Or can the graphs inside e.g. an If node contain nodes of the 'outer' graph? I guess at least looping in this manner while still calling the graph 'acyclic' would be impossible...

If your point was that the current tracing approach which only uses dispatch will choke on Julia control code then you are absolutely correct. Mjolnir as you point out seems like the perfect tracing tool and I have many times thought that I should give it a go, but never found the energy.

One thing I don't think Mjolnir/XLA solves is the world age problem, which again is why some Julia-defined structure (like Chain or CompGraph) is needed to describe the DAG. As I stated in the OP, if one is content with a slim ONNX package which can only import into global scope, it is straightforward to generate a Julia IR from any of the three formats (ONNX graph, CompGraph or Chain) and I think it can also be made functor-compatible. I don't even think Mjolnir is needed for this.

I fully agree that Flux and ONNX can be (and in fact are) fully decoupled. Building something like onnxruntime in Julia would not require one to depend on Flux. It seems from the discussion above though as if adherence to Flux is very desirable from a transfer learning perspective. Perhaps there does not need to be one package which fulfills all requirements though, and that leads back to the eco-system discussion (i.e. what packages should there be and what are their responsibilities) which I think is the right discussion to have.

Note to self: Practice being more concise in online conversations. I don't think anyone will read this post up to this line so I guess I can say whatever I want here. (can't think of anything to say since I have basically used all words in post above)

darsnack commented 3 years ago

The MXNet implementation looks pretty straightforward, but it's operating at more of the TF 1.0/NNlib level. This seems to be a good sweet spot

Yeah I think there is value in an ONNX representation. I am certainly not advocating for a system where all ONNX models are represented by Flux layers. I think the MXNet implementation looks more like what I was thinking. Some kind of ONNX2Flux.jl package that translates CompGraph to Flux layers.

Yeah, the essence of this argument is certainly a wrench in Julias narrative about everything being first class so people can just design their own extensions to other packages.

I think multiple representations can exist at the same time. Some people will use CompGraph, and some people will prefer to switch to Flux layers. We don't have to put all functionality inside Flux or force everything outside of it. Being first class means the user gets to choose and not suffer for their choice.

ToucheSir commented 3 years ago

I don't think anyone will read this post up to this line

Too late now :smile:. I think this kind of deep design discussion could definitely benefit from some synchronous communication. We should schedule a call some time soon or add an agenda item to one of the bi-weekly calls.

In the meantime:

I guess the major frameworks rely on being big enough so that they don't need to exchange models, or do they use some other format when exchanging between frameworks?

Yes and nope. That AIUI was the genesis of ONNX, NNEF etc. IIRC ONNX Runtime was a later development.

I don't think it is ideal in all cases as ONNX defines "bigger" ops (e.g. Gemm vs MatMul)

This is exactly the granularity NNlib and LinearAlgebra in the stdlib work at. If anything, Flux's layer API would be too high-level. I agree wrt. "higher-level" ops like RNN though, having those and MatMuls in the same opset feels like bad design.

ONNX supports "free" parameters in the model, so things like the bias in conv layers which are not visible inside NNlib

If you're referring to autodiff, that's where Functors.jl (the mechanism behind Flux.params) comes in. In practice, params(m::CompGraph) could simply return a collection of all the free parameters just like how Flux layers do. I think optimizers should just work as well!

Just to be clear, ONNXmutable does trace down to Base primitives like cat, +, - etc...The approach I have taken so far is to add the stuff I need and just keep doing that...

:+1:

would there be any advantage of having native layer implementations like this? I'm thinking again about the "Most people don't want to learn another library" argument which is also why I tried to make use of existing Flux implementations to the largest extent possible.

Yup, compat with Knet for one. I don't see any problem with bootstrapping with Flux, but mid-long term NNlib+potentially NNcore would be a better common substrate to target. @darsnack I think this applies to your ONNX2Flux.jl point too.

I wanted to use Functors for the export...Do you have any idea on how to make use of it?

Not really, based on your findings it may be a dead end. I'm personally more in camp tracing, but that's very much a personal bias towards the "ML is an interpreter/compiler problem" school of thought.

...It defines node operations like Loop and If...

So does e.g. WASM, yet that is 100% Turing complete. I'm glossing over call et al., and ONNX IR is not Turing complete AFAIK, but that doesn't feel like a limitation for ML models specifically. Perhaps Mjolnir-style inlining and partial evaluation is mandatory for this though, I've not looked into it either.

One thing I don't think Mjolnir/XLA solves is the world age problem...global scope

I'm pretty hazy on world-age issues, but the global scope thing is not a problem with XLA. AIUI ONNX.jl literally generates and evals source code, whereas Mjolnir (really IRTools) is directly manipulating a partially lowered representation that the compiler middle end works with. There's certainly nothing stopping one from XLA'ing a function in another nested function and passing that to a third function in another module.

It seems from the discussion above though as if adherence to Flux is very desirable from a transfer learning perspective.

Does the following functionality constitute Flux adherence?

  1. Can get params with params
  2. Can use AD + train with gradient (or ForwardDiff, or...)
  3. Can place into a Chain or other layer container as-is
  4. Can split off a piece to use in 1.-3.

Is there anything else you think is necessary for transfer learning? My (limited) experience is that the pre-trained backbone is treated as a black box and not much interrogation of the internals happens beyond splitting and figuring out input/output sizes.
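
For what it's worth, points 1-4 in practice look roughly like the sketch below, where backbone stands in for whatever an imported ONNX model ends up being (here it is just a Chain so the snippet runs):

```julia
using Flux

backbone = Chain(Dense(10, 32, relu), Dense(32, 16, relu))  # pretend this was imported

model = Chain(backbone, Dense(16, 2))        # 3. drop it into a Chain as-is
ps = Flux.params(model[2])                   # 1./4. only the new head's parameters
x = rand(Float32, 10, 8)
y = Flux.onehotbatch(rand(0:1, 8), 0:1)
gs = gradient(() -> Flux.logitcrossentropy(model(x), y), ps)  # 2. AD works through the backbone
Flux.Optimise.update!(ADAM(), ps, gs)
```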

darsnack commented 3 years ago

I think this kind of deep design discussion could definitely benefit from some synchronous communication.

Yes I agree! Of course, I definitely don't know as much about ONNX as everyone else in this thread.

Does the following functionality constitute Flux adherence?

I think this is the right list to base the design of an ONNX package on. Stuff like what I brought up (converting ONNX models to Flux) is a separate and orthogonal issue.

I don't think anyone will read this post up to this line

😅 I did too!

DrChainsaw commented 3 years ago

I think this kind of deep design discussion could definitely benefit from some synchronous communication.

Yes, perhaps we can take a shot at crystallizing the requirements (maybe they should be called wishes in this context) and see if some components can form from this.

I agree wrt. "higher-level" ops like RNN though, having those and MatMuls in the same opset feels like bad design.

I was mainly thinking about those. I agree that it's not beautiful that they are in the same opset. I saw the motivation somewhere being just performance reasons (e.g. CUDA). At least it makes me feel a little less stupid for dreading the task of trying to come up with an algorithm which recognizes subsets of the DAG as equivalent to members of some set of higher-level operations. If one tries to keep the interoperability ambition alive, it also makes sense that importing e.g. a recurrent model actually yields something recognizable as a recurrent model.

If you're referring to autodiff, ...

I was actually thinking in completely wrong terms here, no idea why. Of course ONNX supports what I called dangling parameters above. That is what the initializers do, and that is the only mechanism to have parameters in the model at all. Just FYI, CompGraph does indeed support Functors, so params(g::CompGraph) returns all trainable parameters in the model. I also see no reason why optimizers should not work.

Yup, compat with Knet for one.

That's great! I somehow thought only Flux made use of NNlib, which is the reason for my scepticism above. This is a strong argument for defining primitives on that level. I guess one package which should be made is the one with NNlib primitives for ONNX. Nnlib2Onnx?

Mjolnir (really IRTools) is directly manipulating a partially lowered representation that the compiler middle end works with. There's certainly nothing stopping one from XLA'ing a function in another nested function and passing that to a third function in another module.

This is outside of my understanding atm, but isn't there some limitation on how dynamic this can be? I can see how XLA'ing an existing function could end up confining everything so that the compiler can do its thing, but would it be possible to do this when e.g. passing a string (or a GraphProto) to a function which XLA's whatever function that string (GraphProto) happens to represent and passes it on to a third function? I don't think this is something which needs to be resolved at this point. I guess I'm just greedily trying to obtain a small nugget of knowledge at a low cost here :)

Anyways, given that my mental block on parameters has resolved, I certainly see how one could make a potentially very simple (as in simple to build) ONNX importer by just copying pretty much the exact ONNX IR format and function APIs. Fetching params is, as you hinted, probably 'just' a matter of labelling all initializers based on whether they are marked as differentiable in the OP they are input to. In the simplest form there might be a bit too much Dict-ing around with strings for performance to be satisfactory though.

Stuff like what I brought up (converting ONNX models to Flux) is a separate and orthogonal issue.

I still think this is an important requirement to capture, and I also believe it is in the spirit of what ONNX tries to achieve. I do believe there is a set of reusable components (packages) which would allow these things to coexist without an onerous amount of duplication.

jeremiedb commented 3 years ago

Hope not to get too tangential here, but in order to get a better grasp of the implications, here is my understanding of the current status:

Currently, ONNXmutable is able to represent a model as a Flux Chain for this kind of "linear" scenario: [diagram of a purely sequential model graph]

Where things get tricky is for this kind of model: [diagram of a model graph with branching and merging paths]

Such models are ubiquitous, whether it is Inception, ResNet or language models with multi-head attention in Transformers. Typically, in most frameworks, those kinds of more complex operations can be abstracted by defining some kind of "block", or in Flux a custom functor, allowing things to be abstracted back into a "linear" form: [diagram with the branching sub-graph collapsed into a single block]

ONNXmutable can handle those more complex models to the extent that they are defined from their granular structure, rather than as a "block" operator, which makes quite a lot of sense as there can be an infinite number of variations of "blocks".

Where I'm uncertain about the proper direction is what would be a convenient way to manipulate models incorporating those more complex operations or "blocks". The ONNXmutable way appears to be one. It seems like Flux cannot in its current form, unless there were some mechanism to merge the "B-C-D" operations into a "B" functor operator which would then be Chain-compatible?

I'm not necessarily advocating for Chain compatibility, but I see value in the universality of the representation of models, in order to manipulate, share and reuse them. Although there seems to be a movement away from the symbolic representation of computation graphs towards a more imperative flavor, in the end I still understand any model from a computation graph perspective. With complex models, I see much value in the ability to visualize that graph and perhaps make the modifications from that visualization of the graph. I feel there's a key missing component in Flux in that regard, as its Chain layering seems to fall short of speaking to ONNX ResNet/Inception representations. I still lack some perspective to judge whether ONNXmutable/CompGraph or the addition of parallel/split/join verbs would best fit that role, but either way, I think it should be treated as a core "Flux" citizen.

That being said, I doubt I'm well placed to handle the implementation of such solutions. I'm however able to give some time for support around it and happy to share my user-experience feedback as needed. If a call is scheduled, I'm also open to participating.

And thanks to everyone for their input on the subject, it has been great food for thought!

ToucheSir commented 3 years ago

As much as I enjoy the simplicity and theoretical purity of a symbolic graph, practical usage seems to indicate it's a terrible representation for debugging, introspection and ergonomics in general. That said, modern frameworks like JAX and torch.script show that the two are not necessarily mutually exclusive.

Don't have much of an opinion on where parallel/split/join should live, but they could make sense as decorators or combinators for the ONNX library to use. I'm specifically thinking of n>1-ary operators like +. However, I'm not sure having explicit constructs for all of these is necessary.

Warning: everything below is highly speculative!

Looking at this model again:

[the branching model diagram from above]

Because the structure is a DAG, it can be unrolled in topological order to a linear trace of sorts:

ins = [:in]
outs = [:e]
ENV = Dict(...)
ENV[:a] = conv(ENV[:in])
ENV[:b] = conv(ENV[:a])
ENV[:c] = conv(ENV[:a])
ENV[:d] = +(ENV[:b], ENV[:c])
ENV[:e] = conv(ENV[:d])

where ins and outs are used to decide which/how many elements to take in and output from the network.

Splitting a network for e.g. transfer learning could be done like so:

ins = [:in]
outs = [:b, :c] # this could be any variable/symbol that is not used by a later op, or explicitly specified during the splitting operation
ENV = Dict(...)
ENV[:a] = conv(ENV[:in])
ENV[:b] = conv(ENV[:a])
ENV[:c] = conv(ENV[:a])
# -- cut at this layer --
# ENV[:d] = +(ENV[:b], ENV[:c])
# ENV[:e] = conv(ENV[:d])

This presumes a good enough tracing and codegen apparatus, so I'm not sure how tenable it is with our current infrastructure. Julia is definitely a better fit than Python for such an approach though.

DrChainsaw commented 3 years ago

As much as I enjoy the simplicity and theoretical purity of a symbolic graph, practical usage seems to indicate it's a terrible representation for debugging, introspection and ergonomics in general

Could this be a combination of a) bad experience with TensorFlow's opaque graph format and b) essential complexity due to general DAGs vs simple sequential models?

I have spent quite a bit of effort to make messing around with a CompGraph in the REPL a pretty ergonomic experience, which was a natural consequence of actually using the package. There are obviously things which can be improved, convenience functions to be written, etc.

I think the differences compared to Chain are due to the higher degree of generality. I do think there is a certain attractiveness in the strategy of making everything in the chain a unary op (i.e. split, join, parallel), but I think that approach can only take you so far, and when models become sufficiently complex I don't think it makes model manipulation easier.

Because the structure is a DAG, it can be unrolled in topological order to a linear trace of sorts:

This unrolling is basically what happens when executing the CompGraph, and thanks to Julia's dynamic dispatch this can also be used to unroll it into almost anything, including a tape expression like the above or an ONNX GraphProto.
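
A toy version of that memoized unrolling, with a made-up ToyNode in place of real NaiveNASlib vertices, looks something like:

```julia
struct ToyNode
    op::Function
    inputs::Vector{ToyNode}
end

# Evaluate a node by recursively evaluating its inputs, memoizing results so shared
# vertices (e.g. the one feeding both branches above) are only computed once.
function evaluate(node::ToyNode, xin; memo=IdDict{ToyNode, Any}())
    haskey(memo, node) && return memo[node]
    args = isempty(node.inputs) ? (xin,) : map(v -> evaluate(v, xin; memo=memo), node.inputs)
    return memo[node] = node.op(args...)
end

# in -> a -> (b, c) -> d = b + c -> e, mirroring the unrolled trace above
inn = ToyNode(identity, ToyNode[])
a = ToyNode(x -> 2 .* x, [inn])
b = ToyNode(x -> x .+ 1, [a])
c = ToyNode(x -> x .- 1, [a])
d = ToyNode(+, [b, c])
e = ToyNode(x -> x ./ 2, [d])
evaluate(e, [1.0, 2.0])  # == [2.0, 4.0]
```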

I have explored this a little bit and I have not found a way to get around the world age barrier. Here is a Discourse post which covers that matter: https://discourse.julialang.org/t/improve-performance-of-computation-graph-evaluation/32873

There are GeneralizedGenerated and RuntimeGeneratedFunctions, but they come with caveats. I have tried the former without success but haven't gotten around to trying the latter.

In case you want to play around with it I can dig up my compile-CompGraph-as-a-function script to save you a few minutes. I did try the 'mutable references' tape approach suggested in the discourse thread and it did not perform better than dict memoization, but I didn't go out of my way to optimize it. I have the code for that too somewhere in case you'd like to give it a spin. Btw, the issues I mentioned in the thread seem to have been resolved in Zygote so I don't feel a pressing need to improve the CompGraph execution performance.

Splitting a network for e.g. transfer learning could be done like so:

Yup, but why would this be considered easier to do with a Julia expression compared to a structure which is already designed to do exactly this? Fwiw, this is a pretty deep rabbit hole where the simple cases make one believe that this is straightforward. The ENAS paper referred to this as "butterfly effects" and pretty much copped out by limiting the search space imo. This is something which is doable when you are in control of the search space (i.e. the user is not allowed to make arbitrary modifications to the graph; everything is pre-baked).

In my mind, offering something which just fails (e.g. results in a corrupt/misaligned model) for things which look perfectly reasonable to do from the API is not desirable, especially when NaiveNASlib already takes care of this with pretty high confidence. It's much easier to modify or create a new API for NaiveNASlib than it is to build the thing that works for the simple cases and then spend the rest of eternity patching every single issue which pops up with special handling.

@jeremiedb

I think this is a very good summary of the current state of model representation and I don't think I have any definite answers to the issues you raised. One way to defer this discussion is to build the ONNX ecosystem so that the model representation can be built independently of the lion's share of the code (which is op conversion), one caveat being covered below. I need to put this topic aside as work calls, but I'll try to give a proper reply when I find the time.

I think it should be treated as a core "Flux" citizen.

I'm a bit torn on this one. I like Flux being bare bones and advocating just writing everything as a function. I can however see the difficulty of letting the ecosystem come up with many alternative model representations for when that is not enough (programmatic model manipulation being one such case). As stated before, had there existed a canonical compgraph package in Julia when I started NaiveNASlib I would have tried to build on top of that. I started out using LightGraphs, but it turned out not to be the right tool for the job for various reasons.

darsnack commented 3 years ago

I think that approach can only take you so far, and when models become sufficiently complex I don't think it makes model manipulation easier.

I agree with this, especially from the perspective of NAS. If the goal involves lots of arbitrary model manipulation (e.g. programmatically directed manipulation), then it's better to work with a graph-based representation.

why would this be considered easier to do with a Julia expression compared to a structure which is already designed to do exactly this

I would say the reasons are similar to working with Exprs to represent an AST vs an actual tree data structure. The easiest/most ergonomic interaction with the AST is when you manipulate locally à la MacroTools' prewalk/postwalk. In this localized pattern, it is just ever so slightly more convenient when the data is the current Expr node itself and not the full graph. Can you perform the same pattern by walking the graph? Absolutely, but it is just slightly less attractive. The Expr tree also has its limitations. If you want to perform a transformation that looks outside the current node, then it quickly becomes more useful to work with the graph representation.
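
A toy illustration of that localized pattern with MacroTools (nothing Flux-specific, just an Expr rewrite):

using MacroTools

ex = :(σ.(W2 * σ.(W1 * x .+ b1) .+ b2))

# postwalk visits one node at a time, bottom-up; the rewrite only looks at the current node.
MacroTools.postwalk(node -> node === :σ ? :relu : node, ex)
# => :(relu.(W2 * relu.(W1 * x .+ b1) .+ b2))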

Flux has one more limitation in this regard that @DrChainsaw touched on. While Expr can be "complete" because we know the full set of expressions in a Julia program, Flux layers cannot. So, you'll always be playing catch-up against a generic graph representation to match expressiveness.

I like Flux being bare bones and advocating to just write everything as a function.

Definitely, but functions are limited in that their lifetime is constrained around the function call, and they don't exist beyond that. So unless you are willing to pass in parameters to every function call, you need something more permanent to capture state. Closures can give you this, but they lack specificity (i.e. they are anonymous and their fields are not consistent). So, while a higher order function can capture some state and return a closure, that state is not easily referenced. But (in Julia) closures are just anonymous structs, so we come full circle to why structs are useful. I would push that we want everything as functions, but structs are just stateful, named functions.
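
A toy illustration of that last point (nothing Flux-specific):

# Closure: captures `w`, but the captured state is anonymous and awkward to reference.
make_scale(w) = x -> w .* x

# Struct: the same state, but named, typed and dispatchable -- a stateful, named function.
struct Scale{T}
    w::T
end
(s::Scale)(x) = s.w .* x

f = make_scale([1.0, 2.0])
s = Scale([1.0, 2.0])
f([3.0, 4.0]) == s([3.0, 4.0])  # true, but only `s.w` is a stable, documented field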

And in Julia, structs are not just collections of data, but also types. As I talked about in this comment, there is a huge step in ergonomics when something is lifted into the type system. Julia's core strengths are built around the type system. Parallel would do this for the kind of graph that @jeremiedb mentioned, and it would be standardized (i.e. no different "blocks" for every model writer).
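
For context, the kind of Parallel layer being discussed works roughly like this (a sketch; sizes are arbitrary): each branch is applied to the input and the results are merged with a user-supplied connection.

using Flux

# One input fans out to two branches; `+` merges the branch outputs.
block = Chain(
    Dense(10, 10, relu),
    Parallel(+, Dense(10, 10), identity),  # residual-style split/join as one standard layer
    Dense(10, 2),
)

block(rand(Float32, 10, 8))  # 2×8 output for a batch of 8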

One way to defer this discussion is to build the ONNX ecosystem so that model representation can be build independend of the lions share of the code

I think this is the right approach too. A design call would be good, but we are getting too stuck in the mud on which (if any) type of graph should go into Flux as a layer. The ONNX side of things can exist completely independently of whether Parallel is/isn't part of Flux. Later down the line we can talk about an ONNX2Flux.jl package, but it seems like counting our chickens before they hatch to do that without the ONNX part in place.

opus111 commented 3 years ago

One possibility is adding Stack Semantics to Chain. This is the approach used by Google Brain's Trax, and was discussed in the topic "FluxTracks?" on Zulip.

The idea is that operations pop inputs from the top of a stack, and push their outputs. To split values, one just duplicates them on the stack, and merging just requires popping more than one value. Splits, Residuals and Combinations can all be implemented this way.

The ONNX API would then read and write Flux Chains with these new features.
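
A rough Julia sketch of the idea (all names here are made up; this is not an existing Flux or Trax API):

# Each op consumes `n_in` values from the top of the stack and pushes its output(s).
struct StackOp{F}
    f::F
    n_in::Int
end

dup()     = StackOp(x -> (x, x), 1)   # split: duplicate the top value
merge2(f) = StackOp(f, 2)             # merge: combine the two topmost values
lift(f)   = StackOp(f, 1)             # wrap an ordinary unary layer/function

function run_stack(ops, x)
    stack = Any[x]
    for op in ops
        args = ntuple(_ -> pop!(stack), op.n_in)
        out = op.f(args...)
        out isa Tuple ? append!(stack, out) : push!(stack, out)
    end
    return pop!(stack)
end

# A residual block, y = x + f(x), without any explicit graph:
run_stack([dup(), lift(x -> 2x), merge2(+)], 3)  # 9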

Here is a Python example of a multi-headed Transformer decoder in Trax, from Coursera's NLP course:


from trax import layers as tl  # Trax layer combinators (Serial, Branch, Residual, Dense, ...) used below

def CausalAttention(d_feature,
                    n_heads, 
                    compute_attention_heads_closure=compute_attention_heads_closure,
                    dot_product_self_attention=dot_product_self_attention,
                    compute_attention_output_closure=compute_attention_output_closure,
                    mode='train'):
    """Transformer-style multi-headed causal attention.

    Args:
        d_feature (int):  dimensionality of feature embedding.
        n_heads (int): number of attention heads.
        compute_attention_heads_closure (function): Closure around compute_attention heads.
        dot_product_self_attention (function): dot_product_self_attention function. 
        compute_attention_output_closure (function): Closure around compute_attention_output. 
        mode (str): 'train' or 'eval'.

    Returns:
        trax.layers.combinators.Serial: Multi-headed self-attention model.
    """

    assert d_feature % n_heads == 0
    d_head = d_feature // n_heads

    ComputeAttentionHeads = tl.Fn('AttnHeads', compute_attention_heads_closure(n_heads, d_head), n_out=1)

    return tl.Serial(
        tl.Branch( # creates three towers for one input, takes activations and creates queries keys and values
            [tl.Dense(d_feature), ComputeAttentionHeads], # queries
            [tl.Dense(d_feature), ComputeAttentionHeads], # keys
            [tl.Dense(d_feature), ComputeAttentionHeads], # values
        ),

        tl.Fn('DotProductAttn', dot_product_self_attention, n_out=1), # takes QKV
        # HINT: The second argument to tl.Fn() is an uncalled function
        # Since you are dealing with closures you might need to call the outer 
        # function with the correct parameters to get the actual uncalled function.
        tl.Fn('AttnOutput', compute_attention_output_closure(n_heads, d_head), n_out=1), # to allow for parallel
        tl.Dense(d_feature) # Final dense layer
    )

def DecoderBlock(d_model, d_ff, n_heads,
                 dropout, mode, ff_activation):
    """Returns a list of layers that implements a Transformer decoder block.

    The input is an activation tensor.

    Args:
        d_model (int):  depth of embedding.
        d_ff (int): depth of feed-forward layer.
        n_heads (int): number of attention heads.
        dropout (float): dropout rate (how much to drop out).
        mode (str): 'train' or 'eval'.
        ff_activation (function): the non-linearity in feed-forward layer.

    Returns:
        list: list of trax.layers.combinators.Serial that maps an activation tensor to an activation tensor.
    """

    # Create masked multi-head attention block using CausalAttention function
    causal_attention = CausalAttention( 
                        d_model,
                        n_heads=n_heads,
                        mode=mode
                        )

    # Create feed-forward block (list) with two dense layers with dropout and input normalized
    feed_forward = [ 
        # Normalize layer inputs
        tl.LayerNorm(),
        # Add first feed forward (dense) layer (don't forget to set the correct value for n_units)
        tl.Dense(d_ff),
        # Add activation function passed in as a parameter (you need to call it!)
        ff_activation(), # Generally ReLU
        # Add dropout with rate and mode specified (i.e., don't use dropout during evaluation)
        tl.Dropout(dropout,mode=mode),
        # Add second feed forward layer (don't forget to set the correct value for n_units)
        tl.Dense(d_model),
        # Add dropout with rate and mode specified (i.e., don't use dropout during evaluation)
        tl.Dropout(dropout,mode=mode)
    ]

    # Add list of two Residual blocks: the attention with normalization and dropout and feed-forward blocks
    return [
      tl.Residual(
          # Normalize layer input
          tl.LayerNorm(),
          # Add causal attention block previously defined (without parentheses)
          causal_attention,
          # Add dropout with rate and mode specified
          tl.Dropout(dropout,mode=mode)
        ),
      tl.Residual(
          # Add feed forward block (without parentheses)
          feed_forward
        ),
      ]

opus111 commented 3 years ago

And here is some documentation on Trax

https://trax-ml.readthedocs.io/en/latest/notebooks/layers_intro.html#2.-Inputs-and-Outputs

ToucheSir commented 3 years ago

Keeping this brief to incentivize a design call, but in short I was mistakenly assuming that ONNXMutable wanted to get rid of NaiveNASlib altogether because the graph representation was insufficient for certain use-cases. Since that's not the case and CompGraph seems to have great "Flux adherence" already, you can safely disregard all my bumbling on about codegen :)

My only remaining question is whether operations like RNN can be moved into an interface package so that ONNXMutable (side note: can we re-purpose the ONNX.jl repo + name for this? It's effectively archived anyhow) doesn't have to explicitly depend on Flux.


Also, just to wrap up my off-topic digression about symbolic graphs:

Could this be a combination of a) bad experience with tensorflows opaque graph format and b) essential complexity due to general DAGs vs simple sequential models?

It was meant as more of a general statement/observation on gripes I've seen frequently expressed by ML framework users. Personally speaking, I was thinking not of CompGraph, but more of how "list of layers" containers like Chain, torch.nn.Sequential, flax/stax.Serial, tf.keras.Sequential et al. complicate stack traces, debugging, etc. These building blocks are irrefutably useful, but there's definitely a need for functionality or tooling that helps with diagnosing inter-layer issues like shape mismatches. That's a discussion for another thread though, so I'll stop polluting this one :)

DrChainsaw commented 3 years ago

I was mistakenly assuming that ONNXMutable wanted to get rid of NaiveNASlib altogether because the graph representation was insufficient for certain use-cases.

Sorry if I seem indecisive about this, but there are certainly disadvantages which don't really have much to do with the representation itself. For the automatic size-alignment stuff to work, one needs to be able to formulate the size relations of each op as MILP constraints. The vast majority of ONNX ops fit into one of the pre-baked types in NaiveNASlib and require no further effort, but there is a significant portion which doesn't, for example Reshape and Flatten.
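
To give a flavor of what this means (purely illustrative, not NaiveNASlib's actual code): for something like a concat vertex the size relation is easy to state as a linear constraint, e.g. with JuMP and Cbc, whereas ops like Reshape need relations that are much harder to pin down this way.

using JuMP, Cbc

m = Model(Cbc.Optimizer)
@variable(m, nout[1:3] >= 1, Int)             # sizes of two inputs and the concat output
@constraint(m, nout[3] == nout[1] + nout[2])  # concat: output size is the sum of input sizes
@objective(m, Min, sum(nout))
optimize!(m)
value.(nout)  # [1.0, 1.0, 2.0]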

Just to be clear, the model works and can be trained as normal even without this, but if you try to modify the graph structure the "promise" of NaiveNASlib to keep the graph shape-aligned might be broken. While one can argue something like "well, what works works and that's better than nothing", it risks providing an unpolished feel and a sense of uncertainty as to whether things will really work out or not.

Just adding the constraints is of course an option and it is what I have been doing until now, but given that this might well be outside the comfort zone of potential contributors (it certainly is outside mine :) ), along with the possibility that some ops might just not be expressible in a MILP problem, it seems like an unnecessary constraint (pun kinda intended) to put on development.

Ideally, imo, the current ONNXmutable would be just one out of a handful of ONNX import packages, aimed at people who are serious about graph modification and are prepared to pay the price of adding constraints for new ops. The question is how to slice things so that basically the same stuff doesn't get reimplemented in each package. This is what I tried to break down in the OP.

Making a super simple CompGraph similar to the one in NaiveNASlib, but without the mutation stuff, is not many lines of code, so having such an implementation inside a more generic ONNX importer does not have to cost much.

vtjeng commented 3 years ago

Btw, if the graph is linear then one can just do this:

julia> chain = Chain(layer.(vertices(compgraph)[2:end])...)

@DrChainsaw - is it possible to determine if the graph is linear (and if this transformation results in the same output)?

(Either way, this would be worth documenting in ONNXmutable - this is what I was really going for with https://github.com/DrChainsaw/ONNXmutable.jl/issues/47!)

DrChainsaw commented 3 years ago

this is what I was really going for

Hehe, there really seems to be very little love for the CompGraph format.

is it possible to determine if the graph is linear (and if this transformation results in the same output)?

I'm no graph expert and I have certainly been burned by how difficult it seems to be to write code that reasons about graphs, but I think that a DAG is "linear" if all nodes except the input and output nodes have exactly one input and one output edge.
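
Something along these lines ought to do it, assuming NaiveNASlib-style inputs/outputs accessors for the vertices (untested):

# "Linear" here: no vertex has more than one input or more than one output edge,
# so the whole graph can be flattened into a Chain without losing anything.
islinear(compgraph) = all(vertices(compgraph)) do v
    length(inputs(v)) <= 1 && length(outputs(v)) <= 1
end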

vtjeng commented 3 years ago

Hehe, there really seems to be very little love for the CompGraph format.

I think the reason why I wanted to see the 'underlying' FluxML representation was that I was trying to figure out what layers / ops I would have to enable to work with JuMP variables for my optimization problem (e.g. Dense, relu, ...), and I didn't really understand what CompGraph was doing on top of that (what additional methods would I have to provide definitions for so I can reliably guarantee that importing a particular type of network would work out of the box?)

DrChainsaw commented 3 years ago

what additional methods would I have to provide definitions for so I can reliably guarantee that importing a particular type of network would work out of the box?

There should not be a need for any. Just like with Chain, the CompGraph vertices can have any function inside them, and when evaluated the graph will just pass the outputs to the right nodes, just like Chain does. Chain is really just a much simpler CompGraph which only supports what is here called a 'linear' graph.

If you want to be able to mutate the graph (change sizes or layers, or remove/add nodes/edges) and have NaiveNASlib ensure that parameter sizes across the whole graph stay consistent, then you also need to provide metadata for that or write the constraints yourself, but to just evaluate the graph nothing extra is needed.

lpiert commented 2 years ago

Why is there no mature package that can simply load an ONNX model, feed it the inputs it requires, and get the predicted result?

DrChainsaw commented 2 years ago

Why is there no mature package that can simply load an ONNX model, feed it the inputs it requires, and get the predicted result?

To just satisfy your wish without trying to interpret it: there is https://github.com/jw3126/ONNXRunTime.jl, which wraps Microsoft's onnxruntime, afaik the most complete implementation of the ONNX spec. This will allow you to do both inference and training from Julia, but ofc any Julia-native AD will not work.
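
For example, something along these lines (the file name and the "input" key are placeholders; the actual input names depend on the model):

import ONNXRunTime as ORT

model = ORT.load_inference("model.onnx")
x = Dict("input" => rand(Float32, 1, 3, 224, 224))  # key must match the model's input name
y = model(x)                                        # Dict of output name => array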

Opinion piece

Despite the fact that most popular deep-learning operations look the same on the surface, there is no global standard for how to implement them which everyone just follows. ONNX is an attempt to consolidate this, but it is really just another implementation with the classical standards problem.

The ONNX spec is sprawling, with a lot of operators (many of which are a bit fringe and situational), multiple versions of each operator, and multiple flags and configuration parameters for most of the operators. Just from the sheer volume of things to implement (and test and maintain), "full support" is a huge task. If you browse the ONNX issues and discussion forums there are a number of requests to keep the standard implementable (which are met with sympathy, but also with the standard "but the market wants it" justification which is the predominant source of feature creep in all software, imo).

The ONNX.jl repo now has the goal of being pretty much a faithful implementation of the spec (rather than the doomed-to-hit-a-dead-end approach I went for in ONNXNaiveNASflux). It is however not backed by any megacorp, which means that the amount of developer hours is going to be a bottleneck.

I guess that no one in their right mind is going to sit and implement all versions of all operators in the ONNX spec in their free time. Although I have no insight into onnxruntime, I'm 99.999% sure that the people who work on it are salaried by Microsoft to do so, and that even they have a prioritized backlog of what is most useful to do. If ONNX.jl is going to fly, it will need individual contributors to get anywhere. At least for me, adding support for an operator which allows me to solve some problem I have right now is a lot more rewarding, and over time that ought to build up to support for the most used parts of the spec. It is a bit of a different mindset than in the Python ecosystem, where the expectation is that everything you would ever need is written by some megacorp in C/C++.

One is ofc justified to ask "why should I implement it in Julia when it is already available in Python?", and for this there are no good answers (if one is looking at it purely from a minimum-effort-to-solve-a-problem perspective). I think this is why e.g. Julia Computing seems to focus on the use cases where Python does not have a good story (e.g. the SciML stuff), to leverage Julia's strengths.

darsnack commented 2 years ago

We've started overhauling ONNX.jl since this issue was opened, and progress is tracked on that repo, so I am closing this.