fff-rs / juice

The Hacker's Machine Learning Engine

Juice for Deep Reinforcement Learning #155

Open hweom opened 2 years ago

hweom commented 2 years ago

Hello,

I'm investigating a potential use of the Juice framework for deep reinforcement learning (I'm also learning RL and deep learning as I go, so apologies for potentially newbie questions). RL requires simultaneously training the net and using it for predictions. I've found several issues, and I'm not sure whether they are design decisions or implementation shortcuts:

  1. It looks like if I configure the net for minibatch training, I can't then use it to make predictions on just one input. I get the panic below when I try.
  2. Solver has a network() method, whose comment says that "This is the recommended method to get a usable trained network." However, you can't call forward() on it, since that requires a mut ref.

I can probably work around 1 (like artificially creating a batch by replicating a single input vector) and 2 (by using mut_network()), but it doesn't look right.

Is this something that can (should?) be fixed in the implementation? I'm happy to provide PRs (but will likely require technical guidance).

Thank you!

drahnr commented 2 years ago

Hey, it should definitely work, and PRs are very welcome! I have to look into the details again; could you provide a simple example flow that yields your panic? The reference does not show a line number. At least a backtrace would be very helpful!

hweom commented 2 years ago

Here's the crash stack:

```
thread 'main' panicked at 'Input Shape Mismatch
Expected [32, 3592]
Actual [3592]', /home/mikhail/Documents/juice_my/juice/src/layer.rs:540:17
stack backtrace:
   0: rust_begin_unwind
             at /rustc/c8dfcfe046a7680554bf4eb612bad840e7631c4b/library/std/src/panicking.rs:515:5
   1: std::panicking::begin_panic_fmt
             at /rustc/c8dfcfe046a7680554bf4eb612bad840e7631c4b/library/std/src/panicking.rs:457:5
   2: juice::layer::Layer<B>::forward
   3: <evolution::rl::solver::JuiceSolver<SolverB,B> as evolution::rl::solver::Solver>::forward
   4: <evolution::rl::trainer::Trainer<S,A> as evolution::rl::Policy<S,A>>::get_action
   5: <evolution::env::rl_sheep::RlSheep as evolution::env::actor::Actor>::turn
   6: evolution::env::world::World::cycle
   7: evolution::main
```

hweom commented 2 years ago

I've tried to dive into the source code a bit, and so far I'm very confused. Not sure if this is the best way to ask specific technical questions (let me know if you prefer a different way), but here are some:

  1. It seems Juice tries to support different types of solvers, but the assumptions of SGD are leaking into the parts that ideally should be abstracted:
    • Solver (an implementation-agnostic class) has a train_minibatch() function.
    • Some layer implementations are aware that inputs can be arranged in batches.
  2. Not sure I follow the logic behind the separation of layer functionality into Layer and ILayer.
    • ILayer seems to just compute the gradients, while Layer does the plumbing and stores the data. But then ILayer has functions like input_data(), inputs_gradients(), etc., which make it look like the data should be stored inside the ILayer implementations (it looks like those functions are not used/overridden, so maybe this needs a cleanup?).
    • ILayer has functions that configure the Layer logic, like auto_output_blobs(), min_output_blobs(), is_container(), etc., which seem more like static parameters. Although one may argue that this is cosmetics...
    • Overall, it feels that the interactions between Layer and ILayer could be clarified (maybe with some renaming: at the beginning, just from the names, I assumed Layer is an implementation of ILayer).
  3. A similar note on the interaction between Solver and ISolver. This is additionally complicated by the fact that there is apparently a sub-hierarchy of SGDSolvers (naming again: it's a trait derived from ISolver, but it doesn't have the I prefix).
  4. For some functionality, layers delegate to their children (like forward()), but for others the master layer just collects and flattens all the data from nested layers and does everything at the top level (update_weights()). I think this mostly follows from the decision to make containers (Sequential) layers too, but here this "unification" starts to look forced. There is very little in common between containers and "business" layers, and having to keep them both under the same trait makes the implementation awkward. (Besides, are there any actually useful containers other than Sequential? It seems layer sequencing could be handled more clearly outside of the layer paradigm.)

Sorry if this looks like criticism -- I've tried to capture the thoughts of a person who is fresh to the code. Hopefully you find it useful :)

drahnr commented 2 years ago

> I've tried to dive into the source code a bit, and so far I'm very confused. Not sure if this is the best way to ask specific technical questions (let me know if you prefer a different way), but here are some:

> 1. It seems Juice tries to support different types of solvers, but the assumptions of SGD are leaking into the parts that ideally should be abstracted:
>
>    * `Solver` (an implementation-agnostic class) has a `train_minibatch()` function.

For most solvers, this seems reasonable, and worst case, one could run multiple single pass runs within the minibatch. So not much pain with a bit of gain :)

>    * Some layer implementations are aware that inputs can be arranged in batches.

Off the top of my head, I don't quite remember where that is -- could you add a reference?

> 2. Not sure I follow the logic behind the separation of layer functionality into `Layer` and `ILayer`.
>
>    * `ILayer` seems to just compute the gradients, while `Layer` does the plumbing and stores the data. But then `ILayer` has functions like `input_data()`, `inputs_gradients()`, etc., which make it look like the data should be stored inside the `ILayer` implementations (it looks like those functions are not used/overridden, so maybe this needs a cleanup?).

`ILayer` wraps the operation, while `Layer` stores generic metadata that is abstract over all of them. Technically that could be encoded in the `ILayer` trait as well. This could indeed be a candidate for cleanup.

>    * `ILayer` has functions that configure the `Layer` logic, like `auto_output_blobs()`, `min_output_blobs()`, `is_container()`, etc., which seem more like static parameters. Although one may argue that this is cosmetics...

Imho it's exactly that, and by adding them to the `ILayer` trait, it allows them to be used independently.

`Layer` could indeed be entirely removed. The usage would change though, and I am not sure this is better than what is currently there.

>    * Overall, it feels that the interactions between `Layer` and `ILayer` could be clarified (maybe with some renaming: at the beginning, just from the names, I assumed `Layer` is an implementation of `ILayer`).

This is indeed a pitfall; `LayerOperation` seems to be more reasonable and in line with what it does.
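
Roughly, the split could then look something like this -- purely a hypothetical sketch to illustrate the naming, not the current API:

```rust
// Hypothetical sketch of the proposed renaming: the trait describes the
// per-layer computation, while the wrapper keeps the metadata and plumbing
// that is common to all layers. Names and signatures are illustrative only.
trait LayerOperation {
    /// Compute outputs from inputs.
    fn forward(&mut self, inputs: &[Vec<f32>]) -> Vec<Vec<f32>>;
    /// Compute input gradients from output gradients.
    fn backward(&mut self, output_gradients: &[Vec<f32>]) -> Vec<Vec<f32>>;
}

/// Generic wrapper (today's `Layer`) holding the shared metadata.
struct Layer<Op: LayerOperation> {
    name: String,
    operation: Op,
}

impl<Op: LayerOperation> Layer<Op> {
    fn forward(&mut self, inputs: &[Vec<f32>]) -> Vec<Vec<f32>> {
        // Shared plumbing (logging, shape checks, buffer caching) would go here.
        self.operation.forward(inputs)
    }
}
```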

> 3. A similar note on the interaction between `Solver` and `ISolver`. This is additionally complicated by the fact that there is apparently a sub-hierarchy of `SGDSolver`s (naming again: it's a trait derived from `ISolver`, but it doesn't have the `I` prefix).

The whole I prefix is some cruft that should be removed ASAP. It's some naming scheme carried over from C++ by the original authors.

> 4. For some functionality, layers delegate to their children (like `forward()`), but for others the master layer just collects and flattens all the data from nested layers and does everything at the top level (`update_weights()`). I think this mostly follows from the decision to make containers (`Sequential`) layers too, but here this "unification" starts to look forced. There is very little in common between containers and "business" layers, and having to keep them both under the same trait makes the implementation awkward. (Besides, are there any actually useful containers other than `Sequential`? It seems layer sequencing could be handled more clearly outside of the layer paradigm.)

Not convinced this would help much. Currently there is only Sequential, but the idea of making everything a layer does maintain flexibility -- e.g. what would be needed is a split-layer and a concat-layer type, both of which could be layers themselves, and imho it should be fairly ok to impl with the current paradigm. If you have a concrete proposal for how to improve this, I am happy to discuss that :)
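
For illustration, sticking with the hypothetical `LayerOperation` sketch above (again, made-up types rather than the actual Juice API), a concat layer would just be one more implementation of the same trait:

```rust
// Hypothetical concat "layer": it only merges its inputs into one output,
// yet it fits the same operation trait as the compute-heavy layers, which is
// what keeps the everything-is-a-layer approach flexible.
struct Concat;

impl LayerOperation for Concat {
    fn forward(&mut self, inputs: &[Vec<f32>]) -> Vec<Vec<f32>> {
        // Concatenate all input vectors into a single output vector.
        let merged: Vec<f32> = inputs.iter().flatten().copied().collect();
        vec![merged]
    }

    fn backward(&mut self, output_gradients: &[Vec<f32>]) -> Vec<Vec<f32>> {
        // The gradient of a concat is just the matching slices of the output
        // gradient; the actual splitting is omitted here for brevity.
        output_gradients.to_vec()
    }
}
```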

> Sorry if this looks like criticism -- I've tried to capture the thoughts of a person who is fresh to the code. Hopefully you find it useful :)

No, the criticism is very much needed and very very much appreciated. Thanks for taking the time to dive into the code!

hweom commented 2 years ago

> This is indeed a pitfall; `LayerOperation` seems to be more reasonable and in line with what it does. The whole `I` prefix is some cruft that should be removed ASAP.

OK, let me take a stab at some renaming. I'm thinking of this:

I think it also makes sense to move update_weights() out of Layer, because this operation is done once over the whole network and not for each layer individually. Not sure where to put it yet, maybe into Solver?

However, this is just a refactor, and the core issue in this comment persists.

drahnr commented 2 years ago

Not sure this is the best idea, since the user-facing part should remain named Layer. The public API is not perfect, but I don't want to break it for a rename.

I don't think there is a good place for fn update_weights; Solver sounds reasonable though.

drahnr commented 2 years ago

> However, this is just a refactor, and the core issue in this comment persists.

I think fn network_mut could work, but since all layers have internal mutability anyway, feel free to add this as needed.

My approach would be to simulate a minibatch: take a sample, duplicate it internally to match the true minibatch size, and pretend it was a minibatch. This is the dirty variant. My time seems to be rather over-committed lately, so I did not have the time yet for a deep dive into the issue.
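
Something along these lines, as a sketch only -- the `forward_batch` closure stands in for whatever batched forward call is used; none of these names are the actual Juice API:

```rust
/// Sketch of the "fake minibatch" workaround: tile one input vector to the
/// configured batch size, run a normal batched forward pass, and read back
/// only the first sample's output. All names here are placeholders.
fn predict_single(
    single_input: &[f32],
    batch_size: usize,
    output_len: usize,
    forward_batch: impl Fn(&[f32]) -> Vec<f32>, // batched forward pass
) -> Vec<f32> {
    // Duplicate the input `batch_size` times so the shape check passes.
    let mut batch = Vec::with_capacity(single_input.len() * batch_size);
    for _ in 0..batch_size {
        batch.extend_from_slice(single_input);
    }
    // Run the batched forward pass and keep only the first sample's output.
    let batched_output = forward_batch(&batch);
    batched_output[..output_len].to_vec()
}
```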

Happy to discuss PRs though. I just want to urge you to first write down an example of how you'd expect to use the API and create a draft; then we can discuss, based on that, how to best achieve an impl that provides the desired features -- if that sounds good to you :)

hweom commented 2 years ago

> the user-facing part should remain named Layer

Well, Layers are not exactly user-facing. The user mostly deals with LayerConfigs. The only place where a user can get their hands on a layer is through Solver::network(). But I'm not sure it's actually a very useful public-facing API in the first place -- the Layer API is designed for Solver, and it doesn't make much sense as an independent type (just by looking at its documentation I can't figure out how to use it). I would probably make Solver::network() return a top-level Network class that only exposes functions that can actually be useful outside of Solver.

> I don't think there is a good place for fn update_weights; Solver sounds reasonable though.

Another possibility is to put it into this new Network class I mentioned above, which kind of makes sense, since this operation is done once for the whole network.
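
Roughly, I imagine the exposed surface being something like this -- a hypothetical sketch, not a concrete proposal for the exact signatures:

```rust
/// Hypothetical public-facing `Network` trait: only the operations that are
/// useful outside of `Solver`, nothing about blobs, workers, or containers.
pub trait Network {
    /// Run inference on an input (or a batch, if batching stays explicit).
    fn forward(&mut self, input: &[f32]) -> Vec<f32>;

    /// Apply the accumulated gradients once for the whole network -- this is
    /// where `update_weights()` could move to.
    fn update_weights(&mut self, learning_rate: f32);
}
```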

> My approach would be to simulate a minibatch.

Yeah, that's what I do now. It's a 32x increase in compute cost (in order to get one prediction I need to run the full batch, which is 32 vectors), but I don't see an easy way around it, unfortunately. I have some thoughts on how to maybe improve this, but it would require rearchitecting the framework (specifically, storing buffers not in layers but in some external context).

> first write down an example of how you'd expect to use the API and create a draft; then we can discuss

Sure, I don't want to land a big PR on you without aligning on the approach first. I value your and my time :) Let me know if you want me to create a separate tracking issue for the renaming we're discussing -- it seems we're discussing many things in parallel, so it becomes confusing.

drahnr commented 2 years ago

If this solves the practical issue for reinforcement learning without complicating the regular use case, go for it :)

hweom commented 2 years ago

Yeah, I'm still investigating the code. Simultaneous learning and inference is probably not the biggest blocker right now. I just recently discovered I will probably need asynchronous multi-task learning, so I'm trying to understand whether it's supported (I don't think it is, out of the box) or how hard it would be to add.

I'm currently ingesting the code and creating a technical description of Juice internals, will share it when it's done. From there, it'll be easier to understand if it's possible/acceptable to adapt it for RL purposes.

I don't have a ton of spare time myself (have a full-time job and a family), so progress is slow.

hweom commented 2 years ago

I have a document (still in progress) here: https://docs.google.com/document/d/1arr0trMv6lDJmy9ejThlnoSri8BGgUJ64LVA0IzBTXE. It's open for viewing by all; I can add collaborators if requested (I think it'll require a Gmail account).

hweom commented 2 years ago

OK, I think the document is ready (same link as above). I would appreciate a review (the link gives read-only access; I can only share for commenting to a Gmail account, so IM me if you're interested).

Some additional notes and thoughts that I had while working on the doc:

  1. I'm sold on the idea of containers being layers too. Component-based system organization (and Juice layers are exactly the components with strictly defined flow between them) is a natural choice for building graph-like systems. Supporting nested components (where a component can host child components) is very appealing for ML too, since one can train a net and then use it as a subnet in a larger net just by adding a single layer encapsulating the original net. However, I think the Juice implementation of this paradigm has room for improvement:

    • Component-based design requires that all components behave exactly the same when viewed from outside (this is the Liskov substitution principle in SOLID). For Juice, it means that all layer types can be used interchangeably. I don't think this is true now -- one can't use a non-container layer as a top-level one (because it won't create input blobs), and Sequential layers can't be used inside a Sequential layer.
    • Layer contracts are not well-defined (this is closely related to the bullet point above). Details about how the inputs/outputs and data blobs are created and connected are buried deep in the implementation, and it's hard to understand what is an upfront design decision and what has grown organically.
  2. The logic separation between the Layer and ILayer worker is confusing, and these two entities are extremely coupled. I understand what it is trying to achieve -- concentrate the common logic between all non-container layers so it doesn't have to be repeated in every implementation. But the fact that Layer logic behaves radically differently for container vs non-container workers creates a bit of a split-brain situation where the reader needs to track both possibilities simultaneously when reading the code. This is aggravated by the fact that the distinction between container and all other layers is not always explicit in Layer code -- it all depends on the implementation of certain ILayer trait functions, so one needs to constantly check how they are implemented in different layers.

  3. As someone who's coming to Juice without a strong ML foundation, I would appreciate more explicit and rigorous terminology. It's hard to connect Juice variables like input_gradient, output_gradient, etc. to backpropagation definitions on the Web, especially given that Juice's approach is not exactly a canonical one (e.g., an activation function is not considered a "layer" in common literature). And more generally, external links to Wikipedia in Juice function comments are not very helpful, as Wikipedia articles are very general and do not help to understand that particular function (which is more about how this function fits together with other Juice pieces).

  4. Separation of logic between the layers and the backend seems ... unbalanced. This seems to be mostly driven by the functionality available in cuDNN -- so layers like ReLU, convolution, etc. are just thin glue code between Juice and cuDNN, while layers like NegativeLogLikelihood do all computation without using any of the backend facilities. Again, I understand why this is so (we want to leverage accelerated computation if it's available), but from an architectural perspective, it's a bit odd that the interface between subsystems is driven by one of the implementations.

  5. Data flow is obscured since layers are holding references to data blobs internally. This is a perfectly reasonable technical implementation, but I wonder if the flow could be made a bit more explicit by changing the interface of the layer (see the sketch after this list).

  6. The solver architecture could also benefit from a high-level explanation. I think ISolver only handles the learning rate, but this is not explicitly documented anywhere and needs to be guessed from the code.
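
To make point 5 a bit more concrete, the rough direction I have in mind is something like the following -- purely a hypothetical sketch, not a concrete proposal for Juice's types:

```rust
use std::collections::HashMap;

/// Hypothetical external context: intermediate data lives here, keyed by blob
/// name, instead of being held through internal references inside each layer.
struct Context {
    blobs: HashMap<String, Vec<f32>>,
    batch_size: usize,
}

/// A layer declares which blobs it reads and writes, so the data flow is
/// visible at the call site rather than buried in layer internals.
trait LayerOp {
    fn forward(&mut self, ctx: &mut Context, inputs: &[&str], outputs: &[&str]);
}
```

With buffers living in a shared context, the batch size could also become a property of the context rather than of each layer, which ties back to the mixed batch size problem from earlier in this thread.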

drahnr commented 2 years ago

> OK, I think the document is ready (same link as above). I would appreciate a review (the link gives read-only access; I can only share for commenting to a Gmail account, so IM me if you're interested).

> Some additional notes and thoughts that I had while working on the doc:

> 1. I'm sold on the idea of containers being layers too. Component-based system organization (and Juice layers are exactly the components with strictly defined flow between them) is a natural choice for building graph-like systems. Supporting _nested_ components (where a component can host child components) is very appealing for ML too, since one can train a net and then use it as a subnet in a larger net just by adding a single layer encapsulating the original net. However, I think the Juice _implementation_ of this paradigm has room for improvement:

:+1:

> * Component-based design requires that all components behave exactly the same when viewed from outside (this is the Liskov substitution principle in SOLID). For Juice, it means that all layer types can be used interchangeably. I don't think this is true now -- one can't use a non-container layer as a top-level one (because it won't create input blobs), and Sequential layers can't be used inside a Sequential layer.
>
> * Layer contracts are not well-defined (this is closely related to the bullet point above). Details about how the inputs/outputs and data blobs are created and connected are buried deep in the implementation, and it's hard to understand what is an upfront design decision and what has grown organically.

This is indeed true; there is only the vague Juice book.

> 2. The logic separation between the Layer and ILayer worker is confusing, and these two entities are _extremely_ coupled. I understand what it is trying to achieve -- concentrate the common logic between all non-container layers so it doesn't have to be repeated in every implementation. But the fact that Layer logic behaves radically differently for container vs non-container workers creates a bit of a split-brain situation where the reader needs to track both possibilities simultaneously when reading the code. This is aggravated by the fact that the distinction between container and all other layers is not always explicit in Layer code -- it all depends on the implementation of certain ILayer trait functions, so one needs to constantly check how they are implemented in different layers.

Happy to merge these, I think they are indeed an artifact of the past and were a very early impl detail.

> 3. As someone who's coming to Juice without a strong ML foundation, I would appreciate more explicit and rigorous terminology. It's hard to connect Juice variables like `input_gradient`, `output_gradient`, etc. to backpropagation definitions on the Web, especially given that Juice's approach is not exactly a canonical one (e.g., an activation function is not considered a "layer" in common literature). And more generally, external links to Wikipedia in Juice function comments are not very helpful, as Wikipedia articles are very general and do not help to understand that particular function (which is more about how this function fits together with other Juice pieces).

I had the same struggle when I started with leaf, as it was called then. The notion is far from obvious. I think adding documentation is easier than changing it consistently across the codebase for now.

> 4. Separation of logic between the layers and the backend seems ... unbalanced. This seems to be mostly driven by the functionality available in cuDNN -- so layers like ReLU, convolution, etc. are just thin glue code between Juice and cuDNN, while layers like NegativeLogLikelihood do all computation without using any of the backend facilities. Again, I understand _why_ this is so (we want to leverage accelerated computation if it's available), but from an architectural perspective, it's a bit odd that the interface between subsystems is driven by one of the implementations.

It is. The entirety of juice is built around the API of cuDNN, which is still the de facto standard lib. So this abstraction leaks the design of cuDNN, but since that's the only fully covering impl (native is mostly on par, with a few exceptions), this will only change once an OpenCL impl is complete.

> 5. Data flow is obscured since layers are holding references to data blobs internally. This is a perfectly reasonable technical implementation, but I wonder if the flow could be made a bit more explicit by changing the interface of the layer.

Which ones are you referring to? The inner `Rc`?

> 6. The solver architecture could also benefit from a high-level explanation. I _think_ `ISolver` only handles the learning rate, but this is not explicitly documented anywhere and needs to be guessed from the code.

The whole solver impl is pretty much trash and needs to be redesigned. It's very inflexible and feels like an add-on to the rest. If you have input on how to achieve that, happy to discuss. It's not trivial to align it with the rest of the API.

Thanks again for writing up your thoughts.

hweom commented 2 years ago

I just sent PR #159. As stated in the PR description, it is not intended for merging -- it's more of a suggestion on how to improve the layer framework and a conversation starter. I know we discussed that a short doc would be good before investing too much into code, but I found that actually writing the code is the best way for me to ensure my approach works and handles all cases. I'm not super attached to that PR either -- if we decide it's not the right way, I'm happy to drop it and work on another iteration :slightly_smiling_face:

The primary motivating factors for this design are:

  1. Allow mixing batch sizes at runtime (necessary for alternating learning/use).
  2. Support asynchronous multi-tasking. While the PR doesn't have anything that hints at that, I think the new architecture allows it in principle (when doing backpropagation, container layers can see which output gradients are available and only backprop the corresponding branches).

An interesting outcome of item 1 above is that my design makes batch size very explicit, and this uncovers some inconsistencies (at least I think they are inconsistencies) in layer structuring, like loss layers having outputs (see the comment in net::loss::NativeLogLikelihood::new()).
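
To show what I mean by "explicit batch size", the idea is roughly the following -- an illustrative sketch only, not the exact types from the PR:

```rust
// Illustrative only: the batch size becomes a runtime parameter of the
// forward pass instead of being baked into the net at construction time.
trait Net {
    fn forward(&mut self, input: &[f32], batch_size: usize) -> Vec<f32>;
}

// With that, alternating training and inference needs no batch faking:
// `net.forward(&minibatch, 32)` while learning and `net.forward(&observation, 1)`
// when picking an action.
```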

drahnr commented 2 years ago

Sorry for the delay, I should get around to it later today 🤞

hweom commented 2 years ago

Gentle ping on this one.

drahnr commented 2 years ago

Sorry for the long delay, it's been an eventful 3 weeks.

For me, I am already sold on just the fact that batch sizes can be adjusted as needed. This needs to be verified to be compatible with the current cuDNN API, but IIRC it should work with a few small tweaks.

I'd say let's continue the code-related discussions in the PR -> https://github.com/spearow/juice/pull/159