AnarchoSystems / DeepSwift


Sharing ideas about our work. #1

Open philipturner opened 2 years ago

philipturner commented 2 years ago

I saw your Swift Forums post, and thought I might move some discussion onto here. At the end, you said:

there's a good chance that I will continue the development until I can meaningfully evaluate if - using appropriate math frameworks - the approach can compare to torch and flow.

I think I might be able to give you some insight into this, if you would like to have a discussion. I'm quite interested in your motivation for taking on this project.

AnarchoSystems commented 2 years ago

Hey there,

I finally got a little time to read a bit into the libraries you mentioned in the forum, and I'm excited about the potential here :)

Regarding my personal motivation: basically, I wanted to put my M.Sc. in mathematics to use and get started with machine learning, but then I found out that torch and flow are basically the only (major) games in town, and they're both Python and Python is ... not my favorite language. Thus, I put some distance between me and the industry (right now, I'm working as a C++ backend dev, but nothing machine learning related) and started reading a lot in my free time where I have no economic pressure to "just write code that works" - the death sentence for any creativity that probably also killed Swift4TF.

I want to achieve more static features and an architecture where the developer understands where the graph actually lives. In particular, I have the ambition to decouple differentiable features from computational types. Tensors should not care about backpropagation and derivatives; layers using the tensors should. I think the current framework can achieve this.

Another interest of mine is creating an operator-centric DSL for neural networks. Swift, with its operator overloading capabilities, seems ideal for this. My private (soon-to-be open-sourced, but unfortunately very untested) repo can actually express:

struct ResNet : Learner {

   var body = SomeLayer() + Identity()

}

or

struct Attention : Learner {

   // where queries, keys and values are perceptrons and factor is a scalar;
   // note that in my framework, the training sequence is encoded as column vectors,
   // as I implemented col major matrices, so the below formula differs from the literature 
   var body = softMax((queries †* keys) / factor) *† values

}

This unlocks equational reasoning and a super modular design with minimal boilerplate.

Have you checked out how I do the chaining of layers? It's excitingly simple, and basically comes right out of Backprop as Functor:

public struct ChainedLayers<L1 : Layer, L2: Layer> : Layer where L1.Output == L2.Input {

    public typealias AuxiliaryData = (L1.Output?, L2.AuxiliaryData?, L1.AuxiliaryData?)
    public typealias Adjustment = (L1.Adjustment?, L2.Adjustment?)

    @usableFromInline
    var first : L1
    @usableFromInline
    var second : L2

    @inlinable
    public func inspectableApply(_ input: L1.Input) -> (result: L2.Output, auxiliaryData: AuxiliaryData) {
        // as far as I understand how memory works in swift, this should actually
        // pre-allocate the entire auxiliary data graph
        var aux : AuxiliaryData = (nil, nil, nil)
        let r2 : L2.Output
        (aux.0, aux.2) = first.inspectableApply(input)
        (r2, aux.1) = second.inspectableApply(aux.0.unsafelyUnwrapped)
        return (r2, aux)
    }

   // THIS (the below function) does all the heavy lifting for back propagation

    @inlinable
    public func adjustment(input: L1.Input, auxiliaryData: AuxiliaryData, gradient: L2.Output.Adjustment) -> (adjustment: Adjustment, backprop: L1.Input.Adjustment) {
      // pre-allocation
        var adj : Adjustment = (nil, nil)
        let b2 : L2.Input.Adjustment
        let b1 : L1.Input.Adjustment
      // call the appropriate methods of the layers in reverse order, back-propagating the gradient 
        (adj.1, b2) = second.adjustment(input: auxiliaryData.0.unsafelyUnwrapped, auxiliaryData: auxiliaryData.1.unsafelyUnwrapped, gradient: gradient)
        (adj.0, b1) = first.adjustment(input: input, auxiliaryData: auxiliaryData.2.unsafelyUnwrapped, gradient: b2)
        return (adj, b1)
    }

    @inlinable
    public mutating func move(_ adjustment: Adjustment) {
        first.move(adjustment.0.unsafelyUnwrapped)
        second.move(adjustment.1.unsafelyUnwrapped)
    }

   // ... optional methods are also implemented to improve efficiency whenever not all the data is needed

}

Here's where I want to go with this

I want to implement all the key features that make this into a usable library. In particular, I want to support all the popular activations, losses, parametric layers, optimizers and fancy training set-ups to prove that my architecture is flexible enough. I want to achieve support for CPU, GPU and maybe even distributed computation. Additionally, I want to provide control-flow-like layers that are easy to work with and fit nicely into the operator-based setting we have here.

Ultimately, I hope that with enough dedication and hard work, this library can achieve the same status for machine learning that https://elm-lang.org/ (the language that inspired React, SwiftUI and others) had for frontend development.

Here's where I need help

While I majored in mathematics, my experience is that mathematics is too complicated for just one brain, so I would be happy about help with (and reviews of) any layers that involve math.

If this is going to be a serious library, I will also need help finding good math libraries that can run on any platform without an overly complicated installation process for users of DeepSwift. Thanks for the tip about DL4S; I will take my time to carefully evaluate how to integrate it into my existing code base.

Finally, I need someone that helps me with further API design. I put a lot of emphasis on separation of concerns and static knowledge, but at the same time, these ambitions have been holding me back again and again for years now because these goals can contradict each other. An intellectual sparring partner (or even a team) could help me a lot.

Please don't treat this as "my" library

After all, I open sourced it so others could engage with it. I hope that over time, some enthusiasts (maybe you?) will come together and form a core team of equals. I'm a very consensus oriented person when it comes to development of a code base, even though I do have strong opinions and I'm not shy to express them.

So... what is your background and motivation to work with numerics/ML in Swift? :)

AnarchoSystems commented 2 years ago

Hey there,

In case you are willing to work with me, I would like your feedback on the remote computation model I just added. tl;dr: it's a sort of functor :D

Background: if one keeps orchestrating remote computation (or even just GPU computation) on a local CPU, one introduces a back-and-forth communication bottleneck. I would prefer to be able to "compile" an instruction set into a single command that I can just leave alone on the remote device until it's finished. The protocol enables that, even though it will be hard work to achieve this behavior in practice.

I do think that this might actually be workable, but I'm wondering how to inject the necessary capabilities into the layers' methods. Basically, each of the methods must be a "compilable" remote instruction for the same device type. The current Layer protocol and some others would basically become conveniences for a singleton CPU device type.
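To make that concrete, here is a rough sketch of what "each method is a compilable remote instruction for the same device type" could look like. The names RemoteDevice and CPU.Instruction and all the signatures below are my illustration, not the actual DeepSwift protocols:

protocol RemoteDevice {
    associatedtype Instruction
    func run(_ instruction: Instruction)
}

protocol RemoteComputation {
    associatedtype Device: RemoteDevice
    // Lower this computation into one instruction that can be left alone on the device.
    func compile() -> Device.Instruction
}

// The "singleton CPU device" case: an instruction is just a wrapped closure, and the
// existing Layer methods would become conveniences that compile to it.
struct CPU: RemoteDevice {
    struct Instruction { let body: () -> Void }
    func run(_ instruction: Instruction) { instruction.body() }
}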

I'm also wondering if my changes from last night were a good idea. I force models to have one scalar type only. This way, I was able to achieve global optimizers. Looks way cleaner and has more static information, but having only one scalar type might be too limiting.

I haven't really dealt with the question of how to inject a tensor math instruction set into remote computation, but I'll get some inspiration from DL4S and then come back to that question.

What's your opinion? Sorry if I'm moving too fast or seem pushy.

philipturner commented 2 years ago

Hi Markus,

Thanks for your response. I waited a day because I wanted to make sure I gave a response that was encouraging and respectful of all the hard work you put into this.

It's great to see that you have enthusiasm for this. I used to feel the same way about my own goal with S4TF. But my focus area isn't 100% aligned with yours. Compiler-integrated AutoDiff is something that's never been done before, and S4TF was originally created to showcase that. As a result, I likely won't be able to rewrite S4TF to depend on your library.

That's not to say I can't learn something from your work. I saw a striking similarity between your library's Movable protocol and the Swift Differentiable protocol. Examining your code might allow me to understand how the differentiation transform works in the Swift compiler. And there are very few software developers with both a strong affinity for Swift and a passion for numerical computing. The Swift numerical community is extremely small.

After all, I open sourced it so others could engage with it. I hope that over time, some enthusiasts (maybe you?) will come together and form a core team of equals. I'm a very consensus oriented person when it comes to development of a code base, even though I do have strong opinions and I'm not shy to express them.

If other people happen to see and utilize its potential, that's great. But speaking from my own experience, I would say to prioritize enjoying your work over anything else. Before I worked on ML, I had worked on AR for a very long time. I thought that the moment I released my research, everyone would see the potential and news would spread like wildfire. That unfortunately didn't happen, and I gave up to pursue a different field. I may be biased by experience, so perhaps your project may be different.

An intellectual sparring partner (or even a team) could help me a lot.

I do think I could hold conversations where we discuss different ideas. I have learned that being open to discourse and to people with a different opinion helps you grow as a software engineer. For example, I have been planning a GPU backend for an ML framework for months. While my original plans were for it to be general-purpose and usable by any frontend framework, I realized that idea is just not realistic. My goal has shifted to a more humble one of making two backends tightly integrated into S4TF using one common interface. My perspective seems to be opposite to yours, but I think your viewpoint is valuable and I would welcome a discussion about it.

Background: if one keeps orchestrating remote computation or even just GPU computation on a local CPU, one introduces a bottleneck of back and forth in communication. I would prefer if I could "compile" an instruction set into a single command that I can just leave alone on the remote device until it's finished.

That is something I have thought of a lot. There is a massive CPU-side bottleneck to encoding commands, and I have a whole host of stats regarding the cost of each Metal API call. The solution I found for reducing overhead is quite complex, but I see some similarities to your idea. Perhaps we could discuss that further.

AnarchoSystems commented 2 years ago

It's great to see that you have enthusiasm for this. I used to feel the same way about my own goal with S4TF. But my focus area isn't 100% aligned with yours. Compiler-integrated AutoDiff is something that's never been done before, and S4TF was originally created to showcase that. As a result, I likely won't be able to rewrite S4TF to depend on your library.

Don't get me wrong, I find the idea of compiler-integrated autodiff interesting. I just don't like getting the feeling that something can only be done with some magic that you yourself can't reproduce. I would hope that Swift's autodiff is a well-typed abstraction. I just hope that something valuable will come out of your work and that we can help each other with math and math-framework stuff.

I thought that the moment I released my research, everyone would see the potential and news would spread like wildfire. That unfortunately didn't happen, and I gave up to pursue a different field.

I've actually had my own share of failed experiments. My last experiment that I put a lot of effort into was RedCat, a frontend framework that tried to do things a bit differently, but largely similar to The Composable Architecture. While I did hope that it might get some attention, and I enjoyed working on it, I eventually dropped the ball after achieving some proofs of concept and then finding that I had produced yet another hammer that made everything look like a nail.

This might eventually still happen with DeepSwift, but I don't think I'll be able to rest before I have actually compared it to existing Python frameworks.

My goal has shifted to a more humble one of making two backends tightly integrated into S4TF using one common interface.

That still sounds very challenging, and this is hopefully an area that we can explore together. Can you point me to the places in your repos/forks that you are thinking about?

That is something I have thought of a lot. There is a massive CPU-side bottleneck to encoding commands, and I have a whole host of stats regarding the cost of each Metal API call.

Do you happen to know if and how torch and/or flow solved that? Do you have any stats on non-Apple hardware and APIs? Maybe we're tapping into something here...

The solution I found for reducing overhead is quite complex, but I see some similarities to your idea. Perhaps we could discuss that further.

I hope that my solution will end up being simple. After all, translating faithfully between languages is exactly what functors were made for. If only I find a way to encode exactly what I want without too much ceremony, it should just work by construction. But I already noticed that for actual compilation, my last commit doesn't quite do the job. The thing is: just because I can require everyone implementing my RemoteComputation protocol to provide a compile method doesn't mean that I can compose two remote computations. This occurred to me as soon as I thought about how to make my ChainedLayers conform to it. The big problem is that Swift lacks higher-kinded types, so I cannot ask people to provide a composition operator. It's impossible to write functor protocols, at least in a way that isn't totally esoteric.

I've been thinking about a workaround, and I think I'll just ask the device to bring some InstructionBuffer. I'll play around with that a bit.
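For illustration, a minimal sketch of what such an InstructionBuffer-based workaround could look like (these names and signatures are assumptions, not the committed DeepSwift API): instead of composing compiled results, every computation appends into a device-provided buffer, so chaining two computations is just two appends and no higher-kinded types are needed.

protocol InstructionBuffer {
    associatedtype Instruction
    mutating func append(_ instruction: Instruction)
}

protocol BufferedDevice {
    associatedtype Buffer: InstructionBuffer
    func makeBuffer() -> Buffer
    func execute(_ buffer: Buffer)
}

protocol BufferedComputation {
    associatedtype Device: BufferedDevice
    // Append this computation's instructions to an existing buffer.
    func encode(into buffer: inout Device.Buffer)
}

// Composition is now concrete: encode the first part, then the second, into the same buffer.
struct Chained<A: BufferedComputation, B: BufferedComputation>: BufferedComputation
where A.Device == B.Device {
    typealias Device = A.Device
    var first: A
    var second: B
    func encode(into buffer: inout A.Device.Buffer) {
        first.encode(into: &buffer)
        second.encode(into: &buffer)
    }
}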

AnarchoSystems commented 2 years ago

Here's the problem with instruction buffers (added last night): yes, it solves the problem of composability, but it comes at the cost of dynamism, and the whole solution, while conceptually simple, will lead to hard-to-understand and possibly boilerplate-heavy code for the individual layers. In the case of a CPU backend, it may not even be a speedup (unless I really go to the length of encoding each function as some set of CPU instructions, and even then it might not beat the Swift compiler because I couldn't do much in terms of optimization).

I think that instead of making each method of each relevant protocol generic over the device, I should provide a "GPUableLayer" and possibly a "RemotableLayer" protocol. In the case of Learners, it should be possible to provide default implementations for everything that is relevant.

philipturner commented 2 years ago

Your idea of an instruction buffer is similar to a MTLCommandBuffer or ID3D12CommandList. You want to encode commands into an object that can be executed asynchronously on a queue. Regarding optimization, it's mostly about the CPU overhead of creating a command buffer.

Creating a command buffer has a large overhead - at least 10 microseconds (us). Traditionally, you encode many commands into a command buffer before submitting. For example, you could encode an entire render pass or an entire frame's worth of work on a command buffer. But for machine learning, you can't hard-code a way to bunch up several commands into one command buffer. They execute as they come in, and you cannot predict what the next command will be. That means a simple backend would create a new command buffer for each operator.

A machine learning backend may need to execute operations with extremely high sequential throughput. When executing RNNs in eager mode, it might take less than 10 microseconds to execute each operator. If you instead executed that RNN on the CPU, it might be faster because there is almost no overhead. So I am trying to find a workaround that gives the GPU just as high sequential throughput as the CPU.

For simplicity, say that it takes 10 us to create a command buffer object, and 1 us to encode an operator into it. When creating a command buffer for each operator, that's a minimum of 11 us of overhead per operator. But if you encode 10 operators per command buffer, the average overhead reduces to 2 us. How might you get the ability to buffer up commands like that? Create a FIFO queue with a massive backlog of commands waiting to be executed.

This is the part that's really difficult conceptually. Say that you want to have around 30 operator calls cached. If you reach 30 pending commands, the CPU has to wait to submit the next command until that number drops to 29. Each time a command buffer finishes, the queue is notified and runs 10 of those commands on a GPU command buffer. The number drops to 20, and the CPU can encode 10 more commands.

But what is the right number of commands? If most of the 10 commands are for allocating memory, you might allocate more memory than the device can hold. Perhaps you could have subdivided it into two sets of 5 commands, letting the first 5 commands' memory deallocate before executing the next 5.

Also, when you drop the 10 commands at the front of the queue into a command buffer, when does that take place? It would take at least 20 us to perform that operation (10 us for creating the command buffer, 10 x 1 us for each command). Would it be better to encode the commands on a background thread, or would the overhead of communication between CPU threads dwarf that?
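Setting those open questions aside, here is a toy sketch of the basic backpressured FIFO queue described above. It is illustrative only: real command buffer creation and encoding are stubbed out as closures, the batch size and backlog limit are just the example numbers from above, and a real implementation would also need to flush partially filled batches.

import Foundation
import Dispatch

final class OperatorQueue {
    private let maxPending = 30                 // backlog limit before the CPU must wait
    private let batchSize = 10                  // operators per "command buffer"
    private var pending: [() -> Void] = []
    private let lock = NSLock()
    private let slots: DispatchSemaphore

    init() { slots = DispatchSemaphore(value: maxPending) }

    // Called by the frontend for each eager-mode operator.
    func submit(_ op: @escaping () -> Void) {
        slots.wait()                            // blocks once 30 ops are in flight
        lock.lock()
        pending.append(op)
        let shouldFlush = pending.count >= batchSize
        lock.unlock()
        if shouldFlush { flush() }
    }

    // Drain one batch into a single "command buffer" and run it asynchronously.
    private func flush() {
        lock.lock()
        let batch = Array(pending.prefix(batchSize))
        pending.removeFirst(batch.count)
        lock.unlock()
        DispatchQueue.global().async {
            for op in batch { op() }            // stand-in for encoding + committing
            for _ in batch { self.slots.signal() }
        }
    }
}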

There's also an issue of caching compiled shaders or MPSGraphs. Many ML commands will require using MPSGraph even in eager mode, yet it takes 100 us just to create an MPSGraph - much larger than the minimum of 1 us per command. You could pre-compile an MPSGraph, but I found in my profiling that encoding an MPSGraph takes around 15 us, even if it has just one command. If you have 2 commands, it takes 30 us, and the overhead rises linearly.

Edit (made long after creating this comment): To get an overhead even as low as 15 us, I think you have to compile the MPSGraph first. That action can take at least 400 us. But fortunately, I had a successful experiment in compiling several MPSGraphs on multiple CPU threads simultaneously. So divide the 400 us overhead by 8 power CPU cores, and it can lower to an amortized 50 us.

A potential solution is that not all operators need to happen on MPSGraph in eager mode. It's mostly convolutions and stuff like depthToSpace that can't happen in a trivial Metal shader. If most submitted operators are just elementwise stuff like adding two tensors, the average overhead per command will be far lower. Say that I have pre-cached the GPU shaders for elementwise operators. If around half of all the operators are simple ones, the average overhead per command is (1 us x 50%) + (15 us x 50%) = 8 us. That's less than the (10 + 8 = 18) us per command you would pay if you made one command buffer for each command, but much larger than 1 us per command. Stuff like this is why it's so complicated to set up the FIFO queue. What is the minimum possible overhead per command, and can it be low enough that a GPU becomes a drop-in replacement for running the model on the CPU in all reasonable scenarios?

AnarchoSystems commented 2 years ago

I think we should make an important distinction here: there are native, framework-provided command buffers that directly encode GPU operations, and then there are instruction buffers that one would have to implement in order to write a GPU backend for DeepSwift. The latter can of course encode whatever type of instruction we want, GPU or CPU.

Whenever we're not dealing with recurrent networks and control flow inside layers, the commands that you need to run for one layer actually are quite predictable. So it would be natural to have plain sequential computations encoded directly onto Metal command buffers, while anything that requires some CPU checking is encoded in another way. One would need to create the next command buffer whenever we can safely go back from CPU to GPU.

Of course, all of that assumes that the "predictable" layers are stacked deeply enough.

I actually think that it is algorithmically feasible to keep the parameters of the model in GPU memory all the time (provided we have enough memory there; I don't have enough experience to judge this), resulting in minimal data transfer overhead. I'm just not 100% sure if this also means that we can add the associated computations quickly to a command buffer, or maybe even reuse the same command buffers over and over again. Do Metal command buffers capture used memory blocks by reference? I would suppose that they have to, as capturing them by value would just be excessive, right?

To be totally honest, my experience with GPU computations is limited to applying some pre-existing filters to images and learning that stacking and then applying is faster than applying them one by one.

philipturner commented 2 years ago

I'm just not 100% sure if this also means that we can add the associated computations quickly to a command buffer or maybe even reuse the same command buffers over and over again.

You can't reuse a MTLCommandBuffer after it's submitted and creating MTLCommandBuffer objects beforehand doesn't seem to help. I tried a performance experiment where there's a background thread that pre-creates around 20 of those objects to potentially eliminate the 10 us overhead. But Metal has strange performance patterns when accessing objects created on another thread. The overhead jumps to something like 70 us! Also, there's still a minimum of 4 us overhead for using Swift Concurrency to fetch that command buffer.

Early on in my plans, I thought I could use a MTLIndirectCommandBuffer as a solution. It would minimize the cost of encoding commands, because commands are pre-encoded into it. But that has multiple problems. First, it's something that's encoded into a regular MTLCommandBuffer, so you still have to deal with that 10 us overhead. Second, only custom commands can be encoded; MPS and MPSGraph kernels can't be encoded into an indirect command buffer. Another idea was pre-compiling predicted command sequences into a MPSGraph. But as I explained above, that didn't turn out to work.

Of course, all of that assumes that the "predictable" layers are stacked deeply enough.

Let's say that there was some third alternative to MTLIndirectCommandBuffer or MPSGraph that let me efficiently pre-encode commands. That would reduce the per-operator overhead to 0 us, leaving the creation of the MTLCommandBuffer (10 us) as the only bottleneck. Predicting commands means caching sequences you have previously seen, and that seems to be extremely computationally complex.

Let's say that you have a repeating pattern of 7 commands in a tight loop, but after the 7 commands there's an extra one that changes on each iteration. You would need some way to detect that pattern. That would work if your search window was 8 commands wide. But what if you didn't know its period of repetition? If your search window was 5 or 10 commands wide, each time you encounter the pattern, it would start at a different place. Visually, it's like this:

search window = 8
1234567A
1234567B
1234567C
- Easy to algorithmically search if the first 7 elements are the same

search window = 10
1234567A12
34567B1234
567C123456
- There are 10 different positions where the pattern can repeat

If the search window doesn't fit, you might accidentally detect 10 replicas of the pattern - each one starting at a different place. Even if the 8th command didn't change at every iteration, that problem would still happen. But that problem could be solved if you just try again with different search window sizes until you detect a pattern.

But how would you know the largest search window to scan before giving up? There might also be really massive sequences - loops spanning 100 iterations. If you repeated your search 100 times, you might sometimes detect a 100-long sequence. But what if it was even longer? It would start getting out of hand (O(n^2)), and would only speed up performance for small enough loops.

There's also an approach where you give each 3-wide sequence of ops you encounter a unique hash. Cache that in a dictionary, and if you hit that hash enough times, decide to pre-compile it. But what should you set "enough times" to? And the dictionary could theoretically get as large as #OPS x #OPS x #OPS = O(n^3). Increasing the sequence size would increase the memory consumption of that dictionary exponentially. So you would need to keep it small. But at that point, what would be the benefits of pre-compiling something? Would it reduce the overhead of encoding to something significantly smaller than the 10 us of creating a command buffer?
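For concreteness, the bookkeeping side of that idea might look like the toy sketch below. The window size of 3 and the threshold value are arbitrary assumptions, and the actual pre-compilation step is exactly the hard, unanswered part.

struct SequenceCache {
    private var window: [String] = []           // sliding window of the last 3 op names
    private var hitCounts: [String: Int] = [:]  // sequence key -> times seen
    let precompileThreshold = 100               // "enough times" is an open tuning question

    // Returns true the moment a 3-op sequence becomes a pre-compilation candidate.
    mutating func record(op: String) -> Bool {
        window.append(op)
        if window.count > 3 { window.removeFirst() }
        guard window.count == 3 else { return false }
        let key = window.joined(separator: "|")
        hitCounts[key, default: 0] += 1
        return hitCounts[key] == precompileThreshold
    }
}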

Do metal command buffers capture used memory blocks by reference? I would suppose that they have to, as capturing them by value would just be excessive, right?

I also have very detailed plans about memory management, but it's best to talk about that later and discuss one thing at a time.

P.S. I don't know if I seem really fired up or dismissive. I just have so many ideas I thought of for so long and I'd like to put them out there for debate. I didn't read over this comment for phrasing issues because it's just so long.

AnarchoSystems commented 2 years ago

P.S. I don't know if I seem really fired up or dismissive. I just have so many ideas I thought of for so long and I'd like to put them out there for debate. I didn't read over this comment for phrasing issues because it's just so long.

You sound informative, and that's valuable ;)

Predicting commands means caching sequences you have previously seen, and it seems to be extremely computationally complex.

Let's say that you have a repeating pattern of 7 commands in a tight loop, but after the 7 commands there's an extra one that changes on each iteration. You would need some way to detect that pattern.

Static typing to the rescue!

What we really need here are control flow primitives so we can encode this information in our layer types. I already have a Many layer type that just repeats the same type of layer a fixed number of times - basically a for loop. There's also a Repeated type that applies the exact same layer multiple times in a weight-tying way. Obviously, I can also stack heterogenous layers on top of each other using my ChainedLayers type. In each case, we can encode the whole chain down to a single command buffer whenever we can do that for each participant in the chain. Basically, it's those chained layers' responsibility to decide if they can be added to an existing command buffer or if they need a new one.

The problems you mention would arise in DeepSwift if I were to introduce a While layer or an IfElse layer. Eventually, I'll have to do this in order to express the most general architectures. In this case, we will have to encode things into different command buffers. We will have to rely on the developers to understand this and limit the use of those constructs to situations where they are absolutely necessary.
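To sketch how that responsibility could be expressed in types (the protocol and property below are my illustration, not the actual DeepSwift API): layers advertise whether their command stream is predictable, fixed-count combinators inherit that property from their body, and data-dependent control flow opts out.

protocol GPUEncodableLayer {
    // True if the layer's forward pass can be appended to an existing command buffer
    // without any CPU-side control flow in between.
    var isPredictable: Bool { get }
}

// A fixed-count repetition (a "for loop" layer) is predictable exactly when its body is.
struct Many<Body: GPUEncodableLayer>: GPUEncodableLayer {
    var body: Body
    var count: Int
    var isPredictable: Bool { body.isPredictable }
}

// A data-dependent loop needs CPU checking between iterations, so it forces a new
// command buffer.
struct While<Body: GPUEncodableLayer>: GPUEncodableLayer {
    var body: Body
    var condition: () -> Bool
    var isPredictable: Bool { false }
}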

I can see that this would be way harder to achieve with differentiable programming embedded into the compiler.

You can't reuse a MTLCommandBuffer after it's submitted and creating MTLCommandBuffer objects beforehand doesn't seem to help.

Well, that's bad. Is that a framework/hardware-related issue, or is this generally the case? I still think that if we are to compare frameworks and not ecosystems, we should maybe find a way to run our frameworks on NVIDIA hardware.

philipturner commented 2 years ago

I can see that this would be way harder to achieve with differentiable programming embedded into the compiler.

I don’t see how this logically follows. One call to gradient in S4TF can trace all of the ops at a high level. That’s how lazy tensors and graph computations work. Also, note that all of my discussion above was just to optimize eager mode where your goal is to minimize startup overhead. If you want to know the commands ahead-of-time, that’s what major frameworks already do in graph mode.

Well, that's bad. Is that a framework/hardware related issue or is this generally the case? I still think that if we are to compare frameworks and not ecosystems, we should maybe find a way to run our frameworks on NVIDIAs.

If you’re looking for something platform-agnostic, DLPrimitives is a good math kernel library for OpenCL. It doesn’t do matrix multiplications quickly on Apple GPUs because of a fundamental hardware limitation, but that’s what Metal/MPS is for.

The cost of creating a command buffer is not something you can work around in Metal. I think OpenCL is known for being even worse. I am currently working on SwiftOpenCL so that I can use OpenCL for S4TF, but I still don’t understand how they encode commands. It may be an even slower model where each operator requires the same overhead as a command buffer.

AnarchoSystems commented 2 years ago

I can see that this would be way harder to achieve with differentiable programming embedded into the compiler.

I meant that it is harder to implement something at the compiler level than at a higher level, if the compiler doesn't already do this. But according to you, graph mode already does this particular thing. I did not know about graph mode, but that's basically what I am trying to achieve. I don't understand tf's "documentation" on tracing, though; the implicitness of the approach makes me worry that I'd trigger retracing all the time for no reason. My approach is explicit: you'd actually end up with a concrete "graph"-like object that won't ever retrace on its own. That means that you have to do it manually if you ever need to, but really that just means that you'll be aware of the process.

If you’re looking for something platform-agnostic, DLPrimitives is a good math kernel library for OpenCL.

Thanks for the tip! But yes, as you say, platform specific frameworks might give better performance. Which is why I'd probably be looking into CUDA as soon as I figure out how to enable graph mode with DeepSwift so I get a fair comparison.

Also, note that all of my discussion above was just to optimize eager[...]

What is eager mode even good for, except maybe debugging (where you can totally live with reduced performance)? And I'm not sure what optimizations you're talking about there; the only sensible optimization seems to be graph mode.

[...] mode where your goal is to minimize startup overhead.

Nahh, what I want is graph mode as I've just learned :)

I did find another weakness with my current approach to graph mode though. Everything is fine as long as I'm happy with out-of-place functions. Memory management is trivial and a functor will just do the job. But as soon as I want to do any in-place computations (which I do when updating my models), my approach just can't express that. I was thinking about different remedies, but most of them are just awkward. I guess I'll have to come up with an approach that isn't based on functors, but maybe closer related to computational graphs. I have some ideas here... stay tuned ;)

philipturner commented 2 years ago

And I'm not sure what optimizations you're talking about there, the only sensible optimization seems to be graph mode.

There is a lot of reason to optimize code outside of graph mode. For starters, only a small subset of the operators can actually run in graph mode. If you look at S4TF's XLA backend, it's less than 5% of all possible operators. So the performance characteristics of highly sequential, unpredictable execution will still be at play. Even if the graph is known beforehand, you still need to divide it into pieces around the operators that aren't supported by the graph. A fine-tuned FIFO queue would massively reduce any CPU-side bottlenecks no matter which mode you use.

Also, eager mode is both more ergonomic and the default mode. I still don't know whether I've actually used S4TF's graph mode in any of my experiments. The end user of an API is often not going to follow instructions, and unfortunately it's our job to compensate for that. I think that rule of thumb really drove Swift-Colab to be the user-focused experience it is now. So making eager mode as fast as possible is a priority for my framework.

Which is why I'd probably be looking into CUDA as soon as I figure out how to enable graph mode with DeepSwift so I get a fair comparison.

I don't mean to be discouraging, but learning a GPGPU API takes a tremendous amount of time. Even worse, the skills don't necessarily transfer over between APIs. I spent well over a year before becoming proficient at Metal. And my SwiftOpenCL project's true purpose is to familiarize myself with OpenCL. From my experience, going with OpenCL and DLPrimitives from the start would be the most realistic option.

Memory management is trivial and a functor will just do the job.

Having a highly performant memory allocator is mostly up to the backend, rather than the frontend. Instead of ARC semantics, the main concern is reusing existing allocations and not overwriting memory twice. I have quite elaborate plans for how to optimize that, and I am willing to share.

AnarchoSystems commented 2 years ago

Even if the graph is known beforehand, you still need to divide it into pieces around the operators that aren't supported by the graph.

As far as I understand it, tf's graph will just ignore everything it doesn't know. This sucks. I'd rather live in a world in which using instructions that the backend doesn't know causes a compile error - again with the caveat that by "backend", I don't necessarily mean purely the GPU, but also some CPU orchestration primitives that may or may not be necessary.

I still don't know whether I've actually used S4TF's graph mode in any of my experiments.

Yeah, that's what I thought after reading the "documentation"...

I don't mean to be discouraging, but learning a GPGPU API takes a tremendous amount of time. Even worse, the skills don't necessarily transfer over between APIs. I spent well over a year before becoming proficient at Metal. And my SwiftOpenCL project's true purpose is to familiarize myself with OpenCL. From my experience, going with OpenCL and DLPrimitives from the start would be the most realistic option.

Makes sense. In the long run, I do want to compare apples to apples, i.e., I'll have to run the same network using DeepSwift and using torch/tf on the same hardware and with the same training loop. But sure, I first have to get anything working at all...

Having a highly performant memory allocator is mostly up to the backend, rather than the frontend. Instead of ARC semantics, the main concern is reusing existing allocations and not overwriting memory twice.

And that's where imperative code makes life 1. more efficient but 2. waaaay more difficult :( With my functor approach, I could just deallocate the remote memory for the input of a function after the function has executed, leaving the caller with a single remote memory pointer whose deallocation could even be abstracted away. But that would be highly wasteful.

I have quite elaborate plans for how to optimize that, and I am willing to share.

Please do!

As for my own ideas:

I'm thinking about a Symbol<Device, T> type that holds a reference to an instance of Device's memory and that is bound to some type T (that is mostly there for keeping track of computations, but can possibly be mapped to a local type). The device would provide an allocate and deallocate method. Symbol itself will likely be a class type so I can express computations via sub-classes that capture immutable symbols and decorate mutable symbols while everything would retain static type Symbol<Device, T>. The dynamic type would express the compute graph. Symbols would be the only thing in the code base that have an encode(to instructionBuffer:) method that would make heavy use of decorators.

While this approach may lead to highly readable code, memory management remains awkward. The problem is that just because class-bound variables leave your scope, that doesn't mean they get deallocated. The entire point of using decorators is to capture those variables into the compute graph!

Currently, I can think of three ways to solve this: 1. manually allocate and deallocate the local variables at the beginning and end of each function. 2. have each symbol declare an array of variables that shall be allocated and deallocated. 3. use property wrappers and reflection to dynamically create that array at runtime.

I think I'll go with the third approach. I'm eager to see what pitfalls I'll run into there :D What could possibly go wrong? :D
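A minimal sketch of the Symbol<Device, T> idea, to make the decorator part concrete; all names and signatures here are illustrative assumptions, and the "instruction buffer" is just an array of strings standing in for real encoded instructions:

protocol ComputeDevice {
    associatedtype Memory
    func allocate(bytes: Int) -> Memory
    func deallocate(_ memory: Memory)
}

// A reference to device memory, tagged with the logical element type T. Subclasses
// (decorators) represent computations and capture their operands, so the dynamic type
// of a Symbol encodes the compute graph while the static type stays Symbol<Device, T>.
class Symbol<Device: ComputeDevice, T> {
    let memory: Device.Memory
    init(memory: Device.Memory) { self.memory = memory }

    // Encode the instructions needed to materialize this symbol.
    func encode(to instructionBuffer: inout [String]) {
        instructionBuffer.append("load \(memory)")
    }
}

// Example decorator: an elementwise sum that captures its two operands.
final class Sum<Device: ComputeDevice, T>: Symbol<Device, T> {
    let lhs: Symbol<Device, T>
    let rhs: Symbol<Device, T>
    init(memory: Device.Memory, lhs: Symbol<Device, T>, rhs: Symbol<Device, T>) {
        self.lhs = lhs
        self.rhs = rhs
        super.init(memory: memory)
    }
    override func encode(to instructionBuffer: inout [String]) {
        lhs.encode(to: &instructionBuffer)
        rhs.encode(to: &instructionBuffer)
        instructionBuffer.append("add -> \(memory)")
    }
}

The deallocation problem described above is visible here: lhs and rhs stay alive as long as the Sum node does, which is exactly why captured symbols can't be freed just because they left the caller's scope.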

AnarchoSystems commented 2 years ago

Ah, and by "allocating" and "deallocating", I of course mean encoding the command to allocate or deallocate. Or actually just doing this stuff when in eager mode.

The cool thing with this approach is that it can automagically run in eager mode and in graph mode, but you'll statically know which one it is, because you actively compile the model. And it'll read almost naturally. At least I hope that I can satisfy all these :D I mean, DL4S has provided some good things to build on for that.

philipturner commented 2 years ago

I was talking mostly about the lower-level aspects of optimizing memory management. In CUDA, you can often allocate GPU memory with zero cost. Same on the CPU using malloc, if the allocated area was previously allocated by your program. But Metal uses virtual memory: every time you allocate it, the memory must be overwritten with zeros, even if you want to fill it with a different value. There is still a way to allocate memory with an amortized zero overhead, though.

Metal has something called a MTLHeap. You allocate it once as a large contiguous chunk of virtual memory, then sub-allocate buffers without any cost. But they are more vulnerable to fragmentation than if you could allocate from device memory like with malloc.

If you allocate all of the computer's memory into one MTLHeap, the machine will stop functioning. CPU processes need that memory on integrated GPUs, and other graphics processes need that memory on discrete GPUs. There is an option in CUDA to pre-allocate the entire GPU's memory for sub-allocation, but there is no such parallel in Metal. My solution is to have an exponentially growing list of heaps, accepting the fact that fragmentation may make it more wasteful.

To start off, you might allocate a heap worth 1 MB. Every time you need a tensor, you try creating a buffer off of that heap. When the tensor is deallocated by ARC, the memory becomes free for use again. And if the heap is currently full, you then allocate a new heap worth 2 MB. If that doesn't work, keep going with 4 MB, 8 MB, etc. If you run out of device memory, the program crashes. To minimize fragmentation, every time you need a new tensor, try allocating it on the smallest heap possible.
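A minimal sketch of that growth strategy using Metal's MTLHeap API (the class name and the "try the smallest heap first" policy are illustrative; a real backend would also need hazard tracking, alignment handling, and thread safety):

import Metal

final class GrowingHeapAllocator {
    private let device: MTLDevice
    private var heaps: [MTLHeap] = []
    private var nextHeapSize = 1 << 20          // start with a 1 MB heap

    init(device: MTLDevice) {
        self.device = device
    }

    // Sub-allocate from the smallest heap that fits; grow the heap list if none does.
    func makeBuffer(length: Int) -> MTLBuffer? {
        for heap in heaps.sorted(by: { $0.size < $1.size }) {
            if heap.maxAvailableSize(alignment: 256) >= length,
               let buffer = heap.makeBuffer(length: length, options: .storageModePrivate) {
                return buffer
            }
        }
        // No existing heap had room: double the heap size until the request fits
        // (ignoring heap alignment overhead for the sake of the sketch).
        while nextHeapSize < length { nextHeapSize *= 2 }
        let descriptor = MTLHeapDescriptor()
        descriptor.size = nextHeapSize
        descriptor.storageMode = .private
        guard let heap = device.makeHeap(descriptor: descriptor) else { return nil }
        heaps.append(heap)
        nextHeapSize *= 2
        return heap.makeBuffer(length: length, options: .storageModePrivate)
    }
}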

I don't think OpenCL has anything like Metal heaps, so that API will just have to deal with the unnecessary overwriting of memory upon every allocation. The cost of this might be mitigated in graph mode.

AnarchoSystems commented 2 years ago

I was talking mostly about the lower-level aspects for optimizing memory management. In CUDA, you can often allocate GPU memory with zero cost.

That does sound to me like a very strong argument to just use CUDA, even if this might be conceptually even harder to understand. Sounds better to me than trying to work around such strange limitations.

My solution is to have an exponentially growing list of heaps, accepting the fact that fragmentation may make it more wasteful.

Couldn't you copy the existing buffers over to the new heap to avoid fragmentation, and delete the existing heap? Sure, this would incur one-time extra work for copying, but if you keep track of the largest buffer you have needed to allocate so far, you can even deallocate the entire heap when it is no longer needed and reallocate it with this huge size. You could even create local "compute pipeline" objects in order to track the maximum memory consumption of individual sub-processes, so you only allocate huge heaps if you run computations where you empirically know that they need that.

Meh, my first shot at the envisioned Symbol API appears to be a mess :/ I've put it in a branch and pushed it in case you want to comment on it, but right now, I think there are serious conceptual issues and I need something entirely different.

philipturner commented 2 years ago

That does sound to me like a very strong argument to just use CUDA, even if this might be conceptually even harder to understand. Sounds better to me than trying to work around such strange limitations.

The only time CUDA has zero-cost memory should be if you pre-allocate the entire GPU's memory. This is just like if you allocated the entire GPU's memory into one heap in Metal, so Metal is not much different than CUDA. When using CUDA for TensorFlow, they also have an option to allow memory growth. I think it's similar to my idea with an exponentially growing series of MTLHeap's.

Couldn't you copy the existing buffers over to the new heap to avoid fragmentation and delete the existing heap?

At that point, it might require an extremely elaborate algorithm with tons of heuristics and investment in gathering data to fine-tune. As with trying to predict which operators the backend will produce (https://github.com/AnarchoSystems/DeepSwift/issues/1#issuecomment-1128842665), it may be computationally infeasible. But maybe you could look into the lower-level aspects of established memory management systems to see whether they implement that idea. Or you could train a neural network to predict that, but ML takes an extreme amount of time and data to develop.

I've put it in a branch and pushed it in case you want to comment on it, but right now, I think there are serious conceptual issues and I need something entirely different.

I'm not likely to comment on the branch because my main focus is on my own project. Mostly, the conceptual issues should be something universal. If a solution exists, TF and Torch have likely thought of it and implemented it. It might be best to look at how you'll implement it in your backend first and foremost. After that, you could shape your frontend to reflect your backend.

AnarchoSystems commented 2 years ago

I think I'm finally starting to ask the right questions; I just came across this. Maybe this will get me going with graph mode... The whole memory management thing seems to be the main problem here.

AnarchoSystems commented 2 years ago

reads Damn, this is one more instance where variadic generic types would come in REALLY handy...