This is not an area I know much about, but I'd love to know more about the approach. @eduardoleao052, is this something you're thinking of using a dependency for, or building from scratch specifically for matrix operations?
@medic-code After completing the TypeScript support, I'll study the best option for adding GPU support. This means the heavy matrix multiplications can be run on a GPU, speeding up the ML models.
If I may be so bold, I would suggest the following approach. I'm wondering if we can outdo Python if we build from the ground up with a focus on GPU integration and performance. My understanding is that Torch and Python mostly map operations to corresponding routines (kernels) on the GPU, so in theory, if we take a ground-up performance approach, I would like to think we can outperform classical Torch.
We add decorations inside the Torch-ish objects that allow a visitor object to assemble an entire compute pipeline, kind of a compute abstract syntax tree of what is to be executed that we can map data into.
The pipeline needs to be processable as a whole, or chunked along I/O- and compute-friendly boundaries, to allow us to optimise when running with different GPU availability.
If the whole pipeline will fit in a big fat GPU then we want it all in there. GPUs are getting bigger and fatter. Let's design for that. If we need to break it up into discrete steps with a small or multiple small GPUs, then we need to be able to do that as well.
```
┌────────────┐    ┌────────────┐    ┌──────────┐    ┌──────────────┐    ┌─────────────┐
│  Torchish  │    │  Pipeline  │    │ Pipeline │    │ Orchestrator │    │     GPU     │
│  Objects   ├───►│  Assembler ├───►│   AST    ├───►│              ├───►│   Gateway   │
└────────────┘    └────────────┘    └──────────┘    └──────────────┘    └─────────────┘
```

- Torchish Objects: the executing objects.
- Pipeline Assembler: takes the planned execution and turns it into the Pipeline AST.
- Pipeline AST: a representation of what needs to be computed.
- Orchestrator: takes the pipeline and determines how it will be executed against the GPU.
- GPU Gateway: takes chunks of the AST and data and executes them on the GPU.
Make use of ReadableStream
As a kind of proof of concept, and possibly even the solution, for version 0 we take the AST for the pipeline and build a function that can be passed into https://github.com/gpujs/gpu.js. Let it do the heavy lifting for now. If we need to do something else in the future, we can replace this step with something else and keep the rest of the design and implementation.
@metawrap-dev I like the spirit, and you clearly have a great understanding of hardware manipulation in pipelines. I think we could make it work, and adding at least gpu.js is a clear first step in my opinion. If simple GPU acceleration is put in place, it could already allow a much wider range of applications for the library.
Thanks!
To give a toy example: `add(mult(w, x), mult(y, z))`

We really want to build an object tree that represents the promise to execute this.

That way, instead of passing `(w * x)` and then `(y * z)` to the GPU, waiting for the results of both, and then passing `r1 + r2`, we could more efficiently pass `(w * x) + (y * z)` in one go and get one value back.
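To make the idea concrete, here is a minimal sketch of such a deferred expression tree in TypeScript. All names (`Expr`, `resolve`, the factory helpers) are hypothetical and not part of any existing library; a real GPU backend would transpile the tree rather than walk it on the CPU as this fallback does.

```typescript
// Hypothetical sketch: a deferred expression tree for add(mult(w, x), mult(y, z)).
// Nothing executes until resolve() walks the tree, so the whole expression
// could instead be handed to a GPU backend in one piece.
type Expr =
  | { kind: 'value'; value: number }
  | { kind: 'add' | 'mult'; left: Expr; right: Expr }

const val = (value: number): Expr => ({ kind: 'value', value })
const add = (left: Expr, right: Expr): Expr => ({ kind: 'add', left, right })
const mult = (left: Expr, right: Expr): Expr => ({ kind: 'mult', left, right })

// CPU fallback resolver; a GPU visitor would transpile the tree instead.
function resolve(e: Expr): number {
  switch (e.kind) {
    case 'value': return e.value
    case 'add': return resolve(e.left) + resolve(e.right)
    case 'mult': return resolve(e.left) * resolve(e.right)
  }
}

const tree = add(mult(val(2), val(3)), mult(val(4), val(5)))
// resolve(tree) === 26
```

The whole tree exists before anything executes, so a visitor can inspect it, chunk it, or pass it to gpu.js as one unit.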
So we go from

```mermaid
sequenceDiagram
    participant JavaScript
    participant GPU
    JavaScript->>GPU: w * x
    GPU->>JavaScript: r1
    JavaScript->>GPU: y * z
    GPU->>JavaScript: r2
    JavaScript->>GPU: r1 + r2
    GPU->>JavaScript: a
```

to

```mermaid
sequenceDiagram
    participant JavaScript
    participant GPU
    JavaScript->>GPU: (w * x) + (y * z)
    GPU->>JavaScript: a
```
This will work efficiently when we pass a torch-ish object into another's method as a parameter. It won't help much if we pass in just a scalar/primitive, unless we make a special class, e.g. a Number class, that would allow a whole tree to be linked. We would, of course, make it only as granular as is useful (there's no point in building the internal AST of a matrix operation).
If the library is skewed towards being able to build the largest connected Pipeline tree, then we have the ability to pass the whole intention into the GPU and optimizers and have them work their magic.
Cool! I could see that working, building upon the `Tensor` and `Module` objects. The `Module.forward()` method is always spelled out before the model is actually run. Then we could build this "graph" with all the operations, and send them in parallel to the GPUs!
I'm currently finishing adding TypeScript support to the library, and will push it to main soon (currently it's in the develop branch).
Would you by any chance be willing to work on a PR, so that we can try to implement the GPU.js support? I'd love to take on that challenge!
Sure! Once you complete the TypeScript support. I do a lot of prototyping in TypeScript, so it's my happy place.
In the meantime, I will look into implementing the toy example above and try out some design ideas.
These two features in gpu.js are likely going to be load-bearing for us.
https://github.com/gpujs/gpu.js/blob/develop/README.md#combining-kernels https://github.com/gpujs/gpu.js/blob/develop/README.md#create-kernel-map
Some notes here for things I want to investigate.
I seem to have proof of concept for item 1 on my list.
I can build a toy pipeline graph, display it unresolved and resolve it. I will put this up in a repository tomorrow.
Tomorrow, let's see if I can pass some of it into gpu.js.
That's awesome, looks like a great start! Btw I'll merge the TypeScript branch into main today, I'll let you know.
@metawrap-dev Merged the TS into the main branch, but there are still some mistakes, I'll solve them today.
Thanks. I will take a look. If you have any typescript questions I'm happy to help there as well.
I'm going to continue in a different repository until I have the major questions I outlined above answered.
Today I'm going to tackle the streaming and batching big-data question. Big as in more than can fit into memory. Goal (i) will be to have a toy version of a model load from disk, with chained execution operations, and then output to disk.

I want to get to the point where execution is triggered only when we want to save the result. So, complete lazy loading and execution, with the ability to inspect the pipeline graph.

After that, goal (ii) will be to have a combination of granularities, so some operations will be handled by a prepositioned kernel and some will be weaved together with permutations in code. Those will be handled by less granular code generation.
At that point it will be possible to create a visitor class that can analyse and chop/chunk and transpile the pipeline graph.
I can probably only get some of (i) done today given the time I have as just coding up the testing and plumbing is going to be a majority of the work.
Cool! If you want any help at some point, just let me know. About the TypeScript, the only issues left to solve are a couple of `any` types and unused variables. Just a matter of keeping the code cleaner.
If you can avoid `any`, it would be a Good Thing™️.

If it's a case where you don't know the type yet in the flow of execution and it will be resolved later on, then `unknown` is preferred over `any`.

If it's just a case of trying to get things to match and `any` is used to get past that, then I can definitely help with that.
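To illustrate the distinction, a small sketch (the helper names are made up for the example):

```typescript
// Sketch: `any` silently disables type-checking, `unknown` forces a check first.
function parseAny(json: string): any {
  return JSON.parse(json)
}

function parseUnknown(json: string): unknown {
  return JSON.parse(json)
}

const a = parseAny('{"n": 1}')
a.n.toFixed(2) // compiles even if the shape were wrong: no safety

const u = parseUnknown('{"n": 1}')
// u.n  // compile error: 'u' is of type 'unknown'

// With `unknown`, narrowing is deliberate rather than accidental.
function readN(value: unknown): number {
  if (typeof value === 'object' && value !== null && 'n' in value) {
    return (value as { n: number }).n
  }
  throw new Error('unexpected shape')
}

const n = readN(parseUnknown('{"n": 1}'))
// n === 1
```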
Here is my work so far.
I'm currently removing all the `any`s, don't worry!

Most can be corrected simply; if there's some syntax on a specific part that I just don't know, I'll let you know. Thanks for offering the help! I took a quick look over the code so far, really interesting stuff. I'll actually read it tomorrow, but from what I understand you're implementing a Number class.
Will this in the future compose a Tensor class for tensor operations?
The DataNumber class is just a start and just a minimalist toy model. Yes, higher order objects like tensors will be modelled the same way. I'm just starting very small as there is a lot of type semantics to work out to make everything composable before I move onto something more complex.
I've updated the repository. I've also implemented some more complex composition involving sources. However, I'm still only using primitive number types and the multiply operation for now. My goal is to get the streaming working before I start doing anything more complex or more granular.
In-memory `Source` and `Destination` elements were added. I haven't yet tackled I/O and streaming, but the functionality I have so far models a lot of the problem and solution in memory.
E.g., in one test case:

```
{DestinationMemory(0 stored, 1 in buffer, 5 batch size) <= [( multiply {SourceMemory(4 elements, 0 index, 1 batch size) <= [10,10,10,10]} => unresolved )]=>[]}
```

resolves to

```
{DestinationMemory(1 stored, 0 in buffer, 5 batch size) <= []=>[10000]}
```

This is, in effect, loading, computing, and saving the result.
We get to see the pipeline graph before it is resolved.
I'm not sure if I should strictly type these or use the "sausage machine principle" and assemble the correct type from arbitrary length and format input arrays.
`resolve()` is going to be what invokes any transpilation. At the moment I am just executing everything on the CPU.
Optimal batch sizes can be discovered on the fly. At the moment, they are there to force the pipeline to execute when data limits are reached, to emulate a shadow of some kind of real-world behaviour. I'm thinking of some back-pressure signalling coming up from each kernel and from the GPU in general.
When complete and if it can be demonstrated to be usefully performant, the goal will be to take the learnings from my repository and transfer them, in style, to yours.
If it all plays out well, we should have something that could, in theory, outperform Python, or at least be within striking distance.
The next step is creating some actual disk I/O-driven Source and Destination implementations.
After that, some matrix operations. One could be implemented as a granular kernel, and the other could be built up with primitive operations. That should allow some serious testing with code generation.
I'm super busy until the end of the week, but will see what I can do in between things.
Gathering some more thoughts for (i)
I'm close to being able to serialise and deserialise the execution state. `Source` and `Destination` batch themselves now. What is missing is generalising this to the `Compute` elements. I'm going to tease out the internal run-state of each element into something common that can be marshalled.
What will that get us?
I'm also considering adding an unrolling strategy as the same kind of parameter as batch size. With this, the system that is trying to optimise the pipeline graph will also be able to experiment with unrolling.
And that implies that metrics are going to be needed.
The optimising system observing the execution needs to be able to get metrics at each point in the pipeline and maintain some previous data points to make a comparison. Enough to give it the ability to converge, but not too much that it overwhelms the task at hand.
A `Strategy` object? Something that can be introspected and modified, even just using dumb random search. The element can take a strategy and apply it to `State`. `Strategy`, or a separate `Config` for this? State = Runtime + Config?

Implemented `Strategy`, `Config`, and `State` across all the elements as described above. `Strategy` needs to be put into practice, so what is there is a placeholder for now.
So close to 100% :)
Tentative flow control, so that asynchronous operations will block until data becomes available.
It is time to add more complex `Data` and `Compute` elements to refine the design, as single-number flow seems done. I've exhausted what I can do with these now.
The goal is to make sure that execution only occurs after resolve, but there are situations where populating `Source` or `Destination` will start to trigger flows.
The design looks like it is holding together and will do what I want. More complex objects should now shake out any bugs or design issues in the way I want data to flow and obey batch/chunk sizes. It should allow me to test the roll/unroll strategy ideas, as I need a `Compute` element with at least a few layers of execution to support that.
This is really solid groundwork. I've been looking through the repo, and it's impressive stuff. I think adding more complex objects is the next big step. If you need any help, let me know!
Thanks.
Yes, now that I have the flow and semantics thought out and lazy execution working, more complex objects are now needed.
Just thinking some things out loud again here.
Code-Gen/Optimizer:

- `Strategy`: at a point that maximises throughput. Treat the GPU as I/O.
- `Compute` elements will have control over unrolling based on `Strategy`. The built pipeline graph will depend on `Strategy`.
- `Destination`? Reset vs revisit `resolve()`/resolved semantics (self-clocking on batch). We may need to mock behaviour for now, but it is important that this is designed in at ground level now. Two modes: complete reset + cumulative.
- Mock `Compute` and `Data` elements.
This should give enough. The plan for today is just to get the new mock data elements in. Probably all I have time for.
I suspect the design won't survive the first data element without changes, but here goes. : )
DotProduct4
A lot to unpack here.
Great success... but...

From a flow aspect, there are several ways to compute dotN. Not all of them are obviously nonsense. This is also going to be the same for all operations.
This adds a lot of fragile complexity in having generalized arguments.
I'm wondering if I can create IStreamify
Time to dive deep into some of the latest typescript features.
The plan =>

- `ISourceable`: takes a `Compute` element and adapts it for streaming inputs in from a source. May require TS type magic or specialised classes for each element.
- `IDestinationable`: takes a `Compute` element and adapts it for streaming outputs out to a destination.

I can probably create some generalised versions of these and, worst case, we end up with a pair per `Compute` element, but it will move the complexity of the different streaming styles into these distinct adapter objects. (A Good Thing™️)

Same approach for `IPersistent` for creating persistent kernels. Same as above, but for the whole pipeline graph.
I tried the above. It works. I ended up with the more sensibly named `IStreamer`, as we don't seem to need it for `Destination`, and what it in effect does is stream the data in.

The new method simplifies the code because `Compute` elements now don't need to be aware of how to batch and don't need to deal with argument overloading with the `ISource`.
```typescript
const streamer = new Streamer<v4, number>(
  new SourceMemory<v4>([1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]),
  new ComputeLength4<v4>())
```
Creates a pipeline graph of

```
{Streamer{SourceMemory(4 elements, atoms 4, 0 index, 1 batch size)
<= [[1,1,1,1],[1,1,1,1],[1,1,1,1],[1,1,1,1]]}
=> {ComputeLength4{SourceMemory(0 elements, atoms 0, 0 index, 1 batch size)
<= []}=>unresolved}}
```

which resolves to

```
{Streamer{SourceMemory(4 elements, atoms 0, 4 index, 1 batch size)
<= [[1,1,1,1],[1,1,1,1],[1,1,1,1],[1,1,1,1]]}
=> {ComputeLength4{SourceMemory(4 elements, atoms 0, 4 index, 1 batch size)
<= [[1,1,1,1],[1,1,1,1],[1,1,1,1],[1,1,1,1]]}=>{DataNumber
<= 2}}}
```
The original alternative is

```typescript
const compute = new ComputeLength4<v4>(new SourceMemory<v4>([1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]))
```

where we pass in the source just like any other parameter, and we would get something like this for multiply:
```
(multiply{SourceMemory(3 elements, atoms 3, 0 index, 1 batch size)
<= [{DataNumber <= 10},{DataNumber <= 10},(multiply{SourceMemory(2 elements, atoms 2, 0 index, 1 batch size)
<= [{DataNumber <= 10},{DataNumber <= 10}]}=>unresolved)]}
=> unresolved)
```

...which resolves to

```
(multiply{SourceMemory(3 elements, atoms 0, 3 index, 1 batch size)
<= [{DataNumber <= 10},{DataNumber <= 10},(multiply{SourceMemory(2 elements, atoms 0, 2 index, 1 batch size)
<= [{DataNumber <= 10},{DataNumber <= 10}]}=>{DataNumber <= 100})]}
=> {DataNumber <= 10000})
```
The advantage of either approach may only become clear when it comes to code generation. So I will sit on it for a while as I try more complex `Data` and `Compute` element combinations, and then make a decision when I move to code generation.
TypeScript's type system is doing some heavy lifting here.
```typescript
type v4 = Vector<number, 4> | DataVectorN | (number | IData<number>)[]

const source = new SourceMemory([1, 1, 1, 1], new DataVectorN([1, 1, 1, 1]), [1, new DataNumber(1), 1, 1], [1, 1, 1, 1])
const compute = new ComputeLength4()
const streamer = new Streamer<v4, number>(source, compute)
```
Some more thoughts.
Streaming in a way that engages elegantly with TypeScript's type system but remains performant: there are a few ways to go.
The main issue is argument cardinality and the marshalling stage into a `Compute` element with different argument sizes. Either the `Streamer` needs to know how many arguments it is dealing with, or the `Compute` element needs to know.
Source(2)=>Streamer(2)=>Compute(2)
`ICompute<I, N, O>`: this needs to be expressed in the type; the type system will flag a mismatch, so N needs to propagate properly. It will be used to drive `Vector`.
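As a sketch of how N could propagate, a tuple-based `Vector<T, N>` makes the argument count part of the type, so a cardinality mismatch fails at compile time. This is a hypothetical shape, not the repo's actual definitions:

```typescript
// Build a tuple of T with exactly N entries, so length is checked statically.
type Vector<T, N extends number, R extends T[] = []> =
  R['length'] extends N ? R : Vector<T, N, [T, ...R]>

// N flows from the interface into the argument tuple.
interface ICompute<I, N extends number, O> {
  evaluate(args: Vector<I, N>): O
}

// A 2-argument compute element; passing 3 arguments is a compile error.
const multiply: ICompute<number, 2, number> = {
  evaluate: ([a, b]) => a * b,
}

const r = multiply.evaluate([6, 7]) // OK: Vector<number, 2>
// multiply.evaluate([1, 2, 3])     // error: tuple length mismatch
```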
The code is a bit of a mess right now because I have a few ideas in play simultaneously, and not all of them will be the winner.
- Streamer vs no Streamer: `Compute` input becomes `Vector` vs `Source`.
- Sausage machine vs structured. The sausage machine (or some variant) may actually be the best way forward and the most flexible.
This is all about being able to chunk up code and data into big blobs that we can blat into the GPU without a lot of back and forth. Ultimately all the design decisions are driven by that goal.
@metawrap-dev It's really fascinating to see your thought process. I've just been wondering if you had any ideas about a better way to document this than an open issue?

For example, maybe an open PR with a readme.md, or a wiki page, or something? It's okay if you feel this way of documenting your thoughts is fine. I just imagine that after a while you might want to collect your thoughts in a summary with the rationale so far. I realise that this is still part of discovery as you go, and you may want to postpone documenting in a more formalised way. Just some thoughts.
Yep, eventually. At the moment, I'm still experimenting. I assumed here was good because others could chip in. It just kind of happened organically: I write down most of these thoughts while mobile and then refer back to them as I work, so it's very convenient for me.
I will gather my thoughts elsewhere.
I'm definitely not suggesting you should gather your thoughts elsewhere outside of GitHub; as I said, it's very useful to see your thought process and allow others to contribute. I would imagine that you may find trying to scroll back through more than 29 posts a little more challenging as you keep documenting, that is all.
29 (now 30) posts are fine. Scrolling is fine. That's not a long issue; this is a long issue. 😁 But we finally got there.
Hopefully, I will be finished before this gets that long.
I have the types mostly tamed. Quite happy with it.
For the initial toy system, all the tests and examples are just placeholders for fundamental but larger patterns. I have a generalized way of defining a `Compute` element. The pipeline can be constructed out of resolved (static) or unresolved (`Compute`, `Source`, or `Data` with unresolved or static elements).
`Compute` elements can only take arguments of the same type, but I am considering generalizing that when I move out of the toy stage. If I can keep it this way, then it will have some advantages.
I have temporarily removed `Streamer`, as I think I need something more general.
I have examples of different styles of `Compute` elements. I have been playing with the type system and elements to see how they mix (vector columns vs vector rows). My instinct tells me that telling the difference at this level will make optimization better, instead of throwing away information, but it makes the types stricter and safer, which is good for now.
I'm now deciding whether I want to create `Compute` elements that are `Source`-aware and can "accumulate" an answer, or whether I want to make another class like `Streamer` that handles that state and the accumulator. That would separate some of that complexity out of `Compute` elements, as I could render them stateless. They are half in, half out at the moment.

It would be nice to be able to construct this behaviour out of the type system, as I have for most of the existing `Data` and `Compute` functionality.
Take MultiplyN: Dimension 1, Cardinality 0 ("horizontal/row", unlimited array) => Dimension 1, Cardinality 1 (scalar).

(I'm using 0 = unlimited, 1 = scalar. I might change to Infinity, 0 later, but I haven't tested how nicely Infinity plays with the type system yet.)
```typescript
class ComputeMultiplyVN extends ElementCompute implements ICompute<number, 1, 0, number, 1, 1>
```
If I multiply a long `Source<number,1,0>` of numbers (effectively a long row vector), I get a single number as the output. The goal is that this process can be suspended and resumed, ideally not within `ComputeMultiplyVN`, as I want to move that state out. Hence the new Streamer/Generator/?
If I feed `Streamer(Source, ComputeMultiplyVN)` a `Source` of vectors, e.g. `Source<number,4,0>`, then it should output an array of numbers, each number representing the terms of one vector multiplied together.
Also, taking a VectorAdd `Compute` element that can add two vectors: can a `Streamer` take that and construct something that can not only add N vectors, but is also able to finesse the code generation to take advantage of a hardware operation, or a sweet spot in performance where adding four vectors at once is better? That last bit is the issue, because something needs to supply that optimization domain knowledge. Can I generalize that? I think I can.
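A rough sketch of that idea: given only a pairwise add, a streamer-like fold can combine N vectors, and a chunked variant stands in for the wider hardware operation a code generator might select. All names here are hypothetical:

```typescript
type Vec = number[]

// The only primitive we are given: a pairwise vector add.
const addV = (a: Vec, b: Vec): Vec => a.map((x, i) => x + b[i])

// Naive left fold: N - 1 pairwise adds.
function addN(vectors: Vec[]): Vec {
  return vectors.reduce(addV)
}

// Chunked variant: consume up to `width` vectors per step, standing in for a
// hypothetical add4 kernel the strategy might pick as the hardware sweet spot.
function addNChunked(vectors: Vec[], width = 4): Vec {
  let acc = vectors[0]
  for (let i = 1; i < vectors.length; i += width - 1) {
    const chunk = vectors.slice(i, i + width - 1)
    acc = chunk.reduce(addV, acc) // one "wide" kernel invocation
  }
  return acc
}

const vs: Vec[] = [[1, 1], [2, 2], [3, 3], [4, 4], [5, 5]]
// both give [15, 15]
```

The two variants are numerically equivalent; the point is that the chunking decision lives outside the pairwise element, which is where the optimization domain knowledge would be supplied.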
I'm interested in a general solution that constructs, via generics and common base classes, the accumulator, separates more state out of compute, and includes the optimization for the code-gen. I need some kind of lambda operator/compositor that can dip into logic, and a factory class to provide optimizations.
If you get my drift :)
The other is related... the internal code generation, where a compute element can create an internal representation of what it will be doing out of other `Compute` elements.
E.g. Multiply, based on input from the strategy (unroll/batch etc.), either declares a single coarse-grained intention to multiply, or generates a sequence of mul8/mul4/mul3/mul2 `Compute` elements when it knows the length, or even creates really fine-grained code where every multiply is represented as an internal `Compute` element. The `Strategy` and `Tactics` for the `Compute` element control this, integrated with `Metrics` that allow the code-gen to be moderated. When we `resolve()`, it might result in repeated code-gen at different parts of the pipeline until some optimum is reached.
That is the goal anyway.
So this will be a fun bit.
Like I said before, all these examples are just mini toy versions of what the system will need to do on a larger scale.
I have a naive type-signature implementation that will compose. In this form, the generator's generic/template parameters can't be completely inferred from the input types, as I am trying to be flexible on the output stage. I will probably need to remove this flexibility and always return a standard form that can be derived from the source type.
```typescript
it('Generator matching 1:1 compute signature', async () => {
  // A source of pairs matching the signature of the compute element.
  const source = new SourceMemory<number, 4, 2>(
    [
      [10, 10, 10, 10],
      [10, 10, 10, 10],
    ],
    [
      [10, 10, 10, 10],
      [10, 10, 10, 10],
    ],
  )

  // The compute element that will take input from the source
  const compute = new ComputeAddV4()

  // The generator
  const generator = new Generator<number, 4, 2, number, 4, 2, number, 4, 1, number, 4, 1>(source, compute)

  // If the generator's generic/template parameters can be inferred from inputs then we can have this
  // const generator = new Generator(source, compute)

  // input pairs output vectors
  // <number,4,2>[] => <number,4,1>[] = <Output<number,4,1>,1,0>

  // This generator effectively becomes a new source, or can we make it appear to be a new compute element?
  // Does it matter, as long as it has a type and matches the signature of IResolves<Output<number,4,2>,1,0>,
  // which is the type integral of the source's output?
  expect(generator).toBeDefined()
})
```
`Generator` looks like it will work. In the current experiment, I am looking at still allowing `Source` to work for individual arguments and `Generator` for collated sets of arguments. I am tightening up the types with a view to simplifying the classes.
The required code in a `Compute` element has been slashed. A bit more type and base-class algebra should get it down to something much more manageable than it was when I started.
Generators that need to transform the `Source` to suit the `Compute` element, as well as `Compute` elements with heterogeneous arguments, will be the next challenge. Variadic tuple arguments and partial argument consumption may work for the latter. I have an idea. Still thinking about how to finesse this with the generator such that the type system performs the maximum amount of lift. What will come out of this, at least, is a proposal with a concrete example for TypeScript to support materialising value-instance type generic arguments accessible as constants.
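A sketch of the variadic-tuple direction (all names hypothetical): the compute element declares its argument tuple, and a generator peels exactly that many items off a heterogeneous stream (partial argument consumption). Note the tuple's length still has to be repeated as a runtime `arity` value, since type-level constants aren't accessible at runtime, which is exactly the gap such a TypeScript proposal would address.

```typescript
// A compute element declares its heterogeneous argument tuple in the type.
interface Consume<Args extends unknown[], O> {
  readonly arity: Args['length'] // runtime copy of the type-level length
  apply(args: Args): O
}

// Example element: scale a vector by a scalar. Args = [number, number[]].
const scale: Consume<[number, number[]], number[]> = {
  arity: 2,
  apply: ([k, v]) => v.map(x => x * k),
}

// Peel the declared arguments off the front of a stream, return the rest.
function consume<Args extends unknown[], O>(
  element: Consume<Args, O>,
  stream: unknown[],
): { result: O; rest: unknown[] } {
  const args = stream.slice(0, element.arity) as Args
  return { result: element.apply(args), rest: stream.slice(element.arity) }
}

const { result, rest } = consume(scale, [10, [1, 2, 3], 'leftover'])
// result === [10, 20, 30], rest === ['leftover']
```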
I am backtracking on horizontal vectors, as they add a disruptive fog to singleton vs vector vs spread argument binding while I go through the process of harmonising all the code and moving it into base classes. I will add them back if I can make everything else work, or will sacrifice them if they are too hard to keep. I would like them to contribute to optimization knowledge, as one would typically get better CPU/GPU cache performance over a contiguous, non-striding source. I may try to hold that knowledge in the object itself, in the `Strategy`, if I can't keep it at a type/class level.
In the meantime, plan B is `Compute2`, `Compute3`, `Compute4`, etc., where N is the number of types supported by the compute element.
The toy pipeline with `Generator` seems to work, just with matching inputs for now. A `Generator` element will take a source, pass it through a single `Compute` element, and generate a new `Source` that contains all the output of that process.
If I create a `Generator`

```
{Generator(0 elements, 0 atoms, 0 index, 1 batch size){
SourceMemory(3 elements, atoms 3, 0 index, 1 batch size)
<= [[[1,1,1,1],[2,2,2,2]],[[3,3,3,3],[5,5,5,5]],[[7,7,7,7],[11,11,11,11]]]}
=>ComputeAddV4=>[]}
```

...and then write it to a `DataStore`, it sucks the data out from the `Generator` (as a `Source`), triggering `resolve()` down the chain, and we get...

```
{DestinationMemory(3 stored, 0 in buffer, 1 batch size)
<= []
=>[[3,3,3,3],[8,8,8,8],[18,18,18,18]]}
```
So, I'm going to keep things simple now and try to add a stepping stone towards code generation.
This will be like a `Generator`, but will take a sequence of `Compute` elements in a complex array: in effect, a multidimensional pipeline. It will itself be a compute element. Or I may be limited to composing just two at a time. I will see how far I can push the type system.
Super busy at the moment. Hopefully back into this soon.
Also relevant.
https://github.com/HigherOrderCO/Bend
I was leaning towards a similar tree-based flow. I'm going to study this and see what can be gleaned.
Here I was thinking that the way I was approaching some of the composition was weird and bordering on unacceptable, but it appears that WebGPU does something similar. I will borrow some terms from here. :D
https://google.github.io/tour-of-wgsl/types/vectors/constructors/
It's not an issue - I'm just here to say I love this.
I want to see this get to the point where it will run large transformers.
For acceleration, you might want to consider something like https://github.com/gpujs/gpu.js, maybe?
Best of luck.