MightyPirates / OpenComputers

Home of the OpenComputers mod for Minecraft.
https://oc.cil.li

Feature Request: GPU Compute #2279

Closed MajorGeneralRelativity closed 7 years ago

MajorGeneralRelativity commented 7 years ago

Allow really slow GPU computing using OC graphics cards. Just like in real life, if you can parallelize the code, it would allow a performance speedup. I'm unsure whether this capability should be exposed directly as OpenCL, or if it should be done in Lua and translated to OpenCL (not sure if that is possible?). All suggestions are open, but I think that allowing access to even a very slow GPU would allow some programs to be sped up tremendously.

xarses commented 7 years ago

Feature requests should include several use cases to help us understand what you are attempting to accomplish. You should also describe why the existing implementation doesn't meet these use cases for you. Finally, please identify how this feature would differ from the existing graphics feature requests, #779 and #1901.

MajorGeneralRelativity commented 7 years ago

@xarses As far as how it differs from #779 and #1901, it does not involve graphical operations or VRAM (although VRAM would probably be cool). It is merely about involving the GPU in compute operations (think NVIDIA Tesla cards or similar). This would allow parallel workloads to be sped up.

Some use cases:

  1. All the stuff people use GPU computing for in real life.
  2. I have an encryption system that could be made parallel for a potentially massive performance boost, especially because an update will make it far less I/O bound.
  3. I've seen some work on hard drive data compression, and I believe that can be made parallel too.
xarses commented 7 years ago

I still don't understand. People IRL use GPU-based compute because the GPU is extremely efficient for complex math, especially math related to polygons, and for these types of computations it's orders of magnitude more efficient than a general-purpose CPU. We don't have any of these GPU functions, as they are artifacts of 3D graphics pipelines. (Our OC graphics cards are barely comparable to EGA graphics.)

The closest equivalent we can do here is implement Lua-intensive algorithms in Java/Scala and create a component to bind them. For example, this is what the existing data cards from OC and the encryption cards from Computronics do.

From the above, you seem to just want multiple Lua threads, which already works in OC (see the sketch below).
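
For illustration, a minimal sketch of running two code paths cooperatively with nothing beyond stock Lua coroutines; the task names and step counts are made up for the example:

local function worker(name, steps)
  return coroutine.create(function()
    for i = 1, steps do
      -- do one slice of work, then hand control back to the scheduler
      coroutine.yield(string.format("%s: step %d/%d", name, i, steps))
    end
  end)
end

local tasks = { worker("encrypt", 3), worker("compress", 3) }
while #tasks > 0 do
  for i = #tasks, 1, -1 do
    local ok, msg = coroutine.resume(tasks[i])
    if coroutine.status(tasks[i]) == "dead" then
      table.remove(tasks, i)   -- task finished; drop it
    elseif ok and msg then
      print(msg)               -- progress report from the task
    end
  end
end

Nothing here runs truly in parallel; each task just makes progress whenever it is resumed, which is the cooperative model OC's Lua machine already supports.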

NickNackGus commented 7 years ago

Using multiple computers communicating in a network for parallel compute isn't feasible due to the latency from the network. Using redstone signals with redstone cards and computers or robots placed in adjacent blocks isn't space efficient for a start, and is also too slow for parallel compute.

I don't think that GPU based parallel compute units would fit the theme of the mod (1980s computers). However, there are exceptions, such as the 3D printer and holographic projector. In addition to that loophole, there was technology available at the time for certain parallel operations to be accelerated.

But before I list those options, I must add that, for most cases, parallel computing will require a language designed for parallel computing.

Being a computer engineer, I'm leaning more towards the hardware design methods of parallel computing. There are languages that are used to "program" hardware (connect different logic gates together as physical hardware). VHDL was created in the 1980s. Verilog, which is much, much easier to learn and use (in my very strong opinion), appeared in 1984. Both languages can be used to develop ASICs, in which a hardware design (written in VHDL/Verilog, or at a lower level that is harder to work at) is converted into a set of masks. The masks are very expensive to make (millions of dollars), but act as a template for very inexpensive parts (potentially tens of parts for every penny spent). This is how many of the chips for 1980s computers came to be so small and inexpensive (i.e., affordable to consumers at all).

I cannot guarantee that it would work in any useful way, but a simpler solution that wouldn't involve a major update would be some form of parallel compute unit: a specialized microcontroller of sorts, with some sort of parallel port on each side for rapid data communications. Ideally multiple of them would fit within the same block (such as 2x2x2 within a block, or otherwise functioning within a special inventory block as a grid of 8x8 or so). It would need to be able to run at least once per game tick, and would be appropriately limited by itself. The idea is to have many of these parallel compute units placed directly next to each other, each able to hold only a single number at once (maybe two or four for higher tiers?), and reading the directly neighboring compute units as though they were internal to themselves. I would also recommend some way of grouping the parallel compute units, such as dyeing them, so that instructions could be sent to some of them within the network, but not others.

For the sake of easy reprogramming, a parallel compute controller would need to be adjacent to the parallel compute units somewhere, acting as both an interface for communicating with other types of computers, and the method of issuing instructions to the parallel compute units. This would allow for Single Instruction, Multiple Data compute, similar to graphics processors, encryption processors, and other high bandwidth data processing.

An example program could look something like this:

if myColor == blue then    -- only units dyed blue execute this step
  if right ~= nil then
    self = right           -- copy the right neighbour's value into this unit
  else
    self = 0               -- no neighbour to the right; clear the value
  end
end

Which would shift data from the right to the left for all blue parallel compute units.
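
Purely as an illustration of that single-instruction, multiple-data idea, here is a sketch that simulates a short row of such units in plain Lua; the layout, colors, and values are invented for the example:

-- one table entry per compute unit; each holds a single number
local units = {
  {color = "blue", value = 1},
  {color = "blue", value = 2},
  {color = "red",  value = 3},
  {color = "blue", value = 4},
}

-- every unit evaluates the same "program" against its own data in lock-step,
-- so the new values are computed first and written back afterwards
local next_values = {}
for i, unit in ipairs(units) do
  if unit.color == "blue" then
    local right = units[i + 1]
    next_values[i] = right and right.value or 0   -- copy neighbour, or clear
  else
    next_values[i] = unit.value                   -- non-blue units are untouched
  end
end

for i, unit in ipairs(units) do
  unit.value = next_values[i]
end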

If I'm honest, however, I feel that parallel compute is unlikely to be a part of OpenComputers, and more likely to appear as a mod that can interact with OpenComputers, due to the complexity involved.

MajorGeneralRelativity commented 7 years ago

@xarses I'm aware that OC's graphics cards are pretty low tech. However, even if the GPU compute ran really slowly by today's standards, it would still probably be able to run much faster than an OC CPU.

@NickNackGus You're right that we would need a language for parallel computing, and that we would need something dedicated to the task. I suppose your method would work as well; I was just thinking of the first parallel computer that came to mind.

xarses commented 7 years ago

No. Again, there is no way to run any code on the OC GPU because it has no functions other than those for working with the color or text on the screen. The reason "GPU compute" works IRL is that the GPU is a fully functional processor AND has extremely efficient complex math functions AND has its own memory space to execute those functions against. It runs 'faster' than a CPU because a hardware-level implementation of a function takes far fewer clock cycles than a software implementation built from the simpler primitives of a CPU that lacks the complex math functions the GPU has.

So in order to do this in the OC graphics system, we would first need the features from the graphics requests linked above (and further 3D support).

But what it sounds like you are looking for is grid compute and optimized functions. Neither is inherently related to the GPU.

Grid compute would be implemented at a software level, and the needed primitives already exist in OC's Lua interpreter.

Optimized functions: IRL this is accomplished by using a special-purpose hardware module that can perform functions faster than the simpler primitives of a general-purpose CPU. We regularly offload things to dedicated graphics or encryption modules. In OC, this is accomplished by moving something out of Lua, which is interpreted at runtime, into Java, where it is natively compiled and executed by the JRE. For complex operations like encryption, this results in orders of magnitude of increased performance.
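
From the Lua side, that offloading looks roughly like the sketch below. This assumes an OpenOS computer with a data card installed; the exact method set depends on the card's tier, so treat the call as illustrative rather than guaranteed:

local component = require("component")
local data = component.data   -- errors here if no data card is installed

local payload = "some fairly large string to hash"

-- the hash is computed by the mod in native Java code, so from Lua's point of
-- view it costs roughly one component call instead of thousands of Lua operations
local digest = data.sha256(payload)
print("digest length: " .. #digest)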

MajorGeneralRelativity commented 7 years ago

I'm aware that I'm not asking for anything that couldn't be done on the CPU, but having GPU compute functionality would make it much faster. It also would help to get around the fact that Lua does not support true multi threading, and transferring data over a network introduces significant overhead that would be unsuitable for small instances. Edit: I'm aware that this functionality would have to be added to the OC GPU and does not exist right now.

NickNackGus commented 7 years ago

Hang on - are you asking for OpenComputers to emulate parallel computing, or to actually use parallel computing? Keep in mind, the servers people rent to host Minecraft and mods such as OpenComputers may not support parallel computing. Many of them don't need or have actual graphics cards, and a fair number of them use virtual machines to host many Minecraft servers on their physical servers without allowing Minecraft server owners to access each other's information. True parallel computing would be impossible to implement for many server owners.

MajorGeneralRelativity commented 7 years ago

@NickNackGus That was the main issue I was wrangling with. In a perfect world, I would like OC to attempt to actually use parallel computing, with emulation as a fallback. Even emulated parallel computing, with multiple threads on one OC computer (making use of the physical server's multi-core CPU), would allow for a performance speedup. The capability to actually use the GPU when available would be incredible, though.

xarses commented 7 years ago

I'm aware that I'm not asking for anything that couldn't be done on the CPU, but having GPU compute functionality would make it much faster.

Make what faster? GPU compute works IRL because GPUs are better with complex math. It doesn't make a generic program faster. It makes complex math data processing faster. This is a narrow use case. What exactly do you think will run faster in OC because of this?

It also would help to get around the fact that Lua does not support true multi threading

A lot fewer things are truly multithreaded than you expect. Instead, most are context-yielding and relying on a fast enough CPU that you don't notice too much. Regardless, you can run multiple code paths at the "same" time in the current Lua BIOS. The implementation in Java results in most of the code (given the server isn't too busy) being run in separate threads. This is as close to a multithreaded environment as will ever be implemented for OC.

transferring data over a network introduces significant overhead that would be unsuitable for small instances

Eh? The whole point of grid compute is to give enough work to the endpoints that this isn't a problem. If the work is that small, you can just run it on the leader; otherwise you have to eat this cost, as it's part of the problem.
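
As a rough sketch of what that looks like in OC, here is the worker side of such a grid-compute setup over the in-game network; the port number, task name, and workload are all made up for the example:

local component = require("component")
local event = require("event")
local modem = component.modem   -- requires a network card

local PORT = 1234               -- arbitrary port chosen for this example
modem.open(PORT)

while true do
  -- modem_message: event name, receiver, sender, port, distance, payload...
  local _, _, sender, port, _, task, n = event.pull("modem_message")
  if port == PORT and task == "sum_squares" then
    -- the chunk of work should be big enough that this loop dwarfs the
    -- cost of the network round trip
    local total = 0
    for i = 1, tonumber(n) do
      total = total + i * i
    end
    modem.send(sender, PORT, "result", total)
  end
end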

MajorGeneralRelativity commented 7 years ago

What will run faster?

  1. The encryption algorithm I'm developing.
  2. The targeting algorithm I'm developing that could be operating on hundreds of targets simultaneously.
  3. I'm sure other people would do some cool stuff too.

As far as your comment on grid computing goes, some workloads would be good for that, like my encryption algorithm, but I don't expect it to be used like that except for edge cases. Having GPU compute would pack all that power into one computer.

gamax92 commented 7 years ago

You're literally just trying to get around OC's intentional speed limitation by adding faster CPUs in OC and multiple CPUs per computer. Calling it "GPU Compute" doesn't mean anything, and any implementation of this in OC would bear no resemblance to real-life GPU compute due to needing to be user friendly, sandboxed, and slowed down to not overload the server. Kinda like how current OC CPUs are.

I agree that this is unnecessary and doesn't fit the style of the mod. If you need multithreading, go run multiple OC computers together over a network. If you need fast, real encryption, the data card has some encryption methods you can use.

ds84182 commented 7 years ago

I'm actually partial to this. It seems great, but the problem is the implementation. A different approach would be SMP, by allowing multiple CPUs inside a machine as coprocessors, where each coprocessor is implemented as a card. Then you can bootstrap it and send messages to it as a component. It would be a bit easier to set up than an array of microcontrollers. The only unfortunate bits are that the coprocessors would have to share memory with the main computer, and that the coprocessors would only have selective access to components.

MajorGeneralRelativity commented 7 years ago

@gamax92 Real GPU computing wouldn't get around the CPU speed limit because only some things can be sped up that way. As far as user friendliness, why does it need to be user friendly? Real life GPU compute programming is very hard (or so I have heard).

MajorGeneralRelativity commented 7 years ago

@ds84182 If we go the multiple CPU route, they should have separate memory, and you would have to deal with a NUMA setup for extra complexity.

xarses commented 7 years ago

The encryption algorithm I'm developing.

Complex math outsourcing.

The targeting algorithm I'm developing that could be operating on hundreds of targets simultaneously.

Complex math outsourcing.

I'm sure other people would do some cool stuff too.

Probably more complex math outsourcing.

Again, I think you're vastly misunderstanding why "GPU compute" is "faster". This is because of complex math offloading onto a specialized chip (your graphics processor).

This kind of speedup isn't needed in OC's GPU, as we don't support complex 3D, so there is nothing to offload. So it's not a "never", but it's very much not in line with the current direction of the project; as I noted, the previously linked issues (and other 3D support) are prerequisites before adding math offloading can be considered.

However, complex math offloading is something that can be supported, and can even be addressed as an add-on mod; see the Computronics encryption modules. This is accomplished by moving something complex out of the high-level, but costly-to-invoke, Lua programs that you run in the computer, into something that executes directly in Java. This is the OC equivalent of running in hardware.

Your separate objective of real multithreading in OpenOS / the Lua BIOS would have to be handled separately from this, as "GPU compute" is not the same thing as multithreading.

MajorGeneralRelativity commented 7 years ago

I'm aware that GPU compute is not the same as multi-threading, although I understand your confusion based on how I was talking. GPU compute is the driving focus of this feature request, while multi-threading through SMP or multi-core OC CPUs is a bonus feature, and I suppose should be discussed elsewhere.

I'm aware that the OC GPU currently doesn't do any complex math outsourcing, but I would like the feature to be added. Even if it was only equivalent in performance to something like an NVIDIA 8800GT or an AMD 2900XT (both ~10 years old), it would almost certainly vastly outperform an OC CPU on complex math, while not annihilating a server's GPU, even if it is something like an Intel HD Graphics 530.

I would say it's not completely out of line with OC's theme, as GPU computing goes back even further than the above graphics cards, and OC does have futuristic stuff like nanomachines, hoverboots, drones, and holographic projectors. These were not available in the 80s/90s, and some of those are still not available today.

payonel commented 7 years ago

I support what @gamax92 said: you're essentially asking for more CPU power. Or perhaps you're just asking for more CPU instructions for "complex math", which would essentially become more math API. As the original request of this GitHub issue is explained, we'll have to decline the request. If you have a specific math function you'd like added to the data card, or even the GPU, feel free to make it a separate request. But for requests to run in parallel? No, OC machines are only ever going to have one machine thread.

MajorGeneralRelativity commented 7 years ago

I'm not asking for any more CPU power at all. I'm asking for OC to be able to use the GPU that is in almost any computer running MC (unless there are some crazy high-end servers with Xeon E5/E7s running MC). Requesting specific math functions wouldn't really work too well, because I want SIMD functionality so that I can rapidly manipulate dozens of targets simultaneously, which would greatly reduce the number of zone controllers or encryption nodes I would need to accomplish a given task.

gamax92 commented 7 years ago

Look at this suggestion from the point of view of a feature that everyone who uses OC can use:

You want apparently unrestricted access to potentially non-existent hardware (and no, you don't need high-end Xeons to lack a GPU; many servers out there either don't have GPUs installed or run as split virtual machine resources with no exposed GPU access), or even access to hardware that the user the server runs as has no permission to access.

payonel commented 7 years ago

Even if we gave access to the real GPU through an API, there would have to be a software fallback. Regardless, this comes down to "more processing power", whether parallel or serial. FWIW: I don't have a Xeon in my server, and it has no GPU. SIMD could be handled through an API; just because it is some math API doesn't restrict how we implement it. You are welcome to open a ticket for a SIMD API on the data card (see the stand-in sketch below), but again, we're not adding parallel processing to a single OC machine.
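
To make the calling side of such a hypothetical SIMD-style API concrete, here is a stand-in written in plain Lua; no data card method with this shape exists today, and the names are invented for illustration:

-- simd.add is a plain Lua stand-in for a hypothetical "one call, many
-- elements" data card method
local simd = {}
function simd.add(a, b)
  local out = {}
  for i = 1, #a do
    out[i] = a[i] + b[i]
  end
  return out
end

-- e.g. advancing many target positions in a single call
local xs = simd.add({1, 2, 3, 4}, {10, 20, 30, 40})
print(table.concat(xs, ", "))   --> 11, 22, 33, 44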

MajorGeneralRelativity commented 7 years ago

The software fallback could still make use of the fact that just about every CPU on the planet has at least 2 cores, which would still provide a performance speedup. Not even close to what a GPU can manage, but enough for the fallback to be viable, while GPU enabled servers/users would be able to execute those programs at a faster rate.

It also isn't just "make the OC CPU faster pl0x kthnx". It offers a challenge for advanced users to make use of a co-processor that can offer a tremendous speedup if they put the effort in to make their code parallel, which is far more difficult than making code that just works.

MajorGeneralRelativity commented 7 years ago

In the light of some fuller discussion on the challenges of ensuring that accessing a GPU doesn't lock up the system, I propose that we leave the issue open while I work my way down my programming queue and get to learning OpenCL so I can provide some solutions. In the meantime, I would like others to chime in with suggestions on how we can make this work.

gamax92 commented 7 years ago

It really should have been closed a long time ago.

MajorGeneralRelativity commented 7 years ago

But if I can learn OpenCL and come up with a convincing proposal, it would prevent me having to open another issue

skyem123 commented 7 years ago

My thoughts on this after seeing @MajorGeneralRelativity and @payonel talk about this on IRC.

I think that the GPU should have VRAM that can be used for different purposes, such as stuff being displayed, extra buffers (double buffering), memory to peek/poke for fun, and "shaders".

Shaders would basically be programs that run asynchronously to the Lua code. I'd recommend some sort of simple virtual machine with a simple ISA and Harvard architecture. There would be two types of shaders, "compute" and "graphical".

"compute" shaders would basically be useless except for trying to squeeze more performance into OC which is where the fun is.

"graphical" shaders would be able to directly affect visible graphics stuff (and maybe extra buffers?), and would run on the Minecraft client so that the network won't be melted when you edit stuff quickly. Synchronising would be a pain but it would be neat.

Using "real" stuff such as OpenCL seems like it'd be a security risk at the least. It's worth noting that there are security issues directly related to real life VRAM not being cleared after it's finished being used by something. Using real GPUs with OpenCL it breaks the fun people get with trying to squeeze performance from a virtual low powered machine, look at PICO-8!

That's my thoughts so far...

MajorGeneralRelativity commented 7 years ago

I'm not against implementing these in OC; I just would prefer them not to be called GPU compute, because they don't use the real GPU. As far as squeezing performance goes, I wouldn't propose letting an OC computer use 100% of the IRL GPU, as that would cause massive performance problems. Instead, it should be limited to a small amount to simulate a GPU that is around a decade old, or maybe more.

skyem123 commented 7 years ago

GPUs sadly do not really work like that. Also, it should be GPU compute because the graphics card component is called... gpu, so... it's obviously computation with the gpu component.

MajorGeneralRelativity commented 7 years ago

@skyem123 Graphical shaders should be called just that, but compute shaders should be under a "Math API" or somesuch, because it doesn't really use the GPU in OC or IRL.

skyem123 commented 7 years ago

Well... if "compute shaders" are based off of the same system that "graphics shaders" do... then they should both run off of the virtual GPU.

MajorGeneralRelativity commented 7 years ago

But they're not

skyem123 commented 7 years ago

Well... if I was making a "shader" system I'd make both compute and graphics run off the same system. :P

MajorGeneralRelativity commented 7 years ago

@skyem123 But they wouldn't actually run off of the same system. Compute shaders bear no relation to a graphics card.

payonel commented 7 years ago

The only feature actually requested is "parallelize the code" - and we have decided not to support concurrent code execution in the same Lua machine state. We have also decided not to support executing code on the GPU (real or software-based). Thanks for your interest!