chochain / tensorForth

Forth does tensors, in CUDA.
MIT License

Will this run on a Jetson Nano? #1

Open znmeb opened 2 years ago

znmeb commented 2 years ago

I have a Jetson Nano - 4 arm64 cores, 128 CUDA cores and 4 GB of shared RAM, Ubuntu 18.04, CUDA 10.2. Will cueForth run on it?

chochain commented 2 years ago

Ed, hello again. Jetson is a target of great interest for me as well. cueForth is in its alpha state and currently under active development on my desktop GTX 1660 only. I did compile it with SM 5.2, so hopefully it can work on a Jetson. However, I haven't had a chance to play with one yet. Will keep you posted.

znmeb commented 2 years ago

I have an RTX 3090 and a GTX 1650 Ti too with Windows 11 / WSL :-) ... if you can put your build scripts in the repo it would probably take me a couple of days to get it working on the Jetsons. The downside of Jetsons is that they are now expensive; in terms of price per CUDA core they're not competitive with laptops any more.

chochain commented 2 years ago

Wow, an RTX 3090! You must be very serious about it. 8-) To benchmark the difference between threading models (i.e. token vs. subroutine), I'm churning the code right now. The difference on my eForth repo, which I've just completed, is 2.5x. Too big to pass up. Give me a week or so; I'll get back to you once it stabilizes. BTW, I use Eclipse with CUDA SDK 11.6 on Ubuntu. I think the SDK works even better on Windows. Cheers,
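
P.S. For anyone unfamiliar with the two models: a token-threaded Forth dispatches through an inner interpreter loop over word indices, while a subroutine-threaded one compiles each definition into direct calls. A minimal illustration (hypothetical code, not tensorForth's actual implementation):

```cpp
// Minimal sketch of the two threading models being benchmarked.
// Hypothetical code for illustration only.
#include <cstdio>
#include <vector>

using Word = void (*)();

void w_dup()  { std::puts("DUP"); }
void w_plus() { std::puts("+");   }

// Token threading: an inner interpreter loop indexes a dispatch table,
// paying a table lookup plus an indirect call for every word executed.
std::vector<Word> dict = { w_dup, w_plus };
void run_tokens(const std::vector<int>& tokens) {
    for (int t : tokens) dict[t]();
}

// Subroutine threading: a colon definition is "compiled" into a plain
// call sequence, which branch predictors and caches handle much better.
void run_subroutine() {
    w_dup();
    w_plus();
}

int main() {
    run_tokens({0, 1});   // interpreted dispatch
    run_subroutine();     // direct calls
    return 0;
}
```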

znmeb commented 2 years ago

> I think the SDK works even better on Windows.

My standard operating procedure for CUDA on Windows is to run everything in WSL Ubuntu 20.04. That was what NVIDIA and Microsoft tested in the Windows Insider Builds for Windows 10 and it's what they support in Windows 11. There's some overhead, but even in Docker their N-body container benchmark gets 20 TFLOPS on the 3090. I don't think some of the HPC tools like the compilers even run on Windows itself.

The other side of that coin is that, because of Intel "Optane" and Secure Boot, dual booting with most Linux distros is a bigger hassle than it used to be, so it's a good thing WSL is almost as good as dual booting.

znmeb commented 2 years ago

I don't know if this is relevant or not - I'm still wading through it.

Programming with a Differentiable Forth Interpreter

znmeb commented 2 years ago

I've forked the project and I'm starting on the Jetson testing - repo is https://github.com/AlgoCompSynth/tensorForth. For my own sanity I'll be maintaining issues / project details over there. Meanwhile the answer to the original question, "Will this run on the Jetson Nano?" depends on whether it needs CUDA 11 or can run with the version of CUDA on the Nano, 10.2. Is CUDA 11 a hard requirement?

For the moment I'm testing on a Jetson Xavier NX - 6 ARM Cortex cores, 8 GB of RAM, and 384 CUDA cores. It also has tensor cores, if tensorForth can use them. It's running the JetPack 5.0 developer preview, which has CUDA 11.4.

chochain commented 2 years ago

Ed, I was about to send a mail to you but apparently you're well ahead. Very cool!

Since upgrading to Ubuntu 20.04, I have lost CUDA 10.2. So, aside from CUDA 11.6, I have not tried it on anything else. It would be exciting to learn whether it compiles on your system.

BTW, though the name implies tensors, there is no implementation at all yet, as you can see in my long TODO list. I have just started reading to figure out how torch+numpy store them. I'll probably start with CUTLASS and CUB. Would enjoy hearing your recommendations!
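
For reference on the torch+numpy side: both describe a tensor as a flat buffer plus a shape and per-dimension strides, so views and transposes are just stride manipulations. A minimal sketch of that layout (hypothetical names; nothing like this exists in tensorForth yet):

```cpp
// Strided, row-major tensor layout as used by numpy and torch:
// a flat buffer plus shape and per-dimension strides.
// Hypothetical sketch; tensorForth has no tensor implementation yet.
#include <cstdio>

struct Tensor2D {
    float* data;      // flat, contiguous storage
    int    shape[2];  // {rows, cols}
    int    stride[2]; // elements skipped per step along each dimension

    float& at(int i, int j) { return data[i * stride[0] + j * stride[1]]; }
};

int main() {
    float buf[6] = {0, 1, 2, 3, 4, 5};
    Tensor2D t {buf, {2, 3}, {3, 1}};  // 2x3 row-major: stride = {cols, 1}
    Tensor2D tt{buf, {3, 2}, {1, 3}};  // transpose = stride swap, no copy
    std::printf("t(1,2)=%g tt(2,1)=%g\n", t.at(1, 2), tt.at(2, 1));
    return 0;
}
```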

chochain commented 2 years ago

From 'Programming with a Differentiable Forth Interpreter': "∂4 can learn to fill the sketch and generalize well to problems of unseen size." Wow! I haven't contemplated that far yet, but it does open a window (or a door). Thank you for the link.

znmeb commented 2 years ago

> Ed, I was about to send a mail to you but apparently you're well ahead. Very cool!
>
> Since upgrading to Ubuntu 20.04, I have lost CUDA 10.2. So, aside from CUDA 11.6, I have not tried it on anything else. It would be exciting to learn whether it compiles on your system.
>
> BTW, though the name implies tensors, there is no implementation at all yet, as you can see in my long TODO list. I have just started reading to figure out how torch+numpy store them. I'll probably start with CUTLASS and CUB. Would enjoy hearing your recommendations!

I've got Eclipse installed but I've never used it - I'm a command-line geek. :-) What options do you use on the import to get the build tools set up?

chochain commented 2 years ago

I prefer the command line as well. I'm actually an Emacs guy, but single-stepping gdb in an IDE works a bit better. It would be easier if I had a Makefile. However, for the time being:

  1. Install the CUDA SDK by following the instructions in https://docs.nvidia.com/cuda/nsightee-plugins-install-guide/index.html
  2. Start Eclipse and set your workspace to your development directory
    • /home/chochain/devel in my case
  3. Create a project: File=>Open Project from File System=>pick your ten4 repo directory
    • /home/chochain/devel/forth/ten4
  4. Tune the nvcc options: File=>Properties=>C/C++ Build=>Settings=>NVCC Compiler with the following
    • Dialect => C++14
    • CUDA => Generate SM 5.2 SASS, Generate SM 5.2 PTX
  5. To compile: Project=>Build Project; this should get through the compilation
    • /home/chochain/devel/forth/ten4/Debug/ten4 is created in my case. This is the executable.

Let me know whether it works for you while I try to build a Makefile for the job.
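
In the meantime, here is a rough sketch of what that Makefile might look like, mirroring the Eclipse settings above (hypothetical; the src/ layout is a guess, and SM should match your GPU):

```makefile
# Hypothetical Makefile mirroring the Eclipse/nvcc settings above.
# SM = compute capability: 52 matches the setup above;
# 53 = Jetson Nano, 72 = Xavier NX, 86 = RTX 3090.
NVCC ?= nvcc
SM   ?= 52
FLAGS = -std=c++14 \
        -gencode arch=compute_$(SM),code=sm_$(SM) \
        -gencode arch=compute_$(SM),code=compute_$(SM)
SRCS  = $(wildcard src/*.cu)

ten4: $(SRCS)
	$(NVCC) $(FLAGS) -o $@ $^

clean:
	rm -f ten4
```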

znmeb commented 2 years ago

Things are looking pretty good:

https://github.com/AlgoCompSynth/tensorForth/blob/master/agx-xavier-make.log
https://github.com/AlgoCompSynth/tensorForth/blob/master/agx-xavier-test.log
https://github.com/AlgoCompSynth/tensorForth/blob/master/xavier-nx-make.log
https://github.com/AlgoCompSynth/tensorForth/blob/master/xavier-nx-test.log
https://github.com/AlgoCompSynth/tensorForth/blob/master/nano-make.log
https://github.com/AlgoCompSynth/tensorForth/blob/master/nano-test.log

I know how to automatically determine the compute capability; the next version can pass that as a parameter. It requires building one of the CUDA sample programs, called "deviceQuery".

chochain commented 2 years ago

Ed, thank you very much, and I hope you got some fun out of it, too. 8-) Now we know it might work on other Nvidia chips, too. Seeing them run through Dr. Ting's test cases is very satisfying. Here are the issues I noticed:

znmeb commented 2 years ago

The compute capability code is done and pushed to the repo - see https://github.com/AlgoCompSynth/tensorForth/blob/master/set-envars.sh. It should work on any Linux with the CUDA samples installed. But there are probably better ways to query the device(s) in C++; I'm a shell programmer, so I did it with grep and sed. :-)

Also, it now works on Windows 11 / WSL / Ubuntu 20.04 on my RTX 3090. I have a laptop dual-booted between Windows 11 and Ubuntu 20.04 with a GTX 1650, but I'm assuming it'll work there too.

znmeb commented 2 years ago

Now that I think of it, you can probably find the C++ way to query the devices in the PyTorch source; they have a function that lists the devices. Of course, they might be doing it via pycuda or something like that.
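
For what it's worth, the CUDA runtime API reports this directly, so neither deviceQuery parsing nor the PyTorch source is strictly needed. A minimal sketch using the standard cudaGetDeviceProperties call:

```cpp
// Query each device's compute capability via the CUDA runtime API,
// instead of parsing deviceQuery output with grep and sed.
// Build with, e.g.: nvcc -o cc_query cc_query.cu
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    if (cudaGetDeviceCount(&n) != cudaSuccess || n == 0) {
        std::fprintf(stderr, "no CUDA device found\n");
        return 1;
    }
    for (int i = 0; i < n; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        std::printf("device %d: %s, sm_%d%d, %.1f GB\n", i, prop.name,
                    prop.major, prop.minor,
                    prop.totalGlobalMem / 1073741824.0);
    }
    return 0;
}
```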

znmeb commented 2 years ago

The rebuild on my desktop (the one with the 3090) for release 2.0.0 is in progress. It doesn't look like CUTLASS is in the repositories, so I'm building it from source. It's taking a while, and I'm only building for the GPU on this machine.

Once I get this done I'll attempt it on a Jetson with JetPack 5.0. If it will work with CUDA 10.2 I'll try it with JetPack 4.6.1, but I doubt it will work on the Nano; it only has 4 GB of RAM and an ancient GPU.

chochain commented 2 years ago

Ed, your enthusiasm encourages me to plow forward. Thank you. The inclusion of CUTLASS was a residual of my early GEMM library testing with command-line options; it escaped my attention when I dropped the need for it. I'm reworking opt.h right now so that people will not need CUTLASS on their systems, which should hopefully lower the system requirements.
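
A common pattern for that is to fence the dependency behind a compile-time flag with a plain-CUDA fallback. A hypothetical sketch (the actual opt.h rework may well differ; the flag name and fallback kernel are mine):

```cpp
// Hypothetical sketch of making CUTLASS optional behind a build flag;
// T4_USE_CUTLASS and the fallback kernel are illustrative names.
#ifdef T4_USE_CUTLASS
#include <cutlass/gemm/device/gemm.h>   // only needed when the flag is set
// ... dispatch GEMM to CUTLASS device kernels here ...
#else
// Fallback: a naive CUDA GEMM, so CUTLASS need not be installed at all.
__global__ void gemm_naive(const float* A, const float* B, float* C,
                           int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k) acc += A[row * K + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}
#endif
```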