dusty-nv / jetson-reinforcement

Deep reinforcement learning GPU libraries for NVIDIA Jetson TX1/TX2 with PyTorch, OpenAI Gym, and Gazebo robotics simulator.
MIT License

Jetpack 2.3 #1

Open S4WRXTTCS opened 7 years ago

S4WRXTTCS commented 7 years ago

Compiling under Jetpack 2.3 resulted in the following error:

```
deepQLearner.cpp:13:18: fatal error: luat.h: No such file or directory
```

I did try the pre-built package for Jetpack 2.2, and that seemed to run fine.

dusty-nv commented 7 years ago

That file should have been created during the cmake configuration step. Can you check if it exists in jetson-reinforcement/build/torch/include?
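
(For reference, a quick way to check from a terminal, assuming the repo was cloned to your home directory:)

```sh
# List the Torch headers generated by the cmake configuration step
# (adjust the path if the repo was cloned elsewhere)
ls ~/jetson-reinforcement/build/torch/include/
```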

S4WRXTTCS commented 7 years ago

That file and the TH/THC subfolders aren't there, but everything else is.

I didn't see any error messages when I did the cmake configuration step. It did take a good while, but nothing unexpected.

DanMcLaughlin commented 7 years ago

Yeah, actually the root error is earlier in the cmake phase (typing this in by hand since Firefox on 2.3 crashes):

```
In function 'THByteVector_vectorDispatchInit':
simd.h:64:3: error: impossible constraint in 'asm'
   asm volatile ( "cpuid\n\t"

lib/TH/CMakeFiles/TH.dir/build.make:350: recipe for target 'lib/TH/CMakeFiles/TH.dir/THVector.c.o' failed
```

The later error is likely a result of this failure.

dusty-nv commented 7 years ago

See these open issues on the Torch7 GitHub regarding the issue: #762, #766

For now, I just checked in a (temporary) modification to CMakePreBuild.sh which will check out a slightly older commit of the Torch7 repo.
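
(As a rough illustration of what that pin looks like, the hash below is a placeholder and not the actual commit used in CMakePreBuild.sh:)

```sh
# Sketch: clone Torch7 and check out an older, known-good commit
# (<older-commit-hash> is a placeholder for the commit pinned in CMakePreBuild.sh)
git clone https://github.com/torch/torch7.git
cd torch7
git checkout <older-commit-hash>
```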

DanMcLaughlin commented 7 years ago

Thanks Dusty. I removed the build directory and tried again; it gives a slew of errors and compilation fails (after doing a make it now gives the following error, but that's just a downstream result):

```
.../c/deepQLearner.cpp:16:21: fatal error: THC/THC.h: No such file or directory
```

The earlier errors are variations of "cc: internal compiler error: Killed (program cc1plus)" (this is after a reboot) - lots of killed programs while it's trying to build. Usually these occur from some OS issue like limited RAM.

So, hmm. I'm also getting a killed Firefox; is there some resource limitation in 2.3 causing all these kills?

Otherwise, to try a rebuild, all I need to do is remove the build directory, correct?

Thanks -

dusty-nv commented 7 years ago

It appears a torch cudnn package update unrelated to 2.3 added a directive to compile in parallel with make -j3. Will try patching it back to -j1 to reduce the memory usage.

DanMcLaughlin commented 7 years ago

Hi Dustin, any luck?

dusty-nv commented 7 years ago

OK, I was able to get it building again by patching the cutorch rockspec in commit 62af1a1 to force -j1 jobs and by mounting swap (SATA or SD card). One of the cutorch tensor source files was consuming all the memory until the compiler was killed (at the time the system was otherwise consuming ~800MB of memory, i.e. in the normal range). Attached is the build log of it building again with JetPack 2.3 / Ubuntu 16.04: log.txt

S4WRXTTCS commented 7 years ago

The only way I was able to get cmake to work correctly was to do what Dustin recommended.

Doing this was a bit tricky though. What it requires is modifying the cutorch-scm-1.rockspec file under build/cutorch/rocks, and then you have to make it read-only. If you don't make it read-only, it ends up being overwritten.

The lines I modified were:

- Line 27, where I changed jopts=1
- Line 29, where I changed jopts=1

It's likely not the best way to do it, but it got the job done.

Summary of steps (a rough shell sketch follows below):

1. Delete the build directory and recreate it
2. Git clone the following package from the build directory -> https://github.com/torch/cutorch
3. Modify cutorch-scm-1.rockspec
4. Make the file read-only
5. Run the cmake script
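
A rough shell sketch of those steps (the sed edit and paths are illustrative; editing the two jopts lines by hand works just as well):

```sh
# Re-create the build tree and fetch cutorch manually (sketch, not verbatim)
cd ~/jetson-reinforcement
rm -rf build && mkdir build && cd build
git clone https://github.com/torch/cutorch

# Force single-job builds in the rockspec (lines 27 and 29 set jopts),
# then mark the file read-only so it doesn't get overwritten
sed -i 's/jopts=[0-9]*/jopts=1/g' cutorch/rocks/cutorch-scm-1.rockspec
chmod a-w cutorch/rocks/cutorch-scm-1.rockspec

# Re-run the configuration step
cmake ../
```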

As to Firefox, it's my understanding it doesn't work with JetPack 2.3.

Edit - The latest commit accomplishes the same thing, but for some reason I didn't have to mount swap.

dusty-nv commented 7 years ago

I did verify in the build log that cutorch was "Building on 1 cores" and that the CMake script change had taken effect. However, when compiling the TensorMathPointwise files (I think it was), the OOM killer stepped in until I mounted swap.

Since all of the files are self-contained within the build/ directory, torch/etc. doesn't need to be compiled from source for each Jetson and could be copied around for JetPack 2.3.

Also note that the torch repo is rolled back to a prior commit right now in the CMake script, due to the issues mentioned in this post above.

DanMcLaughlin commented 7 years ago

Thanks guys. Tried it last night and still get kills. I had logged out of the desktop to save memory but apparently I still need swap space.

Dusty, any issues with setting up swap? I'm not seeing a lot in a search other than that I'd need to recompile the kernel, or has that changed now?

dusty-nv commented 7 years ago

You don't need to recompile the L4T kernel; I followed these normal instructions: http://askubuntu.com/a/33703

To tell if it's successful, you should see swap memory appear in /proc/meminfo
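
(For reference, the gist of that answer is a plain swap file; sizes and paths here are just examples:)

```sh
# Create and enable an 8GB swap file (size/location are examples; SATA or SD card both work)
sudo fallocate -l 8G /mnt/8GB.swap
sudo chmod 600 /mnt/8GB.swap
sudo mkswap /mnt/8GB.swap
sudo swapon /mnt/8GB.swap

# Verify that the swap space is active
grep Swap /proc/meminfo
```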

DanMcLaughlin commented 7 years ago

OK, I got a 32GB SD card for swap and an SSD for building. The box thrashes so badly it freezes up, but I was able to get a top page which shows some tens or hundreds of cudafe, cc1plus, and cudafe++ processes running. This appears to be during compilation of cutorch (e.g. THCTensorMathPointwise.cu). Will see if it manages to get through.

OK, it got through the cmake! Now on doing a make it fails here:

```
c/deepQLearner.cpp: In member function 'bool deepQLearner::initLua()':
c/deepQLearner.cpp:342:51: error: invalid use of incomplete type 'struct THCState'
  printf("[deepRL] cuTorch numDevices: %i\n", THC->numDevices);
note: forward declaration of 'struct THCState'
```

You guys didn't get this error?

dusty-nv commented 7 years ago

It's because cutorch was updated since my last comment - see cutorch commit 44c5193.

In master, I've now commented out that line (32cb67c). There is also a pre-built archive for JetPack 2.3 released here: L4T-R24.2-RC1

DanMcLaughlin commented 7 years ago

It works! Thanks Dusty, neat little program

AerialRobotics commented 7 years ago

I could never get this working for JetPack 2.3. The 'cmake' works, but doing 'make' ends up producing the fatal error "THC/THC.h: No such file or directory". I even tried downloading the pre-built archive 'L4T-R24.2-RC1'. Executing ./deepRL-console hello.lua throws "libluajit.so: cannot open shared object file: No such file or directory". Should I just go back to JetPack 2.2?

dusty-nv commented 7 years ago

OK, I've updated master to build again with the latest Torch changes. If you try cloning the repo again, it should work. If you still get the THC/THC.h error, please confirm that the build/torch/include/THC/THC.h file is present; otherwise, the cmake config script may not have completed correctly.

Regarding the pre-built archive, does it work if you extract the contents to /home/ubuntu/workspace/jetson-inference?
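
(Roughly, assuming the release asset is a tarball; the filename below is a placeholder:)

```sh
# Extract the pre-built archive so it lands at the expected path
# (archive name is a placeholder; use the actual file from the L4T-R24.2-RC1 release,
#  and it is assumed here to unpack into a jetson-inference/ directory)
mkdir -p /home/ubuntu/workspace
cd /home/ubuntu/workspace
tar -xzvf ~/Downloads/jetson-reinforcement-prebuilt.tar.gz
ls /home/ubuntu/workspace/jetson-inference
```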

AerialRobotics commented 7 years ago

The pre-built archive worked once it was executed from within the directory you specified above. I was doing it out of /home/ubuntu/jetson-inference. Running the demo, I noticed that after 400 epochs the win rate dramatically decreased from .90 to .50. By 1200 it was back to .90, but then it dropped sharply again. Once the algorithm learns, why can't it maintain a high percentage of wins?

DanMcLaughlin commented 7 years ago

@AerialRobotics yeah, I noticed this too; I've been meaning to dig into the reason. My first guess is overfitting, and I was going to try saving the model when it reaches 90%+, then switching to inference.

AerialRobotics commented 7 years ago

The build did not go so well. Could not get cmake to even complete. Started receiving many 'Killed' messages. Please see screenshot.
(screenshot: build output)

gwljf commented 7 years ago

@AerialRobotics I think the cause of your problem is the memory size. As discussed above, you can add swap. For me, 12GB of swap is OK.