Jinming-Li opened 8 months ago
Please make sure that you have actually downloaded the DROID dataset per our instructions in Preprocessing Datasets, and that you have changed `DATA_PATH` to the directory where you downloaded it. Also note that if you downloaded `droid_100` instead of the full `droid` dataset, you need to rename its folder to `droid` for things to work out of the box. TFDS will search in `DATA_PATH` for a folder called `droid`.
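As a quick sanity check, a sketch like the following can confirm that TFDS actually finds the dataset under your data directory (the `/path/to/data` value is a placeholder, and using `tfds.builder` with `data_dir` is just one way to do this — adjust it to however you load the data in your setup):

```python
import tensorflow_datasets as tfds

# Placeholder: set this to the same DATA_PATH you use in the training config.
DATA_PATH = "/path/to/data"

# TFDS looks for DATA_PATH/droid/<version>/ containing the built dataset files.
builder = tfds.builder("droid", data_dir=DATA_PATH)
print(builder.info)  # prints features/splits if the dataset was found
```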
When I run the code, I get:
```
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-30 14:42:20.693664: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
An NVIDIA GPU may be present on this machine, but a CUDA-enabled jaxlib is not installed. Falling back to cpu.
Traceback (most recent call last):
  File "/data/private/ljm/droid_policy_learning/robomimic/scripts/train.py", line 37, in ...
    ...
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
It seems that your torch installation does not work with CUDA, which is likely an issue with how you installed torch and not with the droid_policy_learning repo. Please check whether you can open a Python session and whether the following runs without error:

```python
import torch
torch.cuda.is_available()
```

If not, please debug your torch installation first.
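For a bit more signal while debugging, a sketch like this prints the build information that usually explains a CPU-only install (these are all standard torch attributes; the values to expect depend on your environment):

```python
import torch

print(torch.__version__)          # a "+cpu" suffix indicates a CPU-only build
print(torch.version.cuda)         # None for CPU-only builds
print(torch.cuda.is_available())  # should be True on a working CUDA install
print(torch.cuda.device_count())  # number of visible GPUs
```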
When I run the training code, the process often gets killed after two rounds of training due to insufficient memory. Does the program keep increasing its memory usage as it runs, beyond the small random experience replay (shuffle) buffer?
If you're running low on memory you can try the following:
The first two will make your data loading slower; the third may change the training dynamics if you make the shuffle buffer much smaller, so be careful with that.
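In this repo these are plain keys in the training config; the sketch below illustrates the kind of adjustment meant here (`shuffle_buffer_size` is the name used later in this thread, while the two data-loading thread counts are illustrative names — check your own config file for the exact keys):

```python
# Hypothetical excerpt of the data-loading config -- key names other than
# shuffle_buffer_size are assumptions; look up the real ones in your config.
config = {
    "traj_transform_threads": 8,     # assumed name: fewer threads -> slower loading, less RAM
    "traj_read_threads": 8,          # assumed name: fewer readers -> slower loading, less RAM
    "shuffle_buffer_size": 100_000,  # smaller buffer -> less RAM, weaker shuffling
}
```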
Thanks, I turned down the first two items, but as the epochs increase during training, the RAM usage still keeps growing. What is the reason for this?
The reason the memory grows over time is that the TFDS data loader fills buffers to optimize speed -- this is expected. It will eventually plateau but if it maxes out your memory before plateauing you can consider further reducing the parameters above.
My RAM size is 128 GB with two A6000 GPUs, and shuffle_buffer_size is 500000. RAM is not enough even when testing with the droid_100 dataset. I would like to ask how much RAM is needed for a shuffle_buffer_size of 500000.
We train on machines with 300+GB of RAM, but it should be safe to reduce shuffle buffer to 100k, can you test with that?
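A rough way to gauge what a given shuffle buffer costs is to multiply the buffer size by the in-memory size of one buffered sample; the sketch below is purely a back-of-envelope estimate, and the per-sample size is an assumed illustration since it depends on image resolution and which observation keys you load:

```python
# Back-of-envelope estimate of shuffle-buffer RAM, not a measurement.
shuffle_buffer_size = 500_000
bytes_per_sample = 0.25 * 1024**2   # ASSUMPTION: ~0.25 MB per buffered sample

estimated_gb = shuffle_buffer_size * bytes_per_sample / 1024**3
print(f"~{estimated_gb:.0f} GB just for the shuffle buffer")  # ~122 GB under this assumption
```

Under the same assumed per-sample size, a 100k buffer would be roughly 24 GB, which is consistent with it fitting into 128 GB alongside the rest of training.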
I can run it locally with a 100k buffer size. But it's still strange that when I deploy it to the server, the model cannot be loaded onto the GPU and an error is reported directly. https://github.com/droid-dataset/droid_policy_learning/issues/1#issuecomment-2027944163
@lijinming2018 Hi, did you run into the problem I'm facing right now? Check #10.
```
Check that:
  ... tfds-nightly ...
Did you mean: droid -> drop ?
The builder directory droid/droid doesn't contain any versions. No builder could be found in the directory: ./droid for the builder: droid. No registered data_dirs were found in:
```
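This error usually means TFDS cannot find a built version directory under the `droid` folder. A small sketch to check the expected layout is below (the version string `1.0.0` is an assumption — use whatever version directory your download created):

```python
from pathlib import Path

# Placeholder: the data directory passed to the loader (DATA_PATH in the config).
data_dir = Path("/path/to/data")

droid_dir = data_dir / "droid"
print(droid_dir.exists())
if droid_dir.exists():
    print(sorted(p.name for p in droid_dir.iterdir()))
# TFDS expects something like /path/to/data/droid/1.0.0/ containing
# dataset_info.json, features.json, and the tfrecord shards.
```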