droid-dataset / droid_policy_learning

DROID Policy Learning and Evaluation
MIT License

Check that: - if dataset was added recently, it may only be available in `tfds-nightly` - the dataset name is spelled correctly - dataset class defines all base class abstract methods - the module defining the dataset class is imported Did you mean: droid -> drop ? The builder directory droid/droid doesn't contain any versions. No builder could be found in the directory: ./droid for the builder: droid. No registered data_dirs were found in: - ./droid #1

Open Jinming-Li opened 8 months ago

Jinming-Li commented 8 months ago

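Check that:

- if dataset was added recently, it may only be available in `tfds-nightly`
- the dataset name is spelled correctly
- dataset class defines all base class abstract methods
- the module defining the dataset class is imported

Did you mean: droid -> drop ?

The builder directory droid/droid doesn't contain any versions. No builder could be found in the directory: ./droid for the builder: droid. No registered data_dirs were found in:

- ./droid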

kpertsch commented 8 months ago

Please make sure that you have actually downloaded the DROID dataset per our instructions in Preprocessing Datasets and that you have changed DATA_PATH to the directory where you downloaded it. Also note that if you downloaded droid_100 instead of the full droid dataset, you need to rename its folder to droid for things to work out of the box. TFDS will search in DATA_PATH for a folder called droid.
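A quick way to check the folder layout before launching training is sketched below. It assumes the usual TFDS on-disk convention of a builder folder that contains one subfolder per version (for example `1.0.0`); the function name `find_tfds_versions` and the example path are hypothetical and not part of the repo.

```python
import re
from pathlib import Path

# TFDS version folders are named like "1.0.0" (assumed convention).
VERSION_RE = re.compile(r"^\d+\.\d+\.\d+$")


def find_tfds_versions(data_path, name="droid"):
    """Return the version subfolders found in <data_path>/<name>.

    An empty list means TFDS would likely report that the builder
    directory is missing or does not contain any versions.
    """
    builder_dir = Path(data_path) / name
    if not builder_dir.is_dir():
        return []
    return sorted(
        p.name
        for p in builder_dir.iterdir()
        if p.is_dir() and VERSION_RE.match(p.name)
    )


if __name__ == "__main__":
    # Hypothetical location; replace with your own DATA_PATH.
    print(find_tfds_versions("./data"))
```

If this returns an empty list while a `droid_100` folder sits next to it, renaming that folder to `droid` as described above is the first thing to try.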

Jinming-Li commented 8 months ago

When I run the code, I get:

To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-30 14:42:20.693664: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
An NVIDIA GPU may be present on this machine, but a CUDA-enabled jaxlib is not installed. Falling back to cpu.
Traceback (most recent call last):
  File "/data/private/ljm/droid_policy_learning/robomimic/scripts/train.py", line 37, in <module>
    import robomimic.utils.train_utils as TrainUtils
  File "/data/private/ljm/droid_policy_learning/robomimic/utils/train_utils.py", line 22, in <module>
    import robomimic.utils.file_utils as FileUtils
  File "/data/private/ljm/droid_policy_learning/robomimic/utils/file_utils.py", line 20, in <module>
    from robomimic.algo import algo_factory
  File "/data/private/ljm/droid_policy_learning/robomimic/algo/__init__.py", line 12, in <module>
    from robomimic.algo.diffusion_policy import DiffusionPolicyUNet
  File "/data/private/ljm/droid_policy_learning/robomimic/algo/diffusion_policy.py", line 35, in <module>
    lang_model.to('cuda')
  File "/opt/conda/envs/droid_policy_learning/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2179, in to
    return super().to(*args, **kwargs)
  File "/opt/conda/envs/droid_policy_learning/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1145, in to
    return self._apply(convert)
  File "/opt/conda/envs/droid_policy_learning/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/opt/conda/envs/droid_policy_learning/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/opt/conda/envs/droid_policy_learning/lib/python3.10/site-packages/torch/nn/modules/module.py", line 820, in _apply
    param_applied = fn(param)
  File "/opt/conda/envs/droid_policy_learning/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1143, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: device kernel image is invalid
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

kpertsch commented 8 months ago

It seems that your torch installation does not work with CUDA, which is likely an issue with how you installed torch and not with the droid_policy_learning repo. Please check whether you can open a Python session and run the following without error:

import torch
torch.cuda.is_available()

If not, please debug your torch installation first.
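For a slightly more detailed check, the sketch below collects a few facts about the torch build and the visible GPU. One common cause of "device kernel image is invalid" is a torch build that was not compiled for the GPU's compute capability, so comparing the device capability with the compiled architecture list can help. The helper name `cuda_report` is hypothetical; the torch calls used are standard ones.

```python
def cuda_report():
    """Collect basic facts about the local torch/CUDA setup in a dict."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}

    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # None here means a CPU-only torch build.
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        report["device_name"] = torch.cuda.get_device_name(0)
        # (major, minor) compute capability of GPU 0.
        report["device_capability"] = torch.cuda.get_device_capability(0)
        # Architectures this torch build ships kernels for, e.g. "sm_86".
        report["compiled_arch_list"] = torch.cuda.get_arch_list()
        try:
            # Same kind of operation that failed in the traceback above.
            torch.ones(1).to("cuda")
            report["tensor_to_cuda"] = "ok"
        except RuntimeError as err:
            report["tensor_to_cuda"] = f"failed: {err}"
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If `device_capability` does not correspond to any entry in `compiled_arch_list`, reinstalling a torch build that matches the GPU and driver is a reasonable next step.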

Jinming-Li commented 8 months ago

When I run the training code, the process is often killed after two epochs because the machine runs out of RAM. Does memory usage keep growing while the program runs, beyond what the small random experience replay set takes?

kpertsch commented 8 months ago

If you're running low on memory you can try the following:

The first two will make your data loading slower. The third may change the training dynamics if you make the shuffle buffer much smaller, so be careful with that.

Jinming-Li commented 8 months ago

Thanks. I reduced the first two, but memory usage still keeps growing as the epochs progress. What is the reason for this?

kpertsch commented 8 months ago

The reason the memory grows over time is that the TFDS data loader fills buffers to optimize speed; this is expected. It will eventually plateau, but if it maxes out your memory before plateauing, you can consider further reducing the parameters above.
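The fill-then-plateau behavior can be illustrated with a toy streaming shuffle buffer in pure Python. This is only a sketch of the general technique, not the TFDS implementation; the function name and the occupancy reporting are made up for illustration.

```python
import random


def shuffle_buffer(stream, buffer_size, seed=0):
    """Yield (item, occupancy) pairs from a streaming shuffle buffer.

    The buffer first fills up to buffer_size items (memory grows), then
    stays at that size while items are swapped in and out (memory
    plateaus), and finally drains once the stream is exhausted.
    """
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)  # filling phase: nothing is emitted yet
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx], len(buffer)  # emit a random buffered item
        buffer[idx] = item  # replace it with the incoming item
    rng.shuffle(buffer)
    while buffer:
        occupancy = len(buffer)
        yield buffer.pop(), occupancy  # draining phase


if __name__ == "__main__":
    for item, occupancy in shuffle_buffer(range(10), buffer_size=4):
        print(item, occupancy)
```

The occupancy never exceeds `buffer_size`, which is why reducing the shuffle buffer lowers the level at which memory plateaus.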

Jinming-Li commented 8 months ago

My machine has 128 GB of RAM and two A6000 GPUs. With shuffle_buffer_size set to 500000, RAM runs out even when testing with the droid_100 dataset. How much RAM is needed for a shuffle_buffer_size of 500000?

kpertsch commented 8 months ago

We train on machines with 300+ GB of RAM, but it should be safe to reduce the shuffle buffer to 100k. Can you test with that?
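As a rough back-of-the-envelope aid, the buffer's footprint scales as buffer size times bytes per sample. The sketch below uses a purely hypothetical sample of two 128x128 RGB uint8 images; the real per-sample size in this repo depends on the configured observations and is not stated in this thread, and loader overhead and other buffers come on top.

```python
def shuffle_buffer_gib(buffer_size, bytes_per_sample):
    """Approximate shuffle buffer footprint in GiB (raw data only)."""
    return buffer_size * bytes_per_sample / 2**30


# Hypothetical sample: two 128x128 RGB uint8 camera images.
# This is an assumption for illustration, not the repo's actual sample size.
HYPOTHETICAL_SAMPLE_BYTES = 2 * 128 * 128 * 3

if __name__ == "__main__":
    for size in (500_000, 100_000):
        gib = shuffle_buffer_gib(size, HYPOTHETICAL_SAMPLE_BYTES)
        print(f"shuffle_buffer_size={size}: about {gib:.1f} GiB")
```

Under this assumption, 500k samples would take roughly 46 GiB and 100k roughly 9 GiB of raw image data, which illustrates why shrinking the buffer by 5x frees a large share of RAM.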

Jinming-Li commented 8 months ago

I can run it locally with a 100k buffer size. Strangely, though, when I deploy the code to the server, the model cannot be moved to the GPU and an error is raised immediately: https://github.com/droid-dataset/droid_policy_learning/issues/1#issuecomment-2027944163

CRLqinliang commented 7 months ago

@lijinming2018 Hi, friend. Did you run into the problem I'm facing right now? See #10.