droid-dataset / droid_policy_learning

DROID Policy Learning and Evaluation

No builder could be found in the directory: ./droid for the builder: droid #1

lijinming2018 opened this issue 3 months ago

lijinming2018 commented 3 months ago

Check that:

- if dataset was added recently, it may only be available in `tfds-nightly`
- the dataset name is spelled correctly
- dataset class defines all base class abstract methods
- the module defining the dataset class is imported

Did you mean: droid -> drop ?

The builder directory droid/droid doesn't contain any versions.
No builder could be found in the directory: ./droid for the builder: droid.
No registered data_dirs were found in:
- ./droid

kpertsch commented 3 months ago

Please make sure that you have actually downloaded the DROID dataset per our instructions in Preprocessing Datasets and that you have changed DATA_PATH to the directory where you downloaded it. Also note that if you downloaded droid_100 instead of the full droid dataset, you need to rename its folder to droid for things to work out of the box. TFDS will search in DATA_PATH for a folder called droid.
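For illustration, here is a minimal Python sketch of checking that layout; DATA_PATH and the example version folder are assumptions, not values taken from this repo:

import os

DATA_PATH = "/path/to/your/download"  # hypothetical path; set this to where you downloaded DROID
# TFDS looks inside DATA_PATH for a folder literally named "droid" that contains a
# version subfolder (e.g. droid/1.0.0) with the dataset files.
print(os.listdir(DATA_PATH))
# If you downloaded the droid_100 subset instead, rename its folder so the loader can find it:
# os.rename(os.path.join(DATA_PATH, "droid_100"), os.path.join(DATA_PATH, "droid"))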

lijinming2018 commented 3 months ago

When I run the code, I get:
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-30 14:42:20.693664: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
An NVIDIA GPU may be present on this machine, but a CUDA-enabled jaxlib is not installed. Falling back to cpu.
Traceback (most recent call last):
  File "/data/private/ljm/droid_policy_learning/robomimic/scripts/train.py", line 37, in <module>
    import robomimic.utils.train_utils as TrainUtils
  File "/data/private/ljm/droid_policy_learning/robomimic/utils/train_utils.py", line 22, in <module>
    import robomimic.utils.file_utils as FileUtils
  File "/data/private/ljm/droid_policy_learning/robomimic/utils/file_utils.py", line 20, in <module>
    from robomimic.algo import algo_factory
  File "/data/private/ljm/droid_policy_learning/robomimic/algo/__init__.py", line 12, in <module>
    from robomimic.algo.diffusion_policy import DiffusionPolicyUNet
  File "/data/private/ljm/droid_policy_learning/robomimic/algo/diffusion_policy.py", line 35, in <module>
    lang_model.to('cuda')
  File "/opt/conda/envs/droid_policy_learning/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2179, in to
    return super().to(*args, **kwargs)
  File "/opt/conda/envs/droid_policy_learning/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1145, in to
    return self._apply(convert)
  File "/opt/conda/envs/droid_policy_learning/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/opt/conda/envs/droid_policy_learning/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/opt/conda/envs/droid_policy_learning/lib/python3.10/site-packages/torch/nn/modules/module.py", line 820, in _apply
    param_applied = fn(param)
  File "/opt/conda/envs/droid_policy_learning/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1143, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: device kernel image is invalid
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

kpertsch commented 3 months ago

It seems that your torch installation does not work with CUDA, which is likely an issue with how you installed torch and not with the droid_policy_learning repo. Please check whether you can open a Python session and run the following without error:

import torch
torch.cuda.is_available()

If not, please debug your torch installation first.
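For a fuller diagnosis, a short sketch along these lines (standard PyTorch calls, nothing specific to this repo) shows whether the installed wheel matches the GPU:

import torch

# Which torch build is installed and which CUDA toolkit it was compiled against.
print("torch version:", torch.__version__)
print("compiled against CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())

if torch.cuda.is_available():
    # "device kernel image is invalid" usually means the wheel was not built with
    # kernels for this GPU's architecture, so the compute capability is worth checking.
    print("device:", torch.cuda.get_device_name(0))
    print("compute capability:", torch.cuda.get_device_capability(0))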

lijinming2018 commented 3 months ago

When I run the training code, the process often gets killed after two rounds of training due to insufficient memory. I want to ask whether this part keeps increasing the memory used while the program runs, apart from the small random experience replay (shuffle buffer) set.

kpertsch commented 3 months ago

If you're running low on memory you can try the following:

- reduce the number of parallel read threads in the data loader
- reduce the number of parallel transform threads in the data loader
- reduce the shuffle_buffer_size

The first two will make your data loading slower; the third may change the training dynamics if you make the shuffle buffer much smaller, so be careful with that.
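As a minimal, generic tf.data sketch (not the repo's actual pipeline) of why these knobs trade speed for memory:

import tensorflow as tf

# Toy dataset standing in for the DROID frames; the real pipeline reads TFDS records.
ds = tf.data.Dataset.range(1_000_000)

# The shuffle buffer keeps up to buffer_size fully materialized samples in RAM, so
# shrinking it directly reduces memory at the cost of weaker shuffling.
ds = ds.shuffle(buffer_size=100_000)

# Parallel transforms and prefetching speed up loading but also hold extra samples in memory.
ds = ds.map(lambda x: x * 2, num_parallel_calls=2)
ds = ds.prefetch(2)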

lijinming2018 commented 3 months ago

Thanks, I reduced the first two items, but as the epochs increase during training, the amount of memory used is still increasing. What is the reason for this?

kpertsch commented 3 months ago

The reason the memory grows over time is that the TFDS data loader fills buffers to optimize speed -- this is expected. It will eventually plateau but if it maxes out your memory before plateauing you can consider further reducing the parameters above.

lijinming2018 commented 3 months ago

My RAM size is 128 GB with dual A6000 GPUs, and shuffle_buffer_size is 500000. RAM is not enough even when testing with the droid_100 dataset. I would like to ask how much RAM is needed to support a shuffle_buffer_size of 500000.

kpertsch commented 3 months ago

We train on machines with 300+ GB of RAM, but it should be safe to reduce the shuffle buffer to 100k -- can you test with that?
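As a rough back-of-the-envelope illustration (the per-sample size is an assumption, not a measured DROID value), the buffer's RAM footprint scales linearly with shuffle_buffer_size:

# Assumed average size of one decoded, buffered sample (images + proprio); adjust to
# your actual observation resolution and keys.
bytes_per_sample = 200 * 1024  # ~200 KB, hypothetical

for buffer_size in (500_000, 100_000):
    gb = buffer_size * bytes_per_sample / 1e9
    print(f"shuffle_buffer_size={buffer_size}: ~{gb:.0f} GB for the buffer alone")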

lijinming2018 commented 3 months ago

I can run it locally with a 100k buffer size. But it's still strange that when I deploy it to the server, the model in the code cannot be moved to the GPU, and an error is reported directly: https://github.com/droid-dataset/droid_policy_learning/issues/1#issuecomment-2027944163

CRLqinliang commented 2 months ago

@lijinming2018 Hi, friend. Did you run into the same problem that I'm facing right now? Check this: #10