train - Githubissues

henry1-0 commented 2 months ago

Hello, when I run the training code python3 train.py --config ./config/detectiondiffusion.py, it shows:

(affpose) henry@yh:~/project/Language-Conditioned-Affordance-Pose-Detection-in-3D-Point-Clouds-main$ python3 train.py --config ./config/detectiondiffusion.py
Traceback (most recent call last):
  File "/home/henry/anaconda3/envs/affpose/lib/python3.8/site-packages/urllib3/connection.py", line 196, in _new_conn
    sock = connection.create_connection(
  File "/home/henry/anaconda3/envs/affpose/lib/python3.8/site-packages/urllib3/util/connection.py", line 85, in create_connection
    raise err
  File "/home/henry/anaconda3/envs/affpose/lib/python3.8/site-packages/urllib3/util/connection.py", line 73, in create_connection
    sock.connect(sa)
OSError: [Errno 101] Network is unreachable

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/henry/anaconda3/envs/affpose/lib/python3.8/site-packages/urllib3/connectionpool.py", line 789, in urlopen
    response = self._make_request(
  File "/home/henry/anaconda3/envs/affpose/lib/python3.8/site-packages/urllib3/connectionpool.py", line 490, in _make_request
    raise new_e
  File "/home/henry/anaconda3/envs/affpose/lib/python3.8/site-packages/urllib3/connectionpool.py", line 466, in _make_request
    self._validate_conn(conn)
  File "/home/henry/anaconda3/envs/affpose/lib/python3.8/site-packages/urllib3/connectionpool.py", line 1095, in _validate_conn
    conn.connect()
  File "/home/henry/anaconda3/envs/affpose/lib/python3.8/site-packages/urllib3/connection.py", line 615, in connect
    self.sock = sock = self._new_conn()
  File "/home/henry/anaconda3/envs/affpose/lib/python3.8/site-packages/urllib3/connection.py", line 211, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7f0a118f27c0>: Failed to establish a new connection: [Errno 101] Network is unreachable

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/henry/anaconda3/envs/affpose/lib/python3.8/site-packages/requests/adapters.py", line 667, in send
    resp = conn.urlopen(
  File "/home/henry/anaconda3/envs/affpose/lib/python3.8/site-packages/urllib3/connectionpool.py", line 843, in urlopen
    retries = retries.increment(
  File "/home/henry/anaconda3/envs/affpose/lib/python3.8/site-packages/urllib3/util/retry.py", line 519, in increment
    raise MaxRetryError(_pool, url, reason) from reason # type: ignore[arg-type]
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /laion/CLIP-ViT-B-32-laion2B-s34B-b79K/resolve/main/open_clip_pytorch_model.bin (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f0a118f27c0>: Failed to establish a new connection: [Errno 101] Network is unreachable'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/henry/anaconda3/envs/affpose/lib/python3.8/site-packages/huggingface_hub/file_download.py", line 1751, in _get_metadata_or_catch_error
    metadata = get_hf_file_metadata(
  File "/home/henry/anaconda3/envs/affpose/lib/python3.8/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/home/henry/anaconda3/envs/affpose/lib/python3.8/site-packages/huggingface_hub/file_download.py", line 1673, in get_hf_file_metadata
    r = _request_wrapper(
  File "/home/henry/anaconda3/envs/affpose/lib/python3.8/site-packages/huggingface_hub/file_download.py", line 376, in _request_wrapper
    response = _request_wrapper(
  File "/home/henry/anaconda3/envs/affpose/lib/python3.8/site-packages/huggingface_hub/file_download.py", line 399, in _request_wrapper
    response = get_session().request(method=method, url=url, **params)
  File "/home/henry/anaconda3/envs/affpose/lib/python3.8/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/henry/anaconda3/envs/affpose/lib/python3.8/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/home/henry/anaconda3/envs/affpose/lib/python3.8/site-packages/huggingface_hub/utils/_http.py", line 66, in send
    return super().send(request, *args, **kwargs)
  File "/home/henry/anaconda3/envs/affpose/lib/python3.8/site-packages/requests/adapters.py", line 700, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /laion/CLIP-ViT-B-32-laion2B-s34B-b79K/resolve/main/open_clip_pytorch_model.bin (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f0a118f27c0>: Failed to establish a new connection: [Errno 101] Network is unreachable'))"), '(Request ID: 51fc0566-df8b-45c3-8975-b63baeb24a4d)')

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "train.py", line 4, in <module>
    from utils import *
  File "/home/henry/project/Language-Conditioned-Affordance-Pose-Detection-in-3D-Point-Clouds-main/utils/__init__.py", line 1, in <module>
    from .builder import build_optimizer, build_dataset, build_loader, build_model
  File "/home/henry/project/Language-Conditioned-Affordance-Pose-Detection-in-3D-Point-Clouds-main/utils/builder.py", line 4, in <module>
    from models import *
  File "/home/henry/project/Language-Conditioned-Affordance-Pose-Detection-in-3D-Point-Clouds-main/models/__init__.py", line 1, in <module>
    from .main_nets import DetectionDiffusion
  File "/home/henry/project/Language-Conditioned-Affordance-Pose-Detection-in-3D-Point-Clouds-main/models/main_nets.py", line 8, in <module>
    text_encoder = TextEncoder(device=torch.device('cuda'))
  File "/home/henry/project/Language-Conditioned-Affordance-Pose-Detection-in-3D-Point-Clouds-main/models/components.py", line 39, in __init__
    self.clipmodel, _, _ = open_clip.create_model_and_transforms("ViT-B-32", pretrained="laion2b_s34b_b79k",
  File "/home/henry/anaconda3/envs/affpose/lib/python3.8/site-packages/open_clip/factory.py", line 399, in create_model_and_transforms
    model = create_model(
  File "/home/henry/anaconda3/envs/affpose/lib/python3.8/site-packages/open_clip/factory.py", line 298, in create_model
    checkpoint_path = download_pretrained(pretrained_cfg, cache_dir=cache_dir)
  File "/home/henry/anaconda3/envs/affpose/lib/python3.8/site-packages/open_clip/pretrained.py", line 653, in download_pretrained
    target = download_pretrained_from_hf(model_id, cache_dir=cache_dir)
  File "/home/henry/anaconda3/envs/affpose/lib/python3.8/site-packages/open_clip/pretrained.py", line 623, in download_pretrained_from_hf
    cached_file = hf_hub_download(model_id, filename, revision=revision, cache_dir=cache_dir)
  File "/home/henry/anaconda3/envs/affpose/lib/python3.8/site-packages/huggingface_hub/utils/_deprecation.py", line 101, in inner_f
    return f(*args, **kwargs)
  File "/home/henry/anaconda3/envs/affpose/lib/python3.8/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/home/henry/anaconda3/envs/affpose/lib/python3.8/site-packages/huggingface_hub/file_download.py", line 1240, in hf_hub_download
    return _hf_hub_download_to_cache_dir(
  File "/home/henry/anaconda3/envs/affpose/lib/python3.8/site-packages/huggingface_hub/file_download.py", line 1347, in _hf_hub_download_to_cache_dir
    _raise_on_head_call_error(head_call_error, force_download, local_files_only)
  File "/home/henry/anaconda3/envs/affpose/lib/python3.8/site-packages/huggingface_hub/file_download.py", line 1857, in _raise_on_head_call_error
    raise LocalEntryNotFoundError(
huggingface_hub.utils._errors.LocalEntryNotFoundError: An error happened while trying to locate the file on the Hub and we cannot find the requested files in the local cache. Please check your connection and try again or make sure your Internet connection is on.

As I understand it, this error means that huggingface_hub could not find the required file in its local cache. When the file is not cached, it tries to download it from the Hugging Face Hub, and that download can fail for various reasons (for example network issues or a wrong model name). Could you tell me what I should do? Thank you.
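One possible workaround when the training machine cannot reach huggingface.co is to download open_clip_pytorch_model.bin on another machine and pass its path as the pretrained argument, since open_clip accepts a checkpoint file path there as well as a tag. The following is only a sketch under assumptions: the local path and both helper names (resolve_pretrained, load_clip) are hypothetical and are not part of the repository.

```python
import os

# Hypothetical location of a manually downloaded checkpoint; adjust to your setup.
LOCAL_CKPT = os.path.expanduser("~/checkpoints/open_clip_pytorch_model.bin")
# Pretrained tag taken from the traceback above.
HUB_TAG = "laion2b_s34b_b79k"


def resolve_pretrained(local_path=LOCAL_CKPT, hub_tag=HUB_TAG):
    """Return the local checkpoint path if the file exists, else the hub tag."""
    return local_path if os.path.isfile(local_path) else hub_tag


def load_clip(device="cpu"):
    """Build ViT-B-32, preferring a local checkpoint over a download."""
    import open_clip  # imported lazily so resolve_pretrained works without it

    model, _, _ = open_clip.create_model_and_transforms(
        "ViT-B-32", pretrained=resolve_pretrained(), device=device
    )
    return model
```

With the file in place, no request to the Hub should be needed for this model; without it, the call falls back to the tag and behaves as before.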

toannguyen1904 commented 2 months ago

Hi @henry1-0, it seems there is a problem with your internet connection. Potential solutions include:

Check your internet connection, for example by running ping google.com.
Check whether a firewall or proxy is in the way. If you are working in a restricted network environment, make sure you have set HTTP_PROXY and HTTPS_PROXY.

Hope this helps.

Best regards, Toan.
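The two checks suggested above can also be done from Python inside the same conda environment. This is a minimal sketch using only the standard library; the function names are hypothetical, and the TCP check tests a direct connection only, so it can fail behind a proxy even when proxied downloads would work.

```python
import os
import socket


def proxy_settings(env=None):
    """Collect non-empty proxy variables (either case) from an environment mapping."""
    env = os.environ if env is None else env
    names = ("HTTP_PROXY", "HTTPS_PROXY", "http_proxy", "https_proxy")
    return {name: env[name] for name in names if env.get(name)}


def can_reach(host="huggingface.co", port=443, timeout=5.0):
    """Return True if a direct TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers DNS failures, timeouts and "Network is unreachable"
        return False
```

Calling proxy_settings() shows which proxy variables the process actually sees, and can_reach() reports whether huggingface.co:443 is reachable without a proxy.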

henry1-0 commented 2 months ago

Hello, thank you very much for your reply. Following your suggestion, I have now run the model successfully, but the visualization shows only a blank page with no results. I have not found the cause yet. Could you help with this?

toannguyen1904 commented 2 months ago

@henry1-0, Have you generated the result.pkl file?

henry1-0 commented 2 months ago

Yes, I have generated the 'result.pkl' file and I looked at the output of this file.

toannguyen1904 commented 2 months ago

Hi @henry1-0. It looks like the model is underperforming. How many epochs did you train it for? Currently, some pose values are too high while others are too low. This makes the camera zoom out extremely far from the point cloud object, which causes the object to "disappear" and produces the blank window you are seeing.
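A quick way to check whether outlying poses are stretching the scene is to compare the bounding box of the point cloud alone with the bounding box of the cloud plus the pose positions. The sketch below assumes the pose positions have already been extracted as an (N, 3) array; the layout of result.pkl is not shown in this thread, so that extraction is left out, and the function name is hypothetical.

```python
import numpy as np


def scene_extent_ratio(points, translations):
    """Diagonal of the (cloud + poses) bounding box divided by the cloud's diagonal."""
    points = np.asarray(points, dtype=float)              # (P, 3) object point cloud
    translations = np.asarray(translations, dtype=float)  # (N, 3) pose positions
    everything = np.vstack([points, translations])
    cloud_diag = np.linalg.norm(points.max(axis=0) - points.min(axis=0))
    scene_diag = np.linalg.norm(everything.max(axis=0) - everything.min(axis=0))
    return scene_diag / max(cloud_diag, 1e-12)
```

A ratio near 1 means the poses sit around the object; a ratio in the hundreds would be consistent with the object shrinking to a single dot if the viewer frames the whole scene.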

henry1-0 commented 2 months ago

Yes. I saw that your original configuration trains for 200 epochs; I reduced it to 150 epochs and set batch_size to 8 for training. The batch_size in your original configuration was 2.

henry1-0 commented 2 months ago

Hello, I will train it again according to the parameters of your original model, and then perform a visualization to see if it still has the same problem.

toannguyen1904 commented 2 months ago

Hi @henry1-0,

Sorry for the mistake, which possibly occurred while we were editing for our commit. The batch_size should be 32 (as stated in our paper); a batch size of 8 is insufficient to reach the desired performance. Training the model with our settings should fix the blank screen.

That said, if you want a quick modification to inspect the rendering results of your current model, consider using torch.clamp() to limit the range of the predicted noise, as we have recently updated here. Clamping the noise prediction is a common technique in the inference process of diffusion models, although we found that our method does not need it when trained properly. Note that this approach requires you to regenerate the result.pkl file.
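To illustrate where such a clamp sits, here is a sketch of one textbook DDPM reverse step with the predicted noise clamped before it enters the update. It is not the repository's sampling code: the function name, the clamp range of 1.0, and the choice of beta_t as the sampling variance are all assumptions made for the example.

```python
import torch


def reverse_step(x_t, eps_pred, t, betas, clamp_range=1.0):
    """One DDPM reverse step with the predicted noise clamped to [-r, r]."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    # Bound the network output so a badly trained model cannot blow up the sample.
    eps_pred = torch.clamp(eps_pred, -clamp_range, clamp_range)
    coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
    mean = (x_t - coef * eps_pred) / torch.sqrt(alphas[t])
    if t == 0:
        return mean  # no noise is added at the final step
    return mean + torch.sqrt(betas[t]) * torch.randn_like(x_t)
```

Because the clamp changes the sampled poses, any result.pkl produced before adding it no longer reflects the modified sampler.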

If the error persists, consider removing poses whose values exceed a specific threshold before showing the scene.

Best regards, Toan.
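The threshold-based filtering mentioned above could look roughly like this. It is a sketch only: it assumes the predicted poses are available as a 2-D array with one pose per row, and both the function name and the default threshold of 2.0 are hypothetical values to be tuned against the scale of your point clouds.

```python
import numpy as np


def drop_outlier_poses(poses, max_abs=2.0):
    """Keep only poses whose every component lies within [-max_abs, max_abs]."""
    poses = np.asarray(poses, dtype=float)
    # NaN components compare as False, so poses containing NaN are dropped too.
    keep = np.all(np.abs(poses) <= max_abs, axis=1)
    return poses[keep]
```

Only the surviving rows would then be turned into gripper geometry and added to the scene.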

toannguyen1904 commented 2 months ago

You can also execute this simple code snippet to make sure that the rendering of trimesh works properly.

import trimesh
import numpy as np

# Number of points in the point cloud
num_points = 1000

# Generate random points in 3D space
points = np.random.rand(num_points, 3)

# Create a Trimesh PointCloud object
point_cloud = trimesh.points.PointCloud(points)

# Set the color of the points to blue (RGBA format)
# Blue color in RGBA is (0, 0, 255, 255)
blue_color = np.array([[0, 0, 255, 255]] * num_points)
point_cloud.colors = blue_color
scene = trimesh.Scene([point_cloud])
scene.show()

henry1-0 commented 2 months ago

Hello, after running this small piece of code, trimesh renders normally. As for the batch_size you mentioned, I cannot set it to 32 because of my computer's hardware limits. I will try training with 16 to see the effect.

Best regards, Henry.

henry1-0 commented 2 months ago

Hello, as you suggested, I added torch.clamp(); I did not retrain and only regenerated the result.pkl file. In the resulting visualizations, only a single point is visible in each result, as shown in the figure.

toannguyen1904 commented 2 months ago

This looks very strange. Please render only the point cloud to see whether it works. To do this, change this line to scene = trimesh.Scene([point_cloud]).

henry1-0 commented 2 months ago

Hello, as you suggested, after I changed the code in the visualize.py file, the visualization shows the point cloud, as shown in the picture.

henry1-0 commented 2 months ago

Hi, regarding this issue, a possible cause is that I set batch_size too small (8 during training), which led to this result. However, because of my computer's hardware, I really cannot set batch_size to 32.

Best regards, Henry

joycelimxy commented 2 months ago

Hi @toannguyen1904, thank you for the great work.

I used the default parameters in detectiondiffusion.py with your dataset (full_shape_release.pkl) to train a model by directly calling python3 train.py --config ./config/detectiondiffusion.py

However, I'm unable to get a good result from the trained model, as the visualization for the result.pkl also shows a single dot, as experienced by @henry1-0. Upon zooming in, the grasp poses generated by the model (red) are not in contact with the object at all. The object (black point cloud) is a Vase object from your dataset.

[Screenshot from 2024-09-25 15-09-30]

Would you be able to advise or share your weights file please?

Thank you.

hanyueling commented 1 month ago

Hello, I have encountered the same problem. I can only see a single point in the visualization results, and after zooming in I also see many grasp poses. Have you solved this problem?

joycelimxy commented 1 month ago

Hi @hanyueling, unfortunately I have not managed to solve this issue.

toannguyen1904 commented 1 month ago

Hi everyone, I've just updated the code.

The previous version contained a mistake: the sinusoidal positional encoding in it was not the one we used, and it did not work properly. The time step should be encoded using an MLP, as in the updated code. Note that with the new code you no longer need the clamp() function; the results should be much better, and the gripper poses can now be observed clearly. Also note that, to obtain a decent visualization, you should keep only poses that are close to the object, given a distance threshold.
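For readers unfamiliar with the idea, encoding the time step with an MLP can be sketched as follows. This is an illustrative module, not the one in the updated repository: the class name, the layer sizes, the SiLU activation, and the normalisation by the number of steps are all assumptions.

```python
import torch
import torch.nn as nn


class TimeMLP(nn.Module):
    """Map an integer diffusion step to a learned embedding vector."""

    def __init__(self, num_steps=1000, dim=64):
        super().__init__()
        self.num_steps = num_steps
        self.net = nn.Sequential(
            nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim)
        )

    def forward(self, t):
        # Scale steps to [0, 1) so the first layer sees well-conditioned input.
        t = t.float().unsqueeze(-1) / self.num_steps
        return self.net(t)
```

For a batch of steps of shape (B,), the module returns a (B, dim) tensor that the denoising network can combine with its other conditioning inputs.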

As for the pretrained weights, I will release them as soon as possible. However, I am not sure when they will be available, as our company's policy is quite restrictive. In any case, retraining the model with the updated code should work well.

Warm regards, Toan.
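The advice above to keep only poses close to the object can be sketched as a nearest-point distance test. The example assumes the pose positions are available as an (N, 3) array in the same coordinate frame as the point cloud; the function name and the default threshold of 0.1 are hypothetical and depend on the scale of the data.

```python
import numpy as np


def poses_near_object(points, translations, max_dist=0.1):
    """Boolean mask of poses whose position lies within max_dist of the cloud."""
    points = np.asarray(points, dtype=float)              # (P, 3)
    translations = np.asarray(translations, dtype=float)  # (N, 3)
    # Pairwise differences of shape (N, P, 3); fine for a few thousand points.
    diffs = translations[:, None, :] - points[None, :, :]
    nearest = np.linalg.norm(diffs, axis=2).min(axis=1)
    return nearest <= max_dist
```

The mask can be used to index the pose array before building the gripper meshes for the scene.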

Fsoft-AIC / Language-Conditioned-Affordance-Pose-Detection-in-3D-Point-Clouds

train #3