aidudezzz / deepworlds

Examples and use cases using the deepbots framework (https://github.com/aidudezzz/deepbots) with the Webots robot simulator.
GNU General Public License v3.0

Training results not saved or files cannot be accessed #106

Closed: wayne-weiwei closed this issue 1 month ago

wayne-weiwei commented 1 month ago

Hi there, thank you for providing such a great tool! I successfully trained the find_and_avoid_v2 model and saw several results during training. However, none of these results seem to be saved to a file, or I may be looking in the wrong place and cannot find the generated files.

I would really appreciate your help with the following:

  1. Are the training results supposed to be automatically saved into a file?
  2. If so, where can I find these files, or how should I access them?
  3. Is there any specific configuration I need to enable to ensure the results are saved?

Thank you so much in advance for your assistance, and I’m looking forward to your guidance!

tsampazk commented 1 month ago

Hey there @wayne-weiwei!

What kind of results do you expect to be saved? A large number of metrics are logged with tensorboard for find_and_avoid_v2. Take a look at the relevant README section.

Let me know if this covers your questions or you have any additional ones.
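A quick way to check whether anything was logged at all is to look for tensorboard event files on disk. The sketch below is only an illustration: it assumes the logs use tensorboard's usual events.out.tfevents.* file naming and live under the experiments directory that the README points tensorboard at. The helper name find_event_files is made up for this example.

```python
from pathlib import Path


def find_event_files(logdir):
    """Return a sorted list of tensorboard event files found under logdir.

    Assumes the usual 'events.out.tfevents.*' naming; returns an empty
    list if the directory does not exist.
    """
    root = Path(logdir)
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob("events.out.tfevents.*") if p.is_file())


# Example usage, run from the robot_supervisor_manager controller directory:
# for path in find_event_files("./experiments"):
#     print(path)
```

If this prints nothing, training most likely did not write logs where tensorboard is looking, and the path is the first thing to check.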

wayne-weiwei commented 1 month ago

Thanks for the reply. I followed this instruction from the README:

Tensorboard is used for logging various aspects of the training procedure. To watch the tensorboard logs, navigate to /deepworlds/examples/find_and_avoid_v2/controllers/robot_supervisor_manager and run tensorboard --logdir ./experiments/.

But I got this error:

~/webots/projects/deepworlds-dev/examples/find_and_avoid_v2/controllers/robot_supervisor_manager$ tensorboard --logdir ./experiments/
2024-10-03 15:06:50.490971: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-10-03 15:06:50.975236: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-10-03 15:06:51.511356: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-10-03 15:06:51.542694: W tensorflow/core/common_runtime/gpu/gpu_device.cc:2251] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
W1003 15:06:51.564800 135249606198336 server_ingester.py:187] Failed to communicate with data server at localhost:40979: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:130.209.6.40:8080: HTTP proxy returned response code 503"
    debug_error_string = "UNKNOWN:Error received from peer  {grpc_message:"failed to connect to all addresses; last error: UNKNOWN: ipv4:130.209.6.40:8080: HTTP proxy returned response code 503", grpc_status:14, created_time:"2024-10-03T15:06:51.564631055+01:00"}"
>
Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.16.2 at http://localhost:6006/ (Press CTRL+C to quit)

Thank you again for your help.

tsampazk commented 1 month ago

Hmm, it seems you have several warnings that look system-related. The last line indicates that tensorboard is indeed running at http://localhost:6006/. What happens when you visit that URL while tensorboard is running?

The last warning shows an error related to your network. If visiting the URL doesn't work, try running tensorboard --logdir ./experiments/ --host localhost --port 8088.
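The 503 in the log comes from an HTTP proxy (130.209.6.40:8080) answering a connection that was meant for localhost. One possible workaround, sketched below, is to launch tensorboard with loopback addresses excluded from proxying. This assumes the proxy is configured through environment variables and that the gRPC client honours no_proxy; the function names are hypothetical, and this is not necessarily the system setting that fixed it in this thread.

```python
import os
import subprocess


def env_without_local_proxy(base_env=None):
    """Return a copy of the environment where loopback hosts bypass the proxy."""
    env = dict(os.environ if base_env is None else base_env)
    local_hosts = ["localhost", "127.0.0.1"]
    for key in ("no_proxy", "NO_PROXY"):
        # Keep any hosts already excluded, then append the loopback names.
        hosts = [h.strip() for h in env.get(key, "").split(",") if h.strip()]
        for host in local_hosts:
            if host not in hosts:
                hosts.append(host)
        env[key] = ",".join(hosts)
    return env


def launch_tensorboard(logdir="./experiments/", host="localhost", port=8088):
    """Start tensorboard with the flags suggested above and the adjusted environment."""
    cmd = ["tensorboard", "--logdir", logdir, "--host", host, "--port", str(port)]
    return subprocess.run(cmd, env=env_without_local_proxy(), check=False)


# Example usage (blocks until tensorboard is stopped with CTRL+C):
# launch_tensorboard()
```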

wayne-weiwei commented 1 month ago

Thank you very much for your help. I was able to obtain the final results of the model after modifying the system settings. Could you please let me know how I can record a video of the better cases during the model's training and testing?

tsampazk commented 1 month ago

Happy to help @wayne-weiwei! For recording, you can use Webots' built-in recording tool. Check the documentation here.
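If you would rather start and stop recording from code than from the Webots GUI, the Supervisor API also has movie functions. The following is a rough sketch rather than anything from the deepworlds examples: record_episode and run_episode are hypothetical names, and it assumes the object you pass in exposes the Webots Supervisor methods movieStartRecording, movieStopRecording, movieIsReady, movieFailed, getBasicTimeStep and step.

```python
def record_episode(supervisor, run_episode, filename,
                   width=1280, height=720, quality=90):
    """Record one episode to `filename` and return (reward, saved_ok).

    `supervisor` is assumed to behave like a Webots Supervisor.
    `run_episode` is a hypothetical callable that plays one episode
    and returns its total reward.
    """
    timestep = int(supervisor.getBasicTimeStep())
    # Arguments: filename, width, height, codec (ignored by Webots),
    # quality (1-100), acceleration (1 = real time), caption (off).
    supervisor.movieStartRecording(filename, width, height, 0, quality, 1, False)
    try:
        reward = run_episode()
    finally:
        supervisor.movieStopRecording()
    # Encoding finishes in the background, so keep stepping the simulation
    # until Webots reports that the movie is done or has failed.
    while not supervisor.movieIsReady() and not supervisor.movieFailed():
        if supervisor.step(timestep) == -1:
            break
    return reward, not supervisor.movieFailed()
```

To keep only the better cases, one option is to record every test episode to its own file and delete the files whose returned reward falls below a threshold you choose.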