Closed AnandSingh-0619 closed 7 months ago
As discussed offline, using YOLO_sensor replicated the model for each environment, which was blowing up GPU VRAM usage. We decided to move the detector+segmentor code to the policy level, which is shared by all environments. This ensures the detector/segmentor is instantiated only once, which is a more practical setting.
Also, I shared the batch job script offline, which I think works?
You can add comments if you made any other changes as well. If not, please close the issue.
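For anyone reading this later, here is a minimal sketch of the idea of moving the detector to the policy level. It is not the actual code in the repo: `DummyDetector` and `PolicyWithSharedDetector` are hypothetical stand-ins, and the batched observation dict is only assumed to carry a `head_rgb` entry with one frame per environment.

```python
class DummyDetector:
    """Hypothetical stand-in for the detector+segmentor model."""

    instances_created = 0  # counts how many copies of the model exist

    def __init__(self):
        DummyDetector.instances_created += 1

    def predict(self, rgb_batch):
        # One call handles the frames from every environment at once.
        return [f"mask_for_{frame}" for frame in rgb_batch]


class PolicyWithSharedDetector:
    """Hypothetical policy that owns the single detector instance."""

    def __init__(self, detector):
        self.detector = detector

    def act(self, batched_obs):
        # batched_obs["head_rgb"] holds one frame per environment.
        masks = self.detector.predict(batched_obs["head_rgb"])
        # A real policy would feed the masks to its network here.
        return masks


# The detector is built once, no matter how many environments are running.
detector = DummyDetector()
policy = PolicyWithSharedDetector(detector)

batched_obs = {"head_rgb": ["env0_frame", "env1_frame", "env2_frame"]}
masks = policy.act(batched_obs)
```

With a per-environment sensor, the model would be constructed once per environment; here the number of model copies stays at one regardless of the environment count.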
https://github.com/AnandSingh-0619/home-robot/blob/79a7742ed4855482bf5cdd6a06429b3d5bea973a/projects/habitat_uncertainity/task/sensors.py#L105C1-L105C57
A YoloPerception object is created for each environment individually. Earlier I had an issue initializing this class, caused by an internal error in the Yoloworld class. That error is now resolved.
Another issue I am facing now is this: the observation space shared between the sensors in an env only has readings from the head_depth and head_panoptic sensors. I have now added head_rgb too. If YOLO_sensor produces masks, they cannot be shared with other sensors, because the observation space consists of simulator -> agents -> main_agent -> sim_sensors only. I have made some modifications to the mask generation code since our last discussion: I am now replicating the work of the panoptic sensor and labeling each pixel with its corresponding class. To share information between sensors, I am currently using the space of the head_depth sensor and updating it with the mask values.
Now there is no error in the code when debugging it with the train run type. However, I get a CUDA out of memory error after the code has run for some time. I want to run a batch job, but the command given in the ovmm readme differs from the format run.py expects as input, especially regarding the skill to be trained.
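To illustrate the per-pixel labeling described above, here is a rough sketch of turning per-instance masks into a single class map. It is not the code in sensors.py: the function name `masks_to_class_map`, the `(N, H, W)` boolean mask layout, the use of 0 as the background ID, and the rule that higher-confidence detections win on overlapping pixels are all assumptions made for the example.

```python
import numpy as np


def masks_to_class_map(masks, class_ids, scores, background_id=0):
    """Collapse N instance masks of shape (N, H, W) into one (H, W) class map."""
    masks = np.asarray(masks, dtype=bool)
    class_map = np.full(masks.shape[1:], background_id, dtype=np.int32)
    # Paint low-confidence masks first so that higher-confidence
    # detections overwrite them where masks overlap.
    for i in np.argsort(scores):
        class_map[masks[i]] = class_ids[i]
    return class_map


# Two overlapping detections on a 3x4 image (made-up values).
masks = np.zeros((2, 3, 4), dtype=bool)
masks[0, 0:2, 0:2] = True  # detection 0 covers the top-left 2x2 block
masks[1, 1:3, 1:3] = True  # detection 1 overlaps it at pixel (1, 1)

class_map = masks_to_class_map(masks, class_ids=[5, 7], scores=[0.6, 0.9])
```

Pixels not covered by any mask keep the background ID, so the result has the same shape as a single-channel sensor reading.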
python -u -m habitat_baselines.run \
  --exp-config habitat-baselines/habitat_baselines/config/ovmm/rl_skill.yaml \
  --run-type train benchmark/ovmm=<skill_name> \
  habitat_baselines.checkpoint_folder=data/new_checkpoints/ovmm/<skill_name>
Can you please check my code and help me set up a batch job?