BPBNet - GitHub Issues

GanyongMo commented 8 months ago

Hello Henry,

I am confused about training the BPBNet model. Do you have any idea what might be causing the following issues?

I ran the command:

    python3.6 train_BPXnet.py --X_is 'B' --slp 'mixedreal' --train_onlybetanet

and it raised this error:

    File "../lib_py/tensorprep_lib_bp.py", line 151, in prep_reconstruction_gt
        x[im_ct, start_map_idx, :, :] = dat['mesh_depth'][entry].astype(np.float32)
    KeyError: 'mesh_depth'

I then skipped the first step, ran the second and third steps, and followed the fourth step for the BPBNet:

    python train_BPXnet.py --X_is 'B' --mod 2 --slp 'mixedreal' --v2v

which raised this error:

    File "../lib_py/tensorprep_lib_bp.py", line 170, in prep_reconstruction_input_est
        x[im_ct, start_map_idx, :, :] = dat['pimg_est'][entry].astype(np.float32)
    KeyError: 'pimg_est'

I found that the program reads the dataset correctly, but I don't know what I missed... Is it necessary to get the BodyPressureSD addendum dataset (148G) to set up training for BPBNet?

Thanks in advance.

Best Regards, Ganyong

henryclever commented 7 months ago

Hi Ganyong,

I'll have to try this again and repro your issue to see - I haven't looked at this code in some time and don't have enough information to know exactly what is causing it. The steps should work out of the box if you have the data downloaded in the correct folder. Are you sure you downloaded all the data (except the addendum)?

You definitely don't need the addendum dataset for training this.

Did you try it with --X_is 'W'? if so did that work?

GanyongMo commented 7 months ago

Hi Henry,

So glad to hear from you!!

Yes, I already downloaded all the data except the addendum. That is why I asked whether I need it for the basic training process of both the black-box and white-box NNs: the dataset is too large, and I am afraid downloading it is impossible in my case. But now it is clear to me.

I am now trying to train with --X_is 'W' and am on step 2 of 4. I will update here once it is done.

For the problem I hit previously (--X_is 'B'), I am sure the dataset contains the corresponding 'mesh_depth' and 'pimg_est' entries, but when the function accesses the dataset, it reports these two keys as missing.
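One way to narrow down a mismatch like this is to check which keys are present in the dict at the point where the loader fails. The snippet below is only a minimal sketch: `report_missing_keys` and the `example` dict are hypothetical names for illustration and are not part of the BodyPressure code. It assumes the object passed around as `dat` behaves like a plain Python dict.

```python
# Keys named in the two KeyError tracebacks reported in this thread.
REQUIRED_KEYS = ("mesh_depth", "pimg_est")


def report_missing_keys(dat, required=REQUIRED_KEYS):
    """Return the required keys that are absent from the dict-like `dat`."""
    return [key for key in required if key not in dat]


if __name__ == "__main__":
    # Hypothetical stand-in for the dict the loader receives; in practice one
    # would call report_missing_keys(dat) just before the failing line.
    example = {"images": [], "mesh_depth": []}
    print("present keys:", sorted(example.keys()))
    print("missing keys:", report_missing_keys(example))
```

If the keys exist in the files on disk but are missing here, the dict was probably built from a different file or filtered along the way, which points at the loading path rather than the data.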

By the way, checking the code is a bit painful (time-consuming) because VSCode cannot run the debugger with python==3.6 (for the moment VSCode only supports Python 3.8 or higher). Do you have any ideas? (If time allows, I would like to try configuring the environment with Python 3.8 so that the code runs correctly.)

Anyway, let's keep in touch.

Best Regards, Ganyong

GanyongMo commented 7 months ago

> By the way, checking the code is a bit painful (time-consuming) because VSCode cannot run the debugger with python==3.6 (for the moment VSCode only supports Python 3.8 or higher). Do you have any ideas? (If time allows, I would like to try configuring the environment with Python 3.8 so that the code runs correctly.)

I have solved this problem; the debugger environment just needed to be configured appropriately.

> Did you try it with --X_is 'W'? if so did that work?

It is working for me. I will try again to figure out what the problem is in the "--X_is 'B'" case. Thank you so much!

Best Regards, Ganyong

henryclever commented 7 months ago

Ah! Glad you got the problem solved with the debugger and envt.

I'm glad the --X_is 'W' is working. Please let me know if you have this issue again with --X_is 'B'! If it is a bug in the code I will make sure it gets fixed for you as soon as possible.

-Henry

GanyongMo commented 7 months ago

> Please let me know if you have this issue again with --X_is 'B'! If it is a bug in the code I will make sure it gets fixed for you as soon as possible.

Hi, unfortunately I tried again today and the problem is still there, the same as before (--X_is 'B'). I am not sure whether you see the same problem when you run it. I am also trying to figure it out.

-- Ganyong

henryclever commented 7 months ago

OK - I'm downloading the data now and will try. Could you send me the contents of the danaLab data? It's on a computer from my old lab that I no longer have access to. It may be quicker for me to get them from you than to go through and request them from AC Lab again. Send to (either directly or through some link, I don't care): hclever@nvidia.com. Otherwise let me know and I'll request from AC Lab.

Thanks!

henryclever commented 7 months ago

@GanyongMo, thanks for sending.

There is definitely a bug in this for step 1... I repro'd it and found the same issue. As a workaround in the interim, just train betanet for the "B" network using the "W" flag in step 1. In practice the betanet trains the same way (so this doesn't make a difference), but it should be corrected (I will fix it).

By the way I'm using the following package versions:

python 3.6.9
numpy 1.20.3
trimesh 3.8.19
pyrender 0.1.45
pillow 8.1.0
sudo apt install libjpeg-dev zlib1g-dev
matplotlib 3.3.4
torch 1.7.1
torchvision 0.8.2
chumpy 0.70
opencv-python 4.5.1.48
scikit-learn 0.23.2
open3d-python 0.7.0.0
imutils 0.5.4
camera 1.3.0
imageio 2.9.0
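To compare a local environment against the versions listed above, something like the following could be used. It is a rough sketch, not part of the repository: the helper names are made up for illustration, and `PINS` copies only a few entries from the list. It uses `importlib.metadata` where available (Python 3.8+) and falls back to `pkg_resources` on older interpreters such as 3.6.

```python
# A few of the versions quoted in this thread; extend as needed.
PINS = {"numpy": "1.20.3", "torch": "1.7.1", "torchvision": "0.8.2",
        "trimesh": "3.8.19", "pyrender": "0.1.45"}


def installed_version(name):
    """Return the installed version string of `name`, or None if not installed."""
    try:
        from importlib.metadata import version, PackageNotFoundError
    except ImportError:  # Python < 3.8
        import pkg_resources
        try:
            return pkg_resources.get_distribution(name).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def version_mismatches(pins, installed):
    """Return {package: (wanted, found)} for every pin that is not met exactly."""
    result = {}
    for name, wanted in pins.items():
        found = installed.get(name)
        # Drop a local suffix such as "+cu110" before comparing.
        base = found.split("+")[0] if found else None
        if base != wanted:
            result[name] = (wanted, found)
    return result


if __name__ == "__main__":
    found = {name: installed_version(name) for name in PINS}
    for name, (wanted, got) in version_mismatches(PINS, found).items():
        print(f"{name}: wanted {wanted}, found {got}")
```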

I just ran into the following error: "RuntimeError: CUDA error: no kernel image is available for execution on the device" -- and I need to get past this to get step 2 working so I can get to step 4. What CUDA driver version are you using (e.g. did you have to downgrade to...)?

Henry

henryclever commented 7 months ago

Maybe if you are using the latest CUDA headers you can send me your versions and I can try with those instead of my old ones? My computer has the 535 driver installed, with an A6000.

GanyongMo commented 7 months ago

@henryclever

> I just ran into the following error: "RuntimeError: CUDA error: no kernel image is available for execution on the device" -- and I need to get past this to get step 2 working so I can get to step 4

Yep, I got the same error at the beginning as well. The reason was that the torch and CUDA versions were not compatible. The packages and corresponding versions that I am using are in the file below:

bodypressure.yaml

> maybe if you are using latest cuda headers you can send me your versions and I can try with those instead of my old ones?

These are the CUDA headers version and the corresponding PyTorch build that I am using; this is also what resolved the earlier RuntimeError for me.

CUDA 11.1:

    conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=11.1 -c pytorch -c conda-forge

the link: https://pytorch.org/get-started/previous-versions/
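A "no kernel image is available" error generally means the installed PyTorch build contains no code compiled for the GPU's compute capability. The sketch below is one way to check that and is not taken from the repository: `parse_arch` and `arch_supported` are hypothetical helpers. They apply the usual CUDA compatibility rule, where an `sm_XY` binary runs on devices with the same major version and an equal or higher minor version, and `compute_XY` PTX can be JIT-compiled for any equal or newer device. It assumes `torch.cuda.get_arch_list()` is available in the installed PyTorch and skips the live check otherwise.

```python
import re


def parse_arch(tag):
    """Split a tag such as 'sm_86' or 'compute_75' into (kind, major, minor)."""
    match = re.fullmatch(r"(sm|compute)_(\d+)(\d)[a-z]?", tag)
    if match is None:
        raise ValueError(f"unrecognised architecture tag: {tag!r}")
    kind, major, minor = match.groups()
    return kind, int(major), int(minor)


def arch_supported(capability, arch_list):
    """Return True if any entry in arch_list can run on a device of `capability`."""
    major, minor = capability
    for tag in arch_list:
        kind, a_major, a_minor = parse_arch(tag)
        if kind == "sm" and a_major == major and a_minor <= minor:
            return True  # compiled binary for the same major architecture
        if kind == "compute" and (a_major, a_minor) <= (major, minor):
            return True  # PTX that the driver can JIT-compile
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        get_arch_list = getattr(torch.cuda, "get_arch_list", None)
        capability = torch.cuda.get_device_capability(0)
        print("torch", torch.__version__, "built for CUDA", torch.version.cuda)
        print("device capability:", capability)
        if get_arch_list is not None:
            archs = get_arch_list()
            print("compiled architectures:", archs)
            print("device supported:", arch_supported(capability, archs))
```

If this prints `device supported: False`, installing a PyTorch build that targets a newer CUDA toolkit, as in the conda command above, is the usual remedy.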

In the end, I think I can now run all the commands of the 4 steps successfully after building a specific branch for the condition "X_is 'B' and mod = 2". I am running them now to verify the results. If possible, I can share it with you so you can check whether this is on the right track when you are available (maybe I can create a branch on the GitHub repo, or share it another way; either is fine with me, so let me know which you prefer).

--Ganyong

henryclever commented 7 months ago

Sure! happy to check - just create a branch and I'll take a look. :)

-Henry

Healthcare-Robotics / BodyPressure

BPBNet #19