[question] Segmentation fault (core dumped) when running python auto_pose/test/aae_retina_webcam_pose.py -test_config aae_retina_webcam.cfg -vis

bigbigdinosaur commented 5 years ago

hello!

In issue 22, I asked what to do after training the AAE. Following your helpful reply, I used the keras-retinanet repository and successfully trained my RetinaNet detector, which is saved as an .h5 file. Then I revised aae_retina_webcam.cfg as follows:

[MODEL]
gpu_memory_fraction = 0.9
[DATA]
color_format = bgr
color_data_type = np.float32
depth_data_type = np.float32
[AAE]
experiments = ['exp_group/my_autoencoder']
upright = False
topk = 1
[DETECTOR]
detector_model_path = /home/zelong/Desktop/keras-retinanet-master/snapshots/after.h5
backbone = resnet50
class_names = [0]
nms_threshold = 0.5
det_threshold = 0.8
max_detections = 3 #300
[CAMERA]
width = 960
height = 720
K_test = [810.4968405 ,0.,487.55096072, 0., 810.61326022 ,354.6674888 , 0., 0., 1.]
camPose = False
[ICP]
icp = False
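For readers unfamiliar with the K_test entry: the nine values form a row-major 3x3 pinhole intrinsics matrix. A minimal sketch, assuming Python 3's configparser and a hypothetical helper name `load_intrinsics` (the AAE scripts do their own parsing, which is not reproduced here), shows how the values map to focal lengths and principal point:

```python
import ast
import configparser

# Excerpt of the [CAMERA] section from the config above.
CAMERA_CFG = """
[CAMERA]
width = 960
height = 720
K_test = [810.4968405 ,0.,487.55096072, 0., 810.61326022 ,354.6674888 , 0., 0., 1.]
"""

def load_intrinsics(cfg_text):
    """Hypothetical helper: return (K, width, height) from a [CAMERA] section."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    cam = parser["CAMERA"]
    flat = ast.literal_eval(cam["K_test"])  # flat list of 9 numbers
    if len(flat) != 9:
        raise ValueError("K_test must contain 9 values")
    K = [flat[0:3], flat[3:6], flat[6:9]]  # row-major 3x3
    return K, cam.getint("width"), cam.getint("height")

K, width, height = load_intrinsics(CAMERA_CFG)
fx, fy = K[0][0], K[1][1]  # focal lengths in pixels
cx, cy = K[0][2], K[1][2]  # principal point in pixels
print("fx=%.1f fy=%.1f cx=%.1f cy=%.1f for %dx%d" % (fx, fy, cx, cy, width, height))
```

The principal point (cx, cy) should lie near the image center for the stated width and height, which is a quick sanity check on a calibration.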

Then I ran this command: python aae_retina_webcam_pose.py -test_config aae_retina_webcam.cfg -vis

And the result is:

Using TensorFlow backend.
2019-03-24 17:07:37.816778: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-03-24 17:07:37.909418: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-03-24 17:07:37.910052: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
name: GeForce GTX 1050 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.62
pciBusID: 0000:01:00.0
totalMemory: 3.94GiB freeMemory: 3.48GiB
2019-03-24 17:07:37.910068: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2019-03-24 17:07:38.125015: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-24 17:07:38.125048: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0
2019-03-24 17:07:38.125073: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N
2019-03-24 17:07:38.125287: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3634 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
/home/zelong/.conda/envs/aae/lib/python2.7/site-packages/keras/engine/saving.py:293: UserWarning: No training configuration found in save file: the model was not compiled. Compile it manually.
  warnings.warn('No training configuration found in save file: '
/home/zelong/Desktop/path/to/autoencoder_ws/experiments/exp_group/my_autoencoder/my_autoencoder.cfg

Start video stream with shape: 640,480
N/A% (0 of 1) | | Elapsed Time: 0:00:00 ETA: --:--:--
Segmentation fault (core dumped)

I have tried to find a solution but failed. Could you help me find out what is wrong?

The AAE and RetinaNet each work properly on their own.

Thank you very much!

bigbigdinosaur commented 5 years ago

It seems to go wrong at line 42 of aae_retina_webcam_pose.py:

renderer = meshrenderer.Renderer(ply_model_paths, samples=1, vertex_tmp_store_folder=get_dataset_path(workspace_path), vertex_scale=float(1)) # float(1) for some models

MartinSmeyer commented 5 years ago

From the retinanet FAQ, the warning is harmless: "I get the warning UserWarning: No training configuration found in save file: the model was not compiled. Compile it manually., should I be worried? This warning can safely be ignored during inference."

Failing at the Renderer stage is quite strange since you have used it for training..

1.) Did you by chance change the machine or paths/locations of 3D model files?

2.) You can try to reduce the gpu_memory_fraction to 0.7 . Might be that the renderer needs some more space since you only have 3.48GB.
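To put the second suggestion in numbers, here is a rough sketch. It assumes gpu_memory_fraction is handed to TensorFlow's per_process_gpu_memory_fraction and applied to the card's total memory; the 3634 MB in the log is close to 0.9 times 3.94 GiB, which is consistent with that, but it remains an assumption. The helper name is illustrative.

```python
TOTAL_GIB = 3.94  # totalMemory reported in the log
FREE_GIB = 3.48   # freeMemory reported in the log

def tf_budget_mib(fraction, total_gib=TOTAL_GIB):
    """Memory TensorFlow may claim, in MiB, for a given fraction (assumed model)."""
    return fraction * total_gib * 1024.0

for fraction in (0.9, 0.7):
    budget = tf_budget_mib(fraction)
    headroom = FREE_GIB * 1024.0 - budget  # what is left for everything else
    print("fraction=%.1f budget=%.0f MiB headroom=%.0f MiB" % (fraction, budget, headroom))
```

Under these assumptions, 0.9 leaves no headroom relative to the reported free memory, while 0.7 leaves several hundred MiB for the renderer.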

bigbigdinosaur commented 5 years ago

Thank you for your reply! I have tried both methods you mentioned, but the problem still exists... I also added this line to aae_retina_webcam_pose.py: os.environ["CUDA_VISIBLE_DEVICES"]="1". Even when I use only the CPU, it still fails with the same error.
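A note on the CPU-only test: on a machine whose only GPU is device 0, setting CUDA_VISIBLE_DEVICES to "1" leaves no CUDA device visible, and an empty string does the same more explicitly. The variable has to be set before TensorFlow initializes CUDA, so it belongs at the very top of the script. A minimal sketch (the helper name is hypothetical); it only affects CUDA, and whether the renderer's OpenGL context is affected is not established in this thread:

```python
import os

def force_cpu_only():
    """Hide every CUDA device from libraries imported afterwards."""
    os.environ["CUDA_VISIBLE_DEVICES"] = ""

force_cpu_only()
# import tensorflow as tf  # import only after the variable has been set
print("CUDA_VISIBLE_DEVICES=%r" % os.environ["CUDA_VISIBLE_DEVICES"])
```

Since the crash persisted with the GPU hidden, the TensorFlow side is an unlikely culprit.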

MartinSmeyer commented 5 years ago

Does it run when you leave out the -vis flag in the command?

bigbigdinosaur commented 5 years ago

yes! but the result is repeated:

float32 (1, 800, 1067, 3)
float32 (1, 800, 1067, 3)
float32 (1, 800, 1067, 3)
float32 (1, 800, 1067, 3)
float32 (1, 800, 1067, 3)
float32 (1, 800, 1067, 3)
float32 (1, 800, 1067, 3)
float32 (1, 800, 1067, 3)
float32 (1, 800, 1067, 3)
float32 (1, 800, 1067, 3)
float32 (1, 800, 1067, 3)
float32 (1, 800, 1067, 3)
float32 (1, 800, 1067, 3)
float32 (1, 800, 1067, 3)
float32 (1, 800, 1067, 3)
float32 (1, 800, 1067, 3)

There is no other output. Thank you!
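The repeated shape is consistent with keras-retinanet's default preprocessing, which scales the shorter image side to 800 pixels while keeping the longer side at or below 1333. The sketch below reimplements that arithmetic for illustration (it is not copied from the library, and the function name is made up) and applies it to the 640,480 stream reported in the log:

```python
def resized_shape(rows, cols, min_side=800, max_side=1333):
    """Return (rows, cols) after retinanet-style resizing (illustrative)."""
    scale = float(min_side) / min(rows, cols)
    # Cap the scale if the longer side would exceed max_side.
    if max(rows, cols) * scale > max_side:
        scale = float(max_side) / max(rows, cols)
    return int(round(rows * scale)), int(round(cols * scale))

print(resized_shape(480, 640))
```

A 480x640 frame maps to 800x1067, so each printed line is most likely just the dtype and shape of the detector's input batch for one frame, not an error.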

bigbigdinosaur commented 5 years ago

I used Spyder 2 to run the script line by line. When it reaches line 42 in aae_retina_webcam_pose.py:

renderer = meshrenderer.Renderer(ply_model_paths, samples=1, vertex_tmp_store_folder=get_dataset_path(workspace_path), vertex_scale=float(1)) # float(1) for some models

the kernel dies..

MartinSmeyer commented 5 years ago

Then, it is just the visualization of the pose that fails. So if you take a look at the short script and just print all_pose_estimates you should get out the 6D poses.
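A minimal sketch of what printing the poses might look like, assuming each entry of all_pose_estimates is a 4x4 homogeneous transform with the rotation in the upper-left 3x3 block and the translation in the last column. That layout is an assumption, not something stated in this thread, and `split_pose` is a hypothetical helper:

```python
def split_pose(H):
    """Split a 4x4 homogeneous transform into rotation rows and translation."""
    R = [list(row[:3]) for row in H[:3]]
    t = [row[3] for row in H[:3]]
    return R, t

# Placeholder pose for illustration only: identity rotation,
# translation of 0.5 units along the camera z axis.
all_pose_estimates = [
    [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.5],
     [0.0, 0.0, 0.0, 1.0]],
]

for i, H in enumerate(all_pose_estimates):
    R, t = split_pose(H)
    print("object %d: translation=%s" % (i, t))
    for row in R:
        print("    %s" % row)
```

In the real script the list would come from the pose estimator rather than a literal.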

Can you empty your vertex_tmp_store_folder? So erase everything under $AE_WORKSPACE_PATH/tmp?

Can you check your config file for the vertex_scale parameter? It is hardcoded to 1.0 here (sorry), is it 1000 in your train_config?
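One way to check the value programmatically is to read the training config and compare it with the 1.0 hardcoded in the test script. This sketch uses Python 3's configparser; the section name Dataset and the key VERTEX_SCALE are assumptions about the training config layout, and the inline sample stands in for the real file:

```python
import configparser

# Stand-in for the real training config (assumed layout).
SAMPLE_TRAIN_CFG = """
[Dataset]
VERTEX_SCALE: 1
"""

HARDCODED_SCALE = 1.0  # value used in the test script according to this thread

def read_vertex_scale(cfg_text, section="Dataset", key="VERTEX_SCALE"):
    """Return the vertex scale from a config string as a float."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return parser.getfloat(section, key)

scale = read_vertex_scale(SAMPLE_TRAIN_CFG)
if scale == HARDCODED_SCALE:
    print("vertex scale matches the hardcoded value")
else:
    print("mismatch: trained with %s, script uses %s" % (scale, HARDCODED_SCALE))
```

A mismatch would mean the test-time renderer draws the model at a different size than the one used during training.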

bigbigdinosaur commented 5 years ago

In my cfg file, vertex_scale is 1, not 1000. Should I change it to 1000 and train the AAE again? Thank you!

bigbigdinosaur commented 5 years ago

In $AE_WORKSPACE_PATH I only find the tmp_datasets folder. After I deleted it (renamed it, actually), the problem still exists... I am trying to install different versions of TensorFlow, but that does not seem to help...

MartinSmeyer commented 5 years ago

Can you run

ae_train exp_group/my_autoencoder -d

again? If you see your model in the resulting image, vertex_scale=1 should be okay... It seems to be related not to TensorFlow but to the rendering. Can you try to run the rendering in isolation by commenting out the rest?
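An alternative to commenting code out is to run the suspect step in a child process and inspect its exit status: on Linux, a child killed by SIGSEGV reports a negative return code equal to the signal number. The sketch below uses Python 3 and a stand-in child that deliberately sends SIGSEGV to itself; in practice the snippet would construct the renderer, which is not reproduced here. The helper names are hypothetical.

```python
import signal
import subprocess
import sys

def run_isolated(snippet):
    """Run a Python snippet in a fresh interpreter and return its exit code."""
    return subprocess.call([sys.executable, "-c", snippet])

def describe(returncode):
    """Translate an exit code into a short diagnosis."""
    if returncode == 0:
        return "ok"
    if returncode == -signal.SIGSEGV:
        return "segmentation fault"
    return "failed with code %d" % returncode

# Stand-in for the crashing step (assumes a POSIX system).
CRASH = "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"

print(describe(run_isolated("print('renderer stand-in ran fine')")))
print(describe(run_isolated(CRASH)))
```

Because the crash happens in the child, the parent survives and can report which step failed.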

Finally you can try to replace the meshrenderer with the meshrenderer_phong in the script if you trained on a model with vertex color (reconst):

from auto_pose.meshrenderer import meshrenderer_phong

ply_model_paths = [str(train_args.get('Paths','MODEL_PATH')) for train_args in ae_pose_est.all_train_args]

renderer = meshrenderer_phong.Renderer(ply_model_paths, samples=1, vertex_tmp_store_folder=get_dataset_path(workspace_path))

bigbigdinosaur commented 5 years ago

I have run ae_train exp_group/my_autoencoder -d again with vertex_scale=1 and everything looks right. I will try the other methods you mentioned above. Thanks for your patience!

bigbigdinosaur commented 5 years ago

Success!! I have not tried running the rendering in isolation by commenting out the rest. I downloaded the latest meshrenderer_phong and changed the code according to your instructions, and it ran successfully without error! When the camera detects the object, a green overlay is drawn over it and the translation is printed. I am sorry to have bothered you for so long, and it is so nice of you! THANK YOU!

MartinSmeyer commented 5 years ago

Happy that you got it working :-)

DLR-RM / AugmentedAutoencoder, issue #23