f90 / Wave-U-Net

Implementation of the Wave-U-Net for audio source separation
MIT License
845 stars 177 forks source link

After epoch, <defunct> process remains #4

Closed ghost closed 6 years ago

ghost commented 6 years ago

EPOCH: 1 Starting worker 2018-10-21 14:50:57.740707: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0 2018-10-21 14:50:57.740741: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix: 2018-10-21 14:50:57.740747: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0 2018-10-21 14:50:57.740751: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N 2018-10-21 14:50:57.740864: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10417 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:04:00.0, compute capability: 6.1) terminate called after throwing an instance of 'std::system_error' what(): Invalid argument terminate called after throwing an instance of 'std::system_error' what(): Invalid argument terminate called after throwing an instance of 'std::system_error' what(): Invalid argument terminate called after throwing an instance of 'std::system_error' what(): Invalid argument terminate called after throwing an instance of 'std::system_error' what(): Invalid argument

num_workers was 6 and when I checked process, there were 5 defunct childerens. Is stop_worker problem?? server is Ubuntu 16.04.5 LTS (GNU/Linux 4.15.0-36-generic x86_64)

f90 commented 6 years ago

Thanks for reporting this.

Can you give a bit more detail on this please? Is one epoch running fine, then this appears on the beginning of the next one? Can you try with just one worker to see whether it works or you get this error only once? What happens after the error? Does it freeze? Does it crash? I am confused by the lack of output after these error messages. Can you pinpoint where this error occurs in my code? Maybe by inserting some print statements at the important steps (before and after stop_workers for example)?

Are you using the Python version and packages as indicated in the Readme (Python2 and packages from requirements.txt)?

ghost commented 6 years ago

It was because of server. Sorry for confusing. Thank you!