Open yinoue0426 opened 3 years ago
Hi @aiueosamu,
Your README.md file is very complete and contains a lot of information. However, it is not clear which set of instructions must be executed to obtain the results. I think I understand that some of the proposed commands call others, but this is not clear without reading the content of the script files.
I suggest two options:
Moreover, could you detail the packages to install on a fresh Ubuntu 18.04 to use Python 3.6.9 and CUDA 9?
Thanks for the feedback. I added the shortened version of the README.
As for installing CUDA on a fresh Ubuntu, I've based my Docker image on this repository. Since a Dockerfile is nothing but a sequence of installation commands, I think you can just follow the commands listed here to install CUDA and Python on Ubuntu 18.04.
Hi @aiueosamu,
I followed the instructions in the tldr.md file, but I got stuck while training DeepCrack. I tested the following commands on two different computers with the provided Docker images.
Here are the steps that I followed:
fork: https://github.com/hitachi-rd-cv/weakly-sup-crackdet
git clone https://github.com/Cyril-Meyer/weakly-sup-crackdet
git clone https://github.com/tobycheese/9.0-cudnn7-devel-ubuntu18.04
in the 9.0-cudnn7-devel-ubuntu18.04 folder: sudo docker build -t cuda9_ubuntu1804 .
in the weakly-sup-crackdet/docker folder: sudo docker build -t weakly-sup-crackdet .
I got errors while building the weakly-sup-crackdet Docker image, so I made the following changes:
RUN pip3 install scikit-image==0.15.0 pyyaml cython opencv-python==4.1.0.25 futures==3.2.0
ERROR: Could not find a version that satisfies the requirement futures==3.2.0.
->
RUN pip3 install scikit-image==0.15.0 pyyaml cython opencv-python==4.1.0.25 futures==3.1.1
RUN apt install python3-tk
->
RUN apt install -y python3-tk
RUN pip uninstall opencv-python opencv-python-headless opencv-contrib-python
->
RUN pip uninstall -y opencv-python opencv-python-headless opencv-contrib-python
sudo docker run -it --gpus all --mount type=bind,source="$(pwd)",target=/working_dir weakly-sup-crackdet
This produced the following error:
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
ERRO[0000] error waiting for container: context canceled
I found a solution here: https://github.com/NVIDIA/nvidia-docker/issues/1186
I then got the following error 3 times:
dataset [DeepCrackDataset] was created
The number of training images = 60
initialize network with xavier
/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py:52: UserWarning: size_average and reduce args will be deprecated, please use reduction='elementwise_mean' instead.
warnings.warn(warning.format(ret))
model [DeepCrackModel] was created
---------- Networks initialized -------------
[Network G] Total number of parameters : 14.720 M
-----------------------------------------------
create web directory ./checkpoints/aigle_deepcrack_dil1/web...
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=663 error=11 : invalid argument
Traceback (most recent call last):
File "train.py", line 32, in <module>
model.optimize_parameters(epoch) # calculate loss functions, get gradients, update network weights
File "/working_dir/weakly-sup-crackdet/models/deepcrack/models/deepcrack_model.py", line 111, in optimize_parameters
self.forward() # compute predictions.
File "/working_dir/weakly-sup-crackdet/models/deepcrack/models/deepcrack_model.py", line 74, in forward
self.outputs = self.netG(self.image)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/data_parallel.py", line 121, in forward
return self.module(*inputs[0], **kwargs[0])
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/working_dir/weakly-sup-crackdet/models/deepcrack/models/deepcrack_networks.py", line 58, in forward
conv1 = self.conv1(x)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/container.py", line 91, in forward
input = module(input)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/conv.py", line 301, in forward
self.padding, self.dilation, self.groups)
RuntimeError: cuda runtime error (11) : invalid argument at /pytorch/aten/src/THC/THCGeneral.cpp:663
----------------- Options ---------------
Since the model does not train, the evaluation also crashes. This error seems to be related to the PyTorch installation, but since the version is pinned precisely in the Dockerfile, I am stuck. For training DeepLab V3+, I have another problem: the "scripts/train.sh" file does not exist.
Do you have any idea how to fix this problem?
@Cyril-Meyer Thank you for the reply. I've updated the Dockerfile accordingly. Thanks also for the nvidia-docker info. It seems to be more of an nvidia-docker issue, but it helps anyway.
As for the /pytorch/aten/src/THC/THCGeneral.cpp:663 issue, it seems to be caused by the fact that GPUs with the Turing architecture are not compatible with CUDA 9.0 (ref), and both setups you mention use GPUs built on the Turing architecture. Sorry, I was not aware of this point; I have updated the README accordingly.
The CUDA 9 requirement is imposed by the fact that the DeepCrack repo requires PyTorch 0.4.1, which requires CUDA 9, so I cannot do much about it. I think the DeepLab code ran with TensorFlow 1.13.1, which uses CUDA 10, so maybe you can test DeepLab with CUDA 10.
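The compatibility constraint described above can be sketched as a small lookup. This is a simplified illustration, not part of the repository: the function names are hypothetical, the table lists only the two CUDA releases discussed in this thread, and it ignores PTX forward compatibility. It assumes the publicly documented limits that CUDA 9.0 targets compute capability up to 7.0 (Volta) and CUDA 10.0 adds 7.5 (Turing).

```python
# Highest GPU compute capability each CUDA toolkit targets natively.
# Only the two releases discussed in this thread are listed.
MAX_COMPUTE_CAPABILITY = {
    "9.0": (7, 0),   # up to Volta
    "10.0": (7, 5),  # adds Turing
}


def is_gpu_supported(capability, cuda_version):
    """Return True if a GPU with this (major, minor) compute capability
    is natively targeted by the given CUDA toolkit version."""
    return tuple(capability) <= MAX_COMPUTE_CAPABILITY[cuda_version]


def current_gpu_capability():
    """Query the first visible GPU through PyTorch (needs torch and a GPU)."""
    import torch  # imported lazily so the table above works without torch
    return torch.cuda.get_device_capability(0)


if __name__ == "__main__":
    turing = (7, 5)  # compute capability of Turing GPUs
    print(is_gpu_supported(turing, "9.0"))   # False
    print(is_gpu_supported(turing, "10.0"))  # True
```

Running a check like this before training would surface the mismatch up front instead of as a `cuda runtime error (11)` inside the first convolution.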
As for DeepLab, I forgot to add the flags:
./tools/setup_models.sh --deepcrack --deeplab
This should correctly populate the scripts stored in tools/model_supp, and you should be able to find the scripts/train.sh file now.
Hi @aiueosamu, thank you for the update and the clarification.
I followed the new instructions in the tldr.md file, but there are still problems. I tested the following commands on a third computer with the provided Docker images.
OK
The evaluation process failed on the three different models.
FileNotFoundError: [Errno 2] No such file or directory: 'checkpoints/aigle_deepcrack_dil1/web/images'
FileNotFoundError: [Errno 2] No such file or directory: 'checkpoints/cfd_deepcrack_dil1/web/images'
FileNotFoundError: [Errno 2] No such file or directory: 'checkpoints/dc_deepcrack_dil1/test_latest/images'
Concerning the missing flags in ./tools/setup_models.sh --deepcrack --deeplab, this is not the source of the problem: setup_models.sh does not seem to use them, and calls the Python script with both flags in any case.
python3 tools/setup_models.py --deepcrack --deeplab
The setup_models.py script copies the files from "tools/model_supp/deeplab" to "models/deeplab/research/deeplab/". A "scripts" folder is located in "models/deeplab/research/deeplab".
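For readers unfamiliar with this kind of setup step, the copy behaviour described above can be sketched as follows. This is not the repository's `setupFiles` implementation; `copy_tree_into` is a hypothetical helper that overlays one directory tree onto another, written to work on Python 3.6.

```python
import os
import shutil


def copy_tree_into(src_root, dst_root):
    """Copy every file under src_root into dst_root, keeping relative
    paths and overwriting existing files. Returns the copied paths."""
    copied = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        # Mirror the sub-directory structure of the source tree.
        rel = os.path.relpath(dirpath, src_root)
        target_dir = os.path.normpath(os.path.join(dst_root, rel))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            dst = os.path.join(target_dir, name)
            shutil.copy2(os.path.join(dirpath, name), dst)
            copied.append(dst)
    return copied
```

With such a helper, a file stored as `scripts/train.sh` under the source tree ends up as `scripts/train.sh` under the destination, so where the script appears depends entirely on the destination directory passed in.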
I tried two options:
changing setupFiles('tools/model_supp/deeplab', 'models/deeplab/research/deeplab/') into setupFiles('tools/model_supp/deeplab', 'models/deeplab/') to change the copy destination
../research/deeplab/scripts/train.sh
Neither of these options worked. Here is the error:
ModuleNotFoundError: No module named 'deeplab'
@Cyril-Meyer Sorry for the late response and for asking you to run multiple trials, but I think we are close. I followed tldr.md myself and was able to reproduce your error; I apologize. I've updated the repo with a new commit, and hopefully the following will fix the problems.
First, as for DeepCrack, the error you mentioned (posted below for reference) is raised by cleanup code in scripts/output_format.py. Please comment out the last line (line 63) in scripts/output_format.py (in the new commit, this line is wrapped in a try/except).
FileNotFoundError: [Errno 2] No such file or directory: 'checkpoints/dc_deepcrack_dil1/test_latest/images'
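A cleanup step that tolerates a missing directory could look roughly like the sketch below. It assumes the cleanup removes an intermediate image directory, which is a guess based on the error message; the actual contents of scripts/output_format.py are not shown in this thread, and `remove_dir_if_present` is a hypothetical name.

```python
import shutil


def remove_dir_if_present(path):
    """Delete a directory tree. Return True if it was removed and
    False (instead of raising) if the directory does not exist."""
    try:
        shutil.rmtree(path)
        return True
    except FileNotFoundError:
        # Nothing to clean up, e.g. because an earlier step did not run.
        return False
```

Catching only `FileNotFoundError` keeps other failures, such as permission errors, visible.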
In addition, there was an error in models/deepcrack/scripts/test_eval.sh. Line 16 is
scripts/test_deepcrack.sh 0 $MODEL ./datasets/deeprack_detailed ./checkpoints/
but it should really be
scripts/test_deepcrack.sh 0 $MODEL ./datasets/deepcrack_detailed ./checkpoints/
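A misspelled dataset path like the one above is easy to miss because it only fails deep inside the run. One way to catch it early, sketched here as an assumption rather than something the repository does, is to check the directories before launching the script; `missing_dirs` is a hypothetical helper.

```python
import os


def missing_dirs(paths):
    """Return the subset of paths that are not existing directories."""
    return [p for p in paths if not os.path.isdir(p)]


if __name__ == "__main__":
    # Paths taken from the corrected command above.
    problems = missing_dirs(["./datasets/deepcrack_detailed", "./checkpoints/"])
    if problems:
        raise SystemExit("Missing directories: " + ", ".join(problems))
```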
After the above changes, I was able to train the DeepCrack model without any problems. However, I was not able to reproduce the results for the Aigle dataset. It turns out that there are some bugs in the DeepCrack repo which prevented it from producing the correct results for the Aigle dataset.
From the new commit, please replace the following files under models/deepcrack/ with the files under tools/model_supp/deepcrack/:
test.py
data/deepcrack_dataset.py
models/networks.py
util/visualizer.py
With this fix, you should be able to run the script without any problems.
If it runs correctly, you should be able to see the output images stored under checkpoints/*_deepcrack_*/sample_imgs/test_output
As for DeepLab, there were two problems. I've updated the Training DeepLab V3+ section of the tldr.md file for reference.
First, the main directory of the DeepLab files is actually models/deeplab/research/deeplab, not models/deeplab, so run the training scripts from there.
Second, the Google repo requires the PYTHONPATH environment variable to be set correctly. This was causing the ModuleNotFoundError. Run the following lines to resolve the issue:
cd models/deeplab/research
export PYTHONPATH=$PYTHONPATH:`pwd`:`pwd`/slim
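When the training script is launched from Python rather than from an interactive shell, the same two entries can be added to the child process environment. The sketch below mirrors the export line above; `env_with_deeplab_paths` is a hypothetical helper, not part of the repository.

```python
import os


def env_with_deeplab_paths(research_dir, base_env=None):
    """Return a copy of the environment with research_dir and
    research_dir/slim appended to PYTHONPATH."""
    env = dict(os.environ if base_env is None else base_env)
    research_dir = os.path.abspath(research_dir)
    extra = [research_dir, os.path.join(research_dir, "slim")]
    current = env.get("PYTHONPATH", "")
    # Keep any existing entries first, as the shell export does.
    parts = ([current] if current else []) + extra
    env["PYTHONPATH"] = os.pathsep.join(parts)
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` so that `import deeplab` resolves inside the launched script.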
I've also noticed that python eval/micro_eval.py does not run properly due to some images being in jpg format. Please update eval/utils.py and eval/feature_extractor.py.
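The jpg problem mentioned above is typical of code that hard-codes a single file extension. A minimal sketch of extension-agnostic image listing follows; it is an illustration only, not the fix applied in eval/utils.py, and both `IMAGE_EXTENSIONS` and `list_images` are hypothetical names.

```python
import os

IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg")


def list_images(dirname, extensions=IMAGE_EXTENSIONS):
    """Return the sorted image file names in dirname, matching the
    given extensions case-insensitively."""
    return sorted(
        name for name in os.listdir(dirname)
        if name.lower().endswith(extensions)
    )
```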
With these changes, I think you should be able to run the code without problems.
@aiueosamu
I restarted everything from scratch, but the ./tools/download.sh script no longer works: it fails while processing the DCD dataset.
Here is the returned error:
Aigle
CFD
DCD
Traceback (most recent call last):
File "tools/download.py", line 129, in <module>
processDCD()
File "tools/download.py", line 98, in processDCD
f_img_dname, t_img_dname, cv2.imread, prefix=pre, extension='.jpg')
File "tools/download.py", line 16, in populate
for f_fname in os.listdir(from_dname):
FileNotFoundError: [Errno 2] No such file or directory: 'data/deepcrack_github/dataset/test_img'
This is probably due to recent modifications to the script (e.g. an unzip command is no longer executed).
@Cyril-Meyer
I am really sorry about that, you are right. Please uncomment the unzip line in the tools/download.sh script (line 16, I believe).
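A download step can be made robust against this kind of regression by extracting the archive only when the expected directory is absent. The sketch below is an assumption about how that could look in Python; it is not the repository's code, and `extract_if_missing` and its parameters are hypothetical.

```python
import os
import zipfile


def extract_if_missing(zip_path, extract_to, expected_dir):
    """Extract zip_path into extract_to unless expected_dir already
    exists. Returns True if extraction ran, False if it was skipped."""
    if os.path.isdir(expected_dir):
        return False
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(extract_to)
    return True
```

Here `expected_dir` would be a directory such as data/deepcrack_github/dataset/test_img from the traceback above, so the processing step never runs against an archive that has not been unpacked.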