bethgelab / adversarial-vision-challenge

NIPS Adversarial Vision Challenge
https://www.crowdai.org/challenges/nips-2018-adversarial-vision-challenge

avc-test-model-against-attack fails when using transfer_untargeted_attack_baseline #35

Closed csy530216 closed 6 years ago

csy530216 commented 6 years ago

Hello! I know avc-test-model-against-attack is not published yet, but after reading the code the problem still seems strange to me. When I run

./avc-test-model-against-attack --model_directory resnet18_model_baseline/ --attack_directory transfer_untargeted_attack_baseline/ --model_gpu 3 --attack_gpu 3 --no-time-limit

(I copied avc-test-model-against-attack and cloned the baselines to the working directory.)

I get the following error message:

  0%|                                                                                                          | 0/100 [00:00<?, ?it/s]
Your container stopped running before all images
                were processed. This either means that the attack was not
                able to produce adversarials for all samples or that the
                attack stopped because of runtime errors.

Traceback (most recent call last):
  File "./avc-test-model-against-attack", line 332, in <module>
    model_gpu=args.model_gpu, attack_gpu=args.attack_gpu, mode=args.mode, samples=args.samples, no_time_limit=args.no_time_limit)
  File "./avc-test-model-against-attack", line 251, in test_attack
    len(result_files), len(test_samples)))
RuntimeError: The attack produced results for less then 50\% of the samples (0/100).

The error also occurs with iterative_transfer_untargeted_attack_baseline, but not with boundary_untargeted_attack_baseline or saltnpepper_untargeted_attack_baseline.

jonasrauber commented 6 years ago

@csy530216 Can you report the docker logs for the attack container? Running the model and the attack on the same GPU might also be a problem.

@bveliqi can you look into this when you are back?

csy530216 commented 6 years ago

@jonasrauber Yes, you are right! When running the model and the attack on two different GPUs, the problem disappears. When running them on the same GPU, the output of docker logs avc_test_attack_submission is:

Running attack...
Wating for model server to start...
2018-08-22 09:43:09.767651: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-08-22 09:43:10.302564: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties: name: GeForce GTX TITAN X major: 5 minor: 2 memoryClockRate(GHz): 1.076 pciBusID: 0000:03:00.0 totalMemory: 11.93GiB freeMemory: 483.81MiB
2018-08-22 09:43:10.302612: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-08-22 09:43:10.768711: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-22 09:43:10.768774: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0
2018-08-22 09:43:10.768786: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N
2018-08-22 09:43:10.769619: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 202 MB memory) -> physical GPU (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:03:00.0, compute capability: 5.2)
INFO:tensorflow:Restoring parameters from /home/shuyu/resnet18/checkpoints/model/model.ckpt-5865
Restoring parameters from /home/shuyu/resnet18/checkpoints/model/model.ckpt-5865
2018-08-22 09:43:18.004022: E tensorflow/stream_executor/cuda/cuda_dnn.cc:455] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2018-08-22 09:43:18.004092: F tensorflow/core/kernels/conv_ops.cc:713] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo(), &algorithms)
/prd/run.sh: line 2: 7 Aborted (core dumped) python3 -u ./main.py

Maybe the reason is that the model container starts first and consumes nearly all of the GPU memory, which then causes the error in the attack container.
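A quick way to check this (assuming nvidia-smi is available on the host) is to look at per-GPU memory while the model container is up; the log above already shows only ~484 MiB free out of 11.93 GiB:

nvidia-smi --query-gpu=index,memory.total,memory.used,memory.free --format=csv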

csy530216 commented 6 years ago

The reason transfer_untargeted_attack_baseline fails on a shared GPU while boundary_untargeted_attack_baseline does not is that the transfer attack runs its own TensorFlow model, which needs GPU memory. So if we want to run avc-test-model-against-attack on a single GPU, we have to limit the GPU memory used by the model. Changing fmodel.py in resnet18_model_baseline to

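# cap the model's TensorFlow session at half of the GPU memory so the attack container can allocate the rest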
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.5)
sess = tf.Session(graph=graph, config=tf.ConfigProto(gpu_options=gpu_options))

solves the problem.
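An alternative sketch that should also work (untested here, assuming the same TF 1.x session setup in fmodel.py) is to let TensorFlow grow its allocation on demand instead of reserving a fixed fraction:

import tensorflow as tf

# allocate GPU memory on demand instead of reserving (nearly) all of it up front
gpu_options = tf.GPUOptions(allow_growth=True)
sess = tf.Session(graph=graph, config=tf.ConfigProto(gpu_options=gpu_options))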

jonasrauber commented 6 years ago

That might be a workaround, but we recommend using two GPUs, as we do.
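For example, the call from the first post only needs two different indices (the GPU numbers here are just an illustration):

./avc-test-model-against-attack --model_directory resnet18_model_baseline/ --attack_directory transfer_untargeted_attack_baseline/ --model_gpu 2 --attack_gpu 3 --no-time-limit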