csmliu / STGAN

STGAN: A Unified Selective Transfer Network for Arbitrary Image Attribute Editing
MIT License

How to train the attribute classifier to test custom attributes? #34

DateBro closed this issue 4 years ago

DateBro commented 4 years ago

Hi csmliu, thank you so much for your brilliant work! I want to use STGAN as a baseline in my paper and need its attribute classification accuracy on some attributes other than your predefined ones. However, I am confused about the required tfrecord data, which differs slightly from what LynnHo/TfrecordCreator produces. To avoid potential errors, could you give a more detailed tutorial on training the attribute classifier?

csmliu commented 4 years ago

Hi DateBro, Thanks for your attention.

  1. The provided test model was trained on all 40 attributes provided by CelebA, so you only need to add the other attributes to att_id (https://github.com/csmliu/STGAN/blob/ce592029501724fddc47019cd2b25624ad166c1f/att_classification/test.py#L34-L46). The overall accuracy is similar to (and should be slightly higher than) the value reported in the AttGAN paper.
  2. I've found the code that generates the tfrecord file, and I modified it for easier use. If you need to train your own attribute classifier, you can use Create_TFRecord.zip to generate a training tfrecord file.

If there are any other problems, please feel free to reopen this issue.
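
As a rough illustration of item 1, the sketch below maps attribute names to their 0-based column indices in the order used by CelebA's list_attr_celeba.txt annotation file. The names `CELEBA_ATTRIBUTES` and `attribute_ids` are hypothetical helpers, not part of the STGAN repository, and the actual layout of att_id in test.py may differ, so treat this only as a way to work out which indices to add.

```python
# The 40 CelebA attributes in annotation-file order (0-based indices).
CELEBA_ATTRIBUTES = [
    "5_o_Clock_Shadow", "Arched_Eyebrows", "Attractive", "Bags_Under_Eyes",
    "Bald", "Bangs", "Big_Lips", "Big_Nose", "Black_Hair", "Blond_Hair",
    "Blurry", "Brown_Hair", "Bushy_Eyebrows", "Chubby", "Double_Chin",
    "Eyeglasses", "Goatee", "Gray_Hair", "Heavy_Makeup", "High_Cheekbones",
    "Male", "Mouth_Slightly_Open", "Mustache", "Narrow_Eyes", "No_Beard",
    "Oval_Face", "Pale_Skin", "Pointy_Nose", "Receding_Hairline",
    "Rosy_Cheeks", "Sideburns", "Smiling", "Straight_Hair", "Wavy_Hair",
    "Wearing_Earrings", "Wearing_Hat", "Wearing_Lipstick",
    "Wearing_Necklace", "Wearing_Necktie", "Young",
]


def attribute_ids(names):
    """Return the 0-based CelebA indices for the given attribute names.

    Hypothetical helper for illustration; raises KeyError on unknown names.
    """
    index = {name: i for i, name in enumerate(CELEBA_ATTRIBUTES)}
    unknown = [name for name in names if name not in index]
    if unknown:
        raise KeyError(f"Unknown CelebA attributes: {unknown}")
    return [index[name] for name in names]


if __name__ == "__main__":
    # Example: three attributes one might want to evaluate.
    print(attribute_ids(["Eyeglasses", "Smiling", "Young"]))  # [15, 31, 39]
```

Looking indices up by name instead of typing them by hand avoids off-by-one mistakes when extending the evaluated attribute set.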

DateBro commented 4 years ago

I tested the commands in https://github.com/csmliu/STGAN/blob/master/att_classification/README.md, but ran into trouble. With tensorflow-gpu 1.15, I got errors like

(0) Failed precondition: Attempting to use uninitialized value classifier/Conv_2/weights
     [[node classifier/Conv_2/weights/read (defined at /home/zhiyong/.local/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
     [[Cast/_3]]
  (1) Failed precondition: Attempting to use uninitialized value classifier/Conv_2/weights
     [[node classifier/Conv_2/weights/read (defined at /home/zhiyong/.local/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.

When using a new virtual environment of tf-gpu1.12 or 1.4, I always got

WARNING:tensorflow:From /home/zhiyong/RemoteServer/pycharm_projects/STGAN/att_classification/tflib/collection.py:62: The name tf.GraphKeys is deprecated. Please use tf.compat.v1.GraphKeys instead.

WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

Segmentation fault (core dumped)

I have no idea what to do. Can you give me some advice?

csmliu commented 4 years ago

In my experience TF 1.12 should work and should not raise such warnings. Please check whether you are using the right version by running import tensorflow as tf; print(tf.__version__)

csmliu commented 4 years ago

As I recall, the warning about TF 2.0 appears in TF 1.13 or 1.14, so you may be using the wrong version.
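
A small sketch of that check is shown below. The helper `matches_release` is hypothetical and only compares the major.minor part of a version string. Printing `sys.executable` as well shows which interpreter (and therefore which conda environment) is actually running, which can differ from the one you believe is active.

```python
import sys


def matches_release(version, expected="1.12"):
    """Return True if `version` (e.g. "1.12.0") is in the `expected` major.minor release."""
    return version.split(".")[:2] == expected.split(".")[:2]


if __name__ == "__main__":
    # The interpreter path reveals which environment is really in use.
    print("Python executable:", sys.executable)

    # In the real environment, pass tf.__version__ instead of a literal:
    #     import tensorflow as tf
    #     print(tf.__version__, matches_release(tf.__version__))
    for candidate in ("1.12.0", "1.15.0"):
        print(candidate, matches_release(candidate))
```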

DateBro commented 4 years ago

Sorry, I overlooked how Anaconda environments work and was actually running TF 1.15 instead of TF 1.12. But once I was really using TF 1.12, I still got the same error as with TF 1.15.

tensorflow.python.framework.errors_impl.FailedPreconditionError: Attempting to use uninitialized value classifier/Conv/weights
     [[Node: classifier/Conv/weights/read = Identity[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](classifier/Conv/weights)]]
     [[Node: Cast/_3 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_228_Cast", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

csmliu commented 4 years ago

Pretty weird error. Have you modified the network architecture?

BTW, to make sure that the environment is correctly set, it's better to create the virtual environment by

conda create -n NAME tensorflow-gpu=1.12

That is, specify the TF version when creating the environment. This way, Anaconda automatically installs compatible packages that will not cause conflicts (e.g., an older version of Python).

csmliu commented 4 years ago

Please refer to https://stackoverflow.com/questions/44624648/tensorflow-attempting-to-use-uninitialized-value-in-variable-initialization

DateBro commented 4 years ago

I only modified basic.py

return imageio.imread(path, pilmode=mode) / 127.5 - 1

and set test_tfrecord_path = './tfrecords/test' to an absolute path, because with the relative path I got a "No such file or directory" error. I ran test.py on Ubuntu 18.04 with an RTX 2070. Should I switch to Windows? It seems that your train.py was run on Windows. I am still confused about the advice on Stack Overflow: shouldn't test.py just read the checkpoint file and predict?
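
One way to investigate this kind of error is to check whether every variable in the graph actually has a counterpart in the checkpoint. The sketch below is only a name-comparison helper under stated assumptions: `missing_from_checkpoint` is a hypothetical function, and the TF 1.x calls mentioned in the comments for collecting the two name lists are an assumption about how one might gather them, not something taken from test.py. Note that if a variable is reported missing, running an initializer would silence the error but leave that variable at random values rather than trained ones.

```python
def strip_output_suffix(name):
    """Drop the ':0' style output suffix that TF 1.x appends to variable names."""
    return name.rsplit(":", 1)[0] if ":" in name else name


def missing_from_checkpoint(graph_var_names, checkpoint_var_names):
    """Return graph variable names that have no entry in the checkpoint.

    Assumption: in TF 1.x the inputs could be collected roughly as
        graph_var_names      = [v.name for v in tf.global_variables()]
        checkpoint_var_names = [n for n, _ in tf.train.list_variables(ckpt_path)]
    """
    graph = {strip_output_suffix(name) for name in graph_var_names}
    return sorted(graph - set(checkpoint_var_names))


if __name__ == "__main__":
    # Illustrative names only, modelled on the error message in this thread.
    graph_vars = ["classifier/Conv/weights:0", "classifier/Conv_2/weights:0"]
    ckpt_vars = ["classifier/Conv/weights"]
    print(missing_from_checkpoint(graph_vars, ckpt_vars))
```

An empty result would suggest the names line up and the restore step itself (for example, the checkpoint path) deserves a closer look.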

DateBro commented 4 years ago

Thanks for your detailed help. I can run the test after adding the code from Stack Overflow. 👍

DateBro commented 4 years ago

I got different accuracies from the commands

python test.py --experiment_name 128 --test_int 2 --dataroot mydataroot
python att_classification/test.py --img_dir ./output/128/sample_testing

First test:

Acc.
[0.64472498 0.84395351 0.64632802 0.13325318 0.17969141 0.12954614
 0.93497646 0.39469993 0.50490933 0.90997896 0.16245867 0.93377417
 0.35196874]

Second test:

Acc.
[0.02609959 0.40992886 0.27326921 0.13325318 0.18154494 0.20203386
 0.92084961 0.48176535 0.49509067 0.39339746 0.76605551 0.04207995
 0.75288047]

Is there something I forgot to do? The results are so weird.
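
Evaluating a fixed checkpoint on a fixed image folder is normally deterministic, so repeated runs should agree to within floating-point noise. The sketch below compares the two accuracy vectors reported above. The helper names and the tolerance are arbitrary choices for illustration.

```python
# Per-attribute accuracies copied from the two runs reported above.
FIRST_RUN = [
    0.64472498, 0.84395351, 0.64632802, 0.13325318, 0.17969141, 0.12954614,
    0.93497646, 0.39469993, 0.50490933, 0.90997896, 0.16245867, 0.93377417,
    0.35196874,
]
SECOND_RUN = [
    0.02609959, 0.40992886, 0.27326921, 0.13325318, 0.18154494, 0.20203386,
    0.92084961, 0.48176535, 0.49509067, 0.39339746, 0.76605551, 0.04207995,
    0.75288047,
]


def max_abs_difference(a, b):
    """Largest element-wise absolute difference between two equal-length lists."""
    if len(a) != len(b):
        raise ValueError("accuracy vectors must have the same length")
    return max(abs(x - y) for x, y in zip(a, b))


def runs_agree(a, b, tol=1e-6):
    """True if the two runs match within `tol` (an arbitrary illustrative tolerance)."""
    return max_abs_difference(a, b) <= tol


if __name__ == "__main__":
    print("max difference:", max_abs_difference(FIRST_RUN, SECOND_RUN))
    print("runs agree:", runs_agree(FIRST_RUN, SECOND_RUN))
```

A gap this large between runs points at something random in the pipeline, such as weights that are re-initialized on every run instead of being restored from the checkpoint.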

csmliu commented 4 years ago

Did you run the two commands twice or only run the second command twice?

DateBro commented 4 years ago

I only repeated the second command several times and found that the results differ from each other and from your quantitative results in #18.

csmliu commented 4 years ago

Well, maybe I have found the problem. Please use the original code, and test via

cd att_classification
python test.py --img_dir ../output/128/sample_testing
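
The working directory matters here because relative paths such as ./tfrecords/test are resolved against the directory the command is launched from, not against the script's location. A minimal sketch of making that explicit is shown below. `resolve_from` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path


def resolve_from(base_dir, relative_path):
    """Resolve `relative_path` against `base_dir` rather than the current working directory.

    If `relative_path` is already absolute, it is returned (normalized) unchanged.
    """
    return (Path(base_dir) / relative_path).resolve()


if __name__ == "__main__":
    # Where relative paths are resolved by default:
    print("cwd:", Path.cwd())

    # Inside a script, Path(__file__).resolve().parent could serve as base_dir
    # so that the path no longer depends on where the command is run from.
    print(resolve_from(Path.cwd(), "./tfrecords/test"))
```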

DateBro commented 4 years ago

Hmm, the results still have the same problem.

csmliu commented 4 years ago

Could you download the repo and try again using the original att_classification folder?

csmliu commented 4 years ago

I've just tested the code with Ubuntu 18.04 and TensorFlow 1.12.

DateBro commented 4 years ago

After cloning the repo and trying again, I got the correct accuracy, matching the quantitative results. In my previous copy, I had only added the init code from Stack Overflow to make it run. Anyway, thanks for your help. I'll figure out what the problem is on my machine or in the modified code.