fjxmlzn / DoppelGANger

[IMC 2020 (Best Paper Finalist)] Using GANs for Sharing Networked Time Series Data: Challenges, Initial Promise, and Open Questions
http://arxiv.org/abs/1909.13403
BSD 3-Clause Clear License

Incomplete training #30

Closed · fjxmlzn closed this issue 2 years ago

fjxmlzn commented 2 years ago

Recently, I ran into another problem. I tried to run main.py in the example_training folder and main_generate_data.py in the example_generating_data folder. However, the only result was that a folder named results was created, and its subfolders contained nothing but a worker*.log.txt file. Q1: Why were no synthetic datasets for web/google/FCC_MBA generated? [screenshot attached] I looked for a place in the code to specify the dataset path, but found nothing.

Q2: When I know the attributes and features of my own dataset, how do I generate the four files data_attribute_output.pkl, data_feature_output.pkl, data_test.npz, and data_train.npz? Does additional code need to be written for this?

Finally, thank you for your continued patient answers.

Originally posted by @chameleonzz in https://github.com/fjxmlzn/DoppelGANger/issues/3#issuecomment-1193338422
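
For reference, here is a minimal sketch (not from the repository) of how these four files might be assembled for a custom dataset. It assumes the dataset format described in the repository README: the .npz files hold arrays named data_feature, data_attribute, and data_gen_flag, and the .pkl files hold pickled lists of Output objects defined in gan/output.py. Please verify both assumptions against the repo before relying on this.

```python
# Sketch of building the four dataset files for a custom dataset.
# Assumption: the array names and Output metadata below match the format
# documented in the DoppelGANger README and gan/output.py.
import pickle

import numpy as np

# Output, OutputType, Normalization are defined in gan/output.py of the repo;
# make sure that folder is on PYTHONPATH.
from output import Output, OutputType, Normalization

num_samples, max_len = 100, 50

# Features: (num_samples, max_len, feature_dim); here one continuous feature.
data_feature = np.random.rand(num_samples, max_len, 1).astype(np.float32)
# Attributes: (num_samples, attribute_dim); here a one-hot attribute with 3 classes.
data_attribute = np.eye(3, dtype=np.float32)[
    np.random.randint(0, 3, size=num_samples)]
# Generation flags: 1 while the time series is active, 0 after it ends.
data_gen_flag = np.ones((num_samples, max_len), dtype=np.float32)

# Metadata describing each feature / attribute dimension.
data_feature_output = [
    Output(type_=OutputType.CONTINUOUS, dim=1,
           normalization=Normalization.ZERO_ONE, is_gen_flag=False),
]
data_attribute_output = [
    Output(type_=OutputType.DISCRETE, dim=3,
           normalization=None, is_gen_flag=False),
]

np.savez("data_train.npz",
         data_feature=data_feature,
         data_attribute=data_attribute,
         data_gen_flag=data_gen_flag)
# data_test.npz has the same structure (held-out samples); omitted here.

with open("data_feature_output.pkl", "wb") as f:
    pickle.dump(data_feature_output, f)
with open("data_attribute_output.pkl", "wb") as f:
    pickle.dump(data_attribute_output, f)
```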

chameleonzz commented 2 years ago

According to the previous worker.log, I suspected something was wrong with my TF installation (there were multiple TFs). Therefore, I reinstalled TF and ran example_training/main.py again. Now the output is as follows. [screenshot] In the folder aux_disc-False,dataset-FCC_MBA,epoch-17000,epoch_checkpoint_freq-70,extra_checkpoint_freq-850,run-0,sample_len-1,self_norm-False, the content of worker.log is as follows. [screenshots] The last row shows: FileNotFoundError: [Errno 2] No such file or directory: '../results/aux_disc-False,dataset-FCC_MBA,epoch-17000,epoch_checkpoint_freq-70,extra_checkpoint_freq-850,run-0,sample_len-1,self_norm-False,\sample\epoch_id-69,batch_id--1,global_id-419,type-free,feature,output-0,dim-0.png' I think this is close to running successfully. I am currently debugging based on the worker.log messages. Thank you very much for your continued help.

chameleonzz commented 2 years ago

I think I can run DoppelGANger correctly now. To solve the above problem, I debugged main.py in example_training(without_GPUTaskScheduler) and then modified doppelganger.py in gan: I deleted checkpoint_dir in the last row, and after that the code ran properly. Training took about 22 hours, as shown below. (My machine has an i7-10750H CPU, an NVIDIA GeForce RTX 2060 GPU, and 32 GB of RAM.) [screenshot] In example_training(without_GPUTaskScheduler)/test there are three items: checkpoint, sample, and time.txt. The checkpoint folder contains many files, as follows. [screenshot] In addition, the sample folder contains a large number of image files, around 19,000 pictures, plus several npz files. [screenshot]

Are these the right results of running example_training(without_GPUTaskScheduler)/main.py? If so, how do I generate synthetic data for web/google/FCC_MBA?

fjxmlzn commented 2 years ago

Yes, it is the right result with this code.

Regarding the FileNotFoundError you posted in https://github.com/fjxmlzn/DoppelGANger/issues/30#issuecomment-1196779071, it should have already been fixed in https://github.com/fjxmlzn/DoppelGANger/commit/c2f4bfbb890c6c3d9952d8c51d58369d6b288c51 in June 2022. Please re-clone the repo, rerun, and check whether that works.

Regarding data generation for web, you can use https://github.com/fjxmlzn/DoppelGANger/tree/master/example_generating_data(without_GPUTaskScheduler) (before re-running the above training code).

The above "without_GPUTaskScheduler" versions of the training and generation code are only for the web dataset. For other datasets (google, FCC_MBA), you can either modify the hyper-parameters according to the config file https://github.com/fjxmlzn/DoppelGANger/blob/master/example_training/config.py, or directly use the version with GPUTaskScheduler (https://github.com/fjxmlzn/DoppelGANger/tree/master/example_training and https://github.com/fjxmlzn/DoppelGANger/tree/master/example_generating_data).
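
For illustration only, here is a sketch of what a hyper-parameter entry for FCC_MBA might look like, assuming the GPUTaskScheduler convention that test_config is a list of dicts whose values are lists of candidate values. The key names and values are simply read off the FCC_MBA result folder name shown earlier in this thread, so check them against example_training/config.py before using them.

```python
# Hypothetical sketch, not copied from the repo: one test_config entry for
# FCC_MBA, with keys and values taken from the result folder name
# "aux_disc-False,dataset-FCC_MBA,epoch-17000,epoch_checkpoint_freq-70,
#  extra_checkpoint_freq-850,run-0,sample_len-1,self_norm-False," seen above.
config = {
    # "scheduler_config" and "global_config" omitted here; keep them as in
    # the repo's example_training/config.py.
    "test_config": [
        {
            "dataset": ["FCC_MBA"],
            "epoch": [17000],
            "run": [0],
            "sample_len": [1],
            "extra_checkpoint_freq": [850],
            "epoch_checkpoint_freq": [70],
            "aux_disc": [False],
            "self_norm": [False],
        },
    ],
}
```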

Let me know if you run into any issues with the code.

chameleonzz commented 2 years ago


After modifying example_training/config.py and the other config*.py files according to c2f4bfb, I still got the same error after re-running the code, just as shown in #30 (comment).

In the 'aux_disc-False,dataset-FCC_MBA,epoch-17000,epoch_checkpoint_freq-70,extra_checkpoint_freq-850,run-,sample_len-,self_norm-False,\sample' folder, there was only an npz file named 'epoch_id-69,batch_id--1,global_id-419,type-free,samples.npz'. The 'aux_disc-False,dataset-google,epoch-400,epoch_checkpoint_freq-1,extra_checkpoint_freq-5,run-0,sample_len-1,self_norm-False,\sample' folder had the same situation. However, in the 'aux_disc-True,dataset-web,epoch-400,epoch_checkpoint_freq-1,extra_checkpoint_freq-5,run-0,sample_len-1,self_norm-True,\sample' folder, there were many files, including lots of pictures and two npz files. But worker.log still had a similar error: FileNotFoundError: [Errno 2] No such file or directory: '..\results\aux_disc-True,dataset-web,epoch-400,epoch_checkpoint_freq-1,extra_checkpoint_freq-5,run-0,sample_len-1,self_norm-True,\sample\epoch_id-0,batch_id-199,global_id-199,type-teacher,attribute,output-3,dim-0.png'

fjxmlzn commented 2 years ago

This looks weird. Could you please attach worker.log in these three folders here? Thank you!

chameleonzz commented 2 years ago


OK, I sent you an email.

fjxmlzn commented 2 years ago

Thank you. Since I believe we found the root cause of this issue, I am closing this issue now.

For future readers of this thread: the issue is that Windows has a maximum path length limit, and a FileNotFoundError is raised when writing to a path that exceeds it.

To reduce the length of paths, we can add some keys into ignored_keys_for_folder_name in the config file so that they do not appear in the folder name. For example, we can change the top part of https://github.com/fjxmlzn/DoppelGANger/blob/master/example_training/config.py to

config = {
    "scheduler_config": {
        "gpu": ["0"],
        "config_string_value_maxlen": 1000,
        "result_root_folder": os.path.join("..", "results"),
        "ignored_keys_for_folder_name": ['extra_checkpoint_freq', 'epoch_checkpoint_freq', 'aux_disc', 'self_norm'],
    },
    # ... rest of the config unchanged
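
With those keys ignored, the result folder names become much shorter than names like 'aux_disc-False,dataset-FCC_MBA,epoch-17000,...' shown earlier in this thread, so the full sample and checkpoint paths are more likely to stay under the Windows limit.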

See https://github.com/fjxmlzn/GPUTaskScheduler for more details on the GPUTaskScheduler config options. Alternatively, we can try moving the entire DoppelGANger folder to a shorter path.
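
As a quick sanity check before launching a long run, one can estimate whether the deepest output path will exceed the classic Windows limit of 260 characters (MAX_PATH). This small sketch is not part of the repo; the folder and file names below are just examples taken from the error messages in this thread.

```python
import os

# Classic Windows MAX_PATH limit; longer paths raise FileNotFoundError
# unless long-path support is enabled system-wide.
MAX_PATH = 260

result_folder = os.path.abspath(os.path.join(
    "..", "results",
    "aux_disc-False,dataset-FCC_MBA,epoch-17000,epoch_checkpoint_freq-70,"
    "extra_checkpoint_freq-850,run-0,sample_len-1,self_norm-False,"))

# One of the deepest files the run will write (taken from the error above).
sample_file = os.path.join(
    result_folder, "sample",
    "epoch_id-69,batch_id--1,global_id-419,type-free,feature,output-0,dim-0.png")

print(len(sample_file), "characters")
if len(sample_file) >= MAX_PATH:
    print("Path too long for Windows: shorten the result folder name "
          "(e.g., via ignored_keys_for_folder_name) or move the repo "
          "to a shorter path.")
```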