cadurosar / pgdl

PGDL competition

Unable to reproduce the performance #1

Open chingyaoc opened 3 years ago

chingyaoc commented 3 years ago

Hi, I hope you are doing well!

I attempted to reproduce the performance of WCV on public tasks 1 and 2. However, the mutual information scores I got are 0.172 and 0.317, respectively. I did not modify any code. Would you mind providing some hints on how to reproduce the results?

Thanks!

cadurosar commented 3 years ago

Hi,

Indeed, it seems that something changes the dataset order and thus the results (as we only use one graph for WCV, the results have a very high variance depending on the order of the dataset). As I no longer have access to the machine on which I ran the first experiments, the results I get when I try to reproduce are different both from yours and from the ones I had before: 0.12 and 0.31. Note that we provided the metrics for each network and each test in the folder ingestion/model, so it should be possible to retest this and find the culprit.

I am sorry for this. I am trying to see what is impacting the dataset order (maybe the TensorFlow version? the order in which the files were extracted?) and will get back to you as soon as I have a clue, but it could take me a while.

cadurosar commented 3 years ago

As I expected, the problem is that the data is not loaded in a deterministic order. Thanks for finding this bug. I have uploaded a quick fix that should allow reproducing the WCV and VR results (VPM_1 and VPM_80 will need more testing to ensure that the order is correct):

TF_CPP_MIN_LOG_LEVEL=3 python ingestion_program/ingestion_tqdm_public.py {INPUT_DATA_PATH} ingestions/{SUBMISSION_NAME} ingestion_program {SUBMISSION_NAME}

@chingyaoc, could you please test the quick fix on your end and see whether the VR and WCV results are reproducible? Unfortunately, this quick fix only works for the public data; I will have to run more tests to ensure that it works for the VPM model (as it uses more data and has random factors of its own) and for the phase_one/phase_two datasets.
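For anyone hitting a similar issue elsewhere, the general idea behind this kind of fix is to pin the file order before building the dataset. A minimal sketch assuming the data is read with tf.data (the paths and glob pattern below are hypothetical, not the repository's actual layout):

```python
import glob
import tensorflow as tf

# Hypothetical pattern; adapt to the actual dataset layout.
# glob's result order is filesystem-dependent, so sort explicitly.
file_paths = sorted(glob.glob("public_data/input_data/task1/*"))

# Build the dataset from the fixed list instead of an unordered
# listing. (tf.data.Dataset.list_files shuffles by default; either
# pass shuffle=False or, as here, hand it an already-sorted list.)
dataset = tf.data.Dataset.from_tensor_slices(file_paths)
```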

chingyaoc commented 3 years ago

Hi @cadurosar, thanks for the update! I am able to reproduce the performance of WCV on task 1 and task 2 now. Would you mind providing some intuition about why deterministic order matters?

Thank you for your assistance :)

cadurosar commented 3 years ago

> Hi @cadurosar, thanks for the update! I am able to reproduce the performance of WCV on task 1 and task 2 now.

That's great to hear!

> Would you mind providing some intuition about why deterministic order matters?

Yes, for sure. The method is based on latent graphs. These graphs are built from samples of the training set, and we then compute a metric over them (WCV, for example, takes the worst label smoothness over graphs built from the same samples but using similarities from different layers of the network). Ideally, one would sample repeatedly to generate a bunch of graphs and aggregate them (sum, average, median) to reduce the variance of the results.
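To make "label smoothness over a graph" concrete, here is a minimal sketch in numpy. It is not the competition code, just the general idea: build a k-NN graph from one layer's features and measure how often neighbouring samples disagree on their labels. The function names, the choice of k, and the use of cosine similarity are assumptions for illustration.

```python
import numpy as np

def label_smoothness(features, labels, k=10):
    # Cosine similarity between every pair of samples (rows of `features`).
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)  # exclude self-edges

    # k nearest neighbours of each sample in this layer's feature space.
    knn = np.argsort(-sims, axis=1)[:, :k]

    # Fraction of graph edges whose endpoints carry different labels:
    # low values mean neighbours agree, i.e. the graph is "smooth".
    return float((labels[:, None] != labels[knn]).mean())

def worst_case_smoothness(per_layer_features, labels):
    # WCV-style aggregation: keep the worst (largest) disagreement
    # across the graphs built from each layer's similarities.
    return max(label_smoothness(f, labels) for f in per_layer_features)
```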

However, WCV as coded uses only one graph (due to computational constraints we had at the time). Consider that it collects 50 examples per class: with a single sampling, the statistics of that sample may not reflect the overall dataset, leading to higher variance, especially if the training set order changes. A deterministic order for collecting the examples is therefore a must for reproducibility. With more graphs we could get "close-enough" reproducibility even without a deterministic order, but unfortunately we could not do that during the competition. A small sketch of why the order matters follows below.
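The sketch assumes (for illustration only) that the 50 examples per class are taken in whatever order the dataset yields them:

```python
import numpy as np

def first_n_per_class(labels, n=50):
    # Walk the dataset in its given order and keep the first n
    # indices seen for each class. If the dataset order changes
    # between runs, a *different* subset is returned, so the graph
    # built from it (and hence the WCV score) changes too.
    picked, counts = [], {}
    for idx, y in enumerate(labels):
        if counts.get(y, 0) < n:
            picked.append(idx)
            counts[y] = counts.get(y, 0) + 1
    return np.array(picked)
```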

Hope that helped, and if you have any more questions please do not hesitate to ask.