BAAI-DCAI / Training-Data-Synthesis

[ICLR 2024] Real-Fake: Effective Training Data Synthesis Through Distribution Matching
MIT License

Question about the theory #3

Open yuntaodu opened 7 months ago

yuntaodu commented 7 months ago

Hi, thanks for your great work. I am a bit confused about how the first pivotal factor, (1) training and testing data distribution discrepancy, follows from Eq. (3). It is not obvious to me; could you share your understanding?

YuanJianhao508 commented 7 months ago

Hi yuntaodu, thanks for your question! Eq. (3) basically says that, with high probability (higher than 1-a, where a is small), the difference between the testing error and the training error is lower than a constant (the square-root expression). For this statement to make sense, the training set S needs to be sampled from the target distribution D (as shown on the far left-hand side of Eq. (3)). This holds almost trivially when people randomly split a large dataset into train/val/test sets, since all the data are collected in the same way and come from the same distribution. For data synthesis, however, there is no guarantee (at least for some data synthesis methods) that the synthetic data are actually in-distribution with respect to the target distribution.
I hope this helps! You may also find Appendix B helpful.
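
To make the point about sampling S from D concrete, here is a minimal toy sketch. It is not the paper's method or code. It assumes a 1-D two-class Gaussian problem and a simple midpoint-threshold classifier, and every name and parameter in it (`sample`, `fit_threshold`, the class means, the sample sizes) is hypothetical and chosen only for illustration. It compares the gap between training and testing error when the training set comes from the target distribution versus from a shifted "synthetic" one.

```python
import random


def sample(n, mean0, mean1, rng):
    """Draw n labelled 1-D points: class 0 ~ N(mean0, 1), class 1 ~ N(mean1, 1)."""
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)
        x = rng.gauss(mean1 if y else mean0, 1.0)
        data.append((x, y))
    return data


def fit_threshold(data):
    """Toy hypothesis: predict class 1 above the midpoint of the two class means."""
    xs0 = [x for x, y in data if y == 0]
    xs1 = [x for x, y in data if y == 1]
    return 0.5 * (sum(xs0) / len(xs0) + sum(xs1) / len(xs1))


def error(data, thr):
    """Fraction of points misclassified by the threshold rule."""
    return sum((x > thr) != (y == 1) for x, y in data) / len(data)


def generalization_gap(train, test):
    """|test error - train error| for the classifier fitted on `train`."""
    thr = fit_threshold(train)
    return abs(error(test, thr) - error(train, thr))


rng = random.Random(0)

# Assumed target distribution D: class means at -1 and +1.
test_real = sample(5000, -1.0, 1.0, rng)

# Case 1: training set S drawn from D, as the bound requires.
train_iid = sample(5000, -1.0, 1.0, rng)

# Case 2: "synthetic" training set from a shifted distribution (means 0 and +3).
train_shifted = sample(5000, 0.0, 3.0, rng)

gap_iid = generalization_gap(train_iid, test_real)
gap_shifted = generalization_gap(train_shifted, test_real)

print(f"gap with S ~ D:       {gap_iid:.3f}")
print(f"gap with shifted S:   {gap_shifted:.3f}")
```

In this toy setting the gap stays small when S is drawn from D and becomes much larger under the shift, which is the situation the bound no longer covers.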

yuntaodu commented 7 months ago

Thanks for your reply. I understand what you mean now. It is mainly because Eq. (3) relies on the IID assumption, where the test samples (real data) and the training samples (synthetic data) come from the same distribution. So the distribution of the synthetic data should be close to that of the real data.

Besides, I also have some questions about the code. 1) The "diffusers" folder is empty when I download the code. Is there supposed to be anything in this folder, or is there a link to its contents? 2) How can we apply this method to generate data for other datasets, such as CIFAR100, STL, and so on?

YuanJianhao508 commented 7 months ago

Hi yuntaodu, nice to hear that your previous question is resolved! Regarding the code, we will release it shortly, and you can then follow the steps described in the README file to generate data for other datasets!