amazon-science / tabsyn

Official implementation of "Mixed-Type Tabular Data Synthesis with Score-based Diffusion in Latent Space"
Apache License 2.0

About using our own dataset #10

Open bruno686 opened 8 months ago

bruno686 commented 8 months ago

Hi author, I really appreciate that this paper is open source and well documented. However, I think the instructions for using our own dataset in the README are still not detailed enough. Beyond the steps currently listed, my own attempts showed that I also needed to run "python process_dataset.py" and a few other steps. Spelling these steps out in detail would further expand the project's impact.

Best,

bruno686 commented 8 months ago

Hi, when we use our own dataset with tabddpm, many of the losses become NaN, and training stops with "Finding Nan". Why does this happen?

Step 1/100000 MLoss: 121.1736 GLoss: 1.3087 Sum: 122.4823
Step 2/100000 MLoss: nan GLoss: 82634464.0 Sum: nan
Step 3/100000 MLoss: nan GLoss: 11660486.0 Sum: nan
Step 4/100000 MLoss: nan GLoss: 2420568.0 Sum: nan
Step 5/100000 MLoss: -3.6622 GLoss: 2785576.25 Sum: 2785572.5878
Step 6/100000 MLoss: nan GLoss: 2638546.5 Sum: nan
Step 7/100000 MLoss: nan GLoss: 107786496.0 Sum: nan
Step 8/100000 MLoss: 1.5023 GLoss: 461215.1875 Sum: 461216.6898
Step 9/100000 MLoss: nan GLoss: 45193.2305 Sum: nan
Step 10/100000 MLoss: 200.9354 GLoss: 3201010.5 Sum: 3201211.4354
Step 11/100000 MLoss: nan GLoss: 22431708.0 Sum: nan
Step 12/100000 MLoss: -6.3746 GLoss: 6669858.0 Sum: 6669851.6254
Step 13/100000 MLoss: -8.564 GLoss: 2594830.0 Sum: 2594821.436
Step 14/100000 MLoss: nan GLoss: 1166497.625 Sum: nan
Step 15/100000 MLoss: nan GLoss: 1074179.25 Sum: nan
Step 16/100000 MLoss: -5.7084 GLoss: 605065.25 Sum: 605059.5416
Step 17/100000 MLoss: -7.9741 GLoss: 227135.2969 Sum: 227127.3228
Step 18/100000 MLoss: nan GLoss: 301363.7812 Sum: nan
Step 19/100000 MLoss: 674.9831 GLoss: 88469.3438 Sum: 89144.3269
Step 20/100000 MLoss: nan GLoss: 227379.3438 Sum: nan
Step 21/100000 MLoss: nan GLoss: 14099.8574 Sum: nan

hengruizhang98 commented 8 months ago

> Hi author, I really appreciate that this paper is open source and well documented. However, I think the instructions for using our own dataset in the README are still not detailed enough. Beyond the steps currently listed, my own attempts showed that I also needed to run "python process_dataset.py" and a few other steps. Spelling these steps out in detail would further expand the project's impact.

Yes, you are right: you have to run "python process_dataset.py --name [NAME_OF_YOUR_DATASET]" to process your data. I will revise the README file to reflect this point.
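
For readers looking for a concrete starting point, below is a minimal, hypothetical pre-check of a custom CSV with pandas before running process_dataset.py. The file path, column split, and the decision to drop missing rows are illustrative assumptions rather than the repo's required layout; check process_dataset.py itself for the exact files and metadata it expects.

```python
import numpy as np
import pandas as pd

# Hypothetical path; adjust to wherever your CSV actually lives.
df = pd.read_csv("data/my_dataset/my_dataset.csv")

# 1. Look for missing or non-finite values, which can later surface as NaN losses.
print(df.isna().sum())
num_cols = df.select_dtypes(include=[np.number]).columns
print(np.isinf(df[num_cols]).sum())

# 2. Work out which columns are numerical and which are categorical;
#    the mixed-type pipeline treats the two groups differently.
cat_cols = [c for c in df.columns if c not in num_cols]
print("numerical:", list(num_cols))
print("categorical:", cat_cols)

# 3. Drop (or impute) rows with missing values before preprocessing.
df = df.dropna().reset_index(drop=True)
df.to_csv("data/my_dataset/my_dataset_clean.csv", index=False)
```

Once the data passes these checks, running "python process_dataset.py --name [NAME_OF_YOUR_DATASET]" as described above should produce the processed files that the training scripts expect.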

hengruizhang98 commented 8 months ago

> Hi, when we use our own dataset with tabddpm, many of the losses become NaN, and training stops with "Finding Nan". Why does this happen?

This looks like gradient explosion. I suggest debugging your code and checking the outputs and gradients at each step. It may be caused by a lack of data standardization in preprocessing.
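
As a concrete illustration of this advice, here is a minimal, generic sketch (placeholder data, model, and objective, not tabsyn's actual training code) that standardizes numerical features with scikit-learn, enables PyTorch's anomaly detection to surface the first NaN, and clips gradient norms.

```python
import numpy as np
import torch
from sklearn.preprocessing import QuantileTransformer

# Toy numerical table with wildly different column scales (placeholder data).
rng = np.random.default_rng(0)
x_raw = np.column_stack([rng.normal(0, 1, 1000),
                         rng.lognormal(10, 2, 1000)])  # heavy-tailed column

# 1. Standardize before training; a Gaussian quantile transform is robust
#    to heavy tails (StandardScaler is a simpler alternative).
x = QuantileTransformer(output_distribution="normal").fit_transform(x_raw)
x = torch.tensor(x, dtype=torch.float32)

model = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.ReLU(),
                            torch.nn.Linear(64, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# 2. Make autograd raise at the operation that first produces NaN/inf gradients.
torch.autograd.set_detect_anomaly(True)

for step in range(100):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), x)  # placeholder objective
    if not torch.isfinite(loss):
        print(f"non-finite loss at step {step}")
        break
    loss.backward()
    # 3. Clip exploding gradients before the optimizer step.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```

Unscaled, heavy-tailed numerical columns are a common source of the huge loss values and NaNs seen in the log above, which is why standardizing the raw features is usually the first thing to try.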

arc-arnob commented 1 month ago

Hi! I think the README has not been updated yet. @bruno686, can you post the steps you followed to configure your own dataset? Thanks :)