Team-TUD / CTAB-GAN-Plus

Official GitHub for CTAB-GAN+

ValueError: Cannot convert non-finite values (NA or inf) to integer #12

Closed anderdnavarro closed 12 months ago

anderdnavarro commented 1 year ago

Hi,

I'm trying to train your model with a dataset that only contains one integer variable (1 col x 22928 rows), but after the training (the model is saved) I obtain the following error:

Traceback (most recent call last):
  File "/ander/ctab-gan/scripts/train_CTABGAN_integer.py", line 48, in <module>
    training()
  File "/ander/venv/ctabgan20/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/ander/ctabgan20/lib/python3.7/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/ander/ctabgan20/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/ander/ctabgan20/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/ander/ctab-gan/scripts/train_CTABGAN_integer.py", line 33, in training
    syn = synthesizer.generate_samples(10000)
  File "/ander/ctab-gan/model/ctabgan.py", line 62, in generate_samples
    sample_df = self.data_prep.inverse_prep(sample)
  File "/ander/ctab-gan/model/pipeline/data_preparation.py", line 125, in inverse_prep
    df_sample[column] = df_sample[column].astype(int)
  File "/ander/ctabgan20/lib/python3.7/site-packages/pandas/core/generic.py", line 5877, in astype
    new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
  File "/ander/ctabgan20/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 631, in astype
    return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
  File "/ander/ctabgan20/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 427, in apply
    applied = getattr(b, f)(**kwargs)
  File "/ander/ctabgan20/lib/python3.7/site-packages/pandas/core/internals/blocks.py", line 673, in astype
    values = astype_nansafe(vals1d, dtype, copy=True)
  File "/ander/ctabgan20/lib/python3.7/site-packages/pandas/core/dtypes/cast.py", line 1068, in astype_nansafe
    raise ValueError("Cannot convert non-finite values (NA or inf) to integer")
ValueError: Cannot convert non-finite values (NA or inf) to integer

I could confirm that the error occurs because sample = self.synthesizer.sample(n) returns NA values in the ctabgan.py file, but I don't know what is happening or how to fix it. I narrowed the problem down to one specific line of the ctabgan_synthesizer.py script: fake = self.generator(noisez).
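A check of the kind described above can be sketched as follows. This is only an illustration, not code from the repository: `report_non_finite` is a hypothetical helper, and `fake_sample` is a stand-in for the array returned by the synthesizer's sample(n) call.

```python
import numpy as np

def report_non_finite(sample):
    """Count NaN/inf entries per column of a 2-D array of generated rows."""
    sample = np.asarray(sample, dtype=float)
    bad = ~np.isfinite(sample)  # True where a value is NaN, +inf or -inf
    counts = bad.sum(axis=0)
    # Keep only the columns that actually contain non-finite values.
    return {col: int(n) for col, n in enumerate(counts) if n > 0}

# Hypothetical stand-in for the generated sample (not real model output).
fake_sample = np.array([[1.0, 2.0], [np.nan, 3.0], [np.inf, 4.0]])
print(report_non_finite(fake_sample))
```

An empty dictionary means every generated value is finite, so the later integer cast should not fail for this reason.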

My data doesn't have NA values and follows this density distribution: [image: data_density_distribution]

I have the same problem when training the model with a dataset with two variables (one integer and the other categorical).

Thank you very much! Ander

zhao-zilong commented 1 year ago

Hi @anderdnavarro Could you show me how you set up the parameters? Did you specify only the integer column? Your problem is indeed unusual. How many epochs did you train for? You can try training for a single epoch, because one epoch is enough for CTABGAN to generate data.

Best,

Zilong

anderdnavarro commented 1 year ago

Hi @zhao-zilong,

Thank you very much for your quick response!

The parameters are:

synthesizer = CTABGAN(raw_csv_path=csv,
        test_ratio=0.3,
        categorical_columns=[],
        log_columns=[],
        mixed_columns={},
        general_columns=[],
        integer_columns=['value'],
        problem_type={None: None},
        epochs=epochs,
        batch_size=batch_size)

I read something in #3 related to the mixed type, so I ran a couple of tests with mixed_columns = {'value':[0]}, just to see what happens, although that's not my case. With this configuration I don't get any result, but now the problem is related to #7. I'm running another attempt with more epochs, and if that fails I'll change the code so that it doesn't check the quality of the simulation.

Let me know if you need more information.

Thanks! Ander

zhao-zilong commented 1 year ago

Hi @anderdnavarro

I don't think your data has a mixed-type column. Just to make sure the problem isn't your data, can you use only the first 100 rows to train your model and tell me the result? You don't need to train for many epochs; one is enough to test that. I don't think the bug is epoch-related.

Best,

Zilong

anderdnavarro commented 1 year ago

I ran several tests (repeating some of them to see whether I always get the same result):

When it fails, it is always with the same original error: ValueError: Cannot convert non-finite values (NA or inf) to integer

I can share with you the training file if you want.

Thanks!! Ander

zhao-zilong commented 1 year ago

@anderdnavarro
Yes, please. My email is imzhaozilong@gmail.com. This is really strange.

Zilong

zhao-zilong commented 1 year ago

Hi @anderdnavarro This is indeed an interesting bug, but unfortunately I can only reproduce it about once in 20 tries. I don't know why, and it is difficult to reproduce on my side. But actually, I can give you a hacky workaround for this. Just remove 'value' from this setting:

integer_columns = ['value'],

Just let it generate float data, and then do the conversion explicitly:

syn['value'] = syn['value'].astype(int)

With that, I never encountered the problem again. I originally wanted to use this method to debug, but after changing the setting like that, I never hit this bug again. Give it a try; I look forward to your feedback.
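A minimal sketch of that explicit conversion, assuming the synthetic data is a pandas DataFrame named `syn` with a float column 'value'. The helper name `to_int_column` is hypothetical and not part of CTAB-GAN+. It checks for non-finite values first, so that a diverged generator produces a descriptive error rather than the generic pandas one.

```python
import numpy as np
import pandas as pd

def to_int_column(series):
    """Cast a float column to int, failing early if it holds NaN or inf."""
    values = series.astype(float)
    finite = np.isfinite(values)
    if not finite.all():
        n_bad = int((~finite).sum())
        raise ValueError(f"{n_bad} non-finite values; the generator may have diverged")
    # astype(int) truncates toward zero, as in the one-line conversion above.
    return values.astype(int)

# Illustrative data only, not real model output.
syn = pd.DataFrame({"value": [1.2, 3.7, 10.0]})
syn["value"] = to_int_column(syn["value"])
print(syn["value"].tolist())
```

Note that the result has to be assigned back to the column; calling astype(int) on its own returns a new Series and leaves `syn` unchanged.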

Zilong

anderdnavarro commented 1 year ago

Hi @zhao-zilong,

I'm still seeing the issue. The result after the syn['value'].astype(int) transformation, with different epochs and batch sizes, is:

value
""
""
""
""
""

And I get the same ValueError: Cannot convert non-finite values (NA or inf) to integer error after syn['value'].astype(int), as expected.

I double checked that my conda environment has the same version of the packages, and it does:

name: ctabgan
dependencies:
  - python=3.7
  - pip=20.2.4
  - pandas=1.2.4
  - scipy=1.4.1
  - biopython=1.78
  - jupyter=1.0.0
  - pip:
    - numpy==1.21.0
    - scikit-learn==0.24.1
    - torch==1.9.1
    - dython==0.6.4.post1
    - tqdm==4.65.0
    - pyfaidx==0.7.2.1
    - click==8.1.2

If you are using another Python version or see something odd, I think I can create a Docker image with these requirements, just to check that the problem is not caused by that.

BTW, I have all the models saved (.pkl), will they be useful for you?

Thanks! Ander

zhao-zilong commented 1 year ago

Hi @anderdnavarro my environment:

python 3.10.12
numpy 1.25.2
pandas 2.1.0
torch 2.0.1 
dython 0.5.1
scipy 1.11.2

I just tested on a random computer, not the original one from which I published this code, but the above environment lets me generate data without problems.

Zilong

zhao-zilong commented 1 year ago

Hi @anderdnavarro

Some updates here. I took some time to investigate this problem. Interestingly, when this bug happens, all the generator parameters actually become NaN. In the end, I located the bug: it comes from the calculation of the gradient penalty, in this line: https://github.com/Team-TUD/CTAB-GAN-Plus/blob/6d72fda3a9f382339e55cb4b35befced4c1f3508/model/synthesizer/ctabgan_synthesizer.py#L323 In some situations, it becomes NaN. So the problem becomes "the gradient penalty of the Wasserstein GAN becomes NaN". I searched online and found this: https://github.com/Team-TUD/CTAB-GAN-Plus/blob/6d72fda3a9f382339e55cb4b35befced4c1f3508/model/synthesizer/ctabgan_synthesizer.py#L323 It seems you can add a small number to the calculated gradient to solve that. But to be honest, I didn't find an exact solution to this problem, so I will leave it as it is.
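For illustration, the "add a small number" idea is commonly written as an epsilon inside the square root of the gradient norm. The sketch below is a generic WGAN-GP penalty under that assumption, not the code at the linked line. The function name, `lambda_`, `eps`, and the toy linear critic are all hypothetical, and as the follow-up in this thread shows, this kind of change did not resolve the issue for the reporter.

```python
import torch

def gradient_penalty(critic, real, fake, lambda_=10.0, eps=1e-12):
    """Generic WGAN-GP penalty with eps inside the square root (assumed fix)."""
    # Random interpolation between real and fake rows.
    alpha = torch.rand(real.size(0), 1, device=real.device)
    interpolates = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    scores = critic(interpolates)
    grads = torch.autograd.grad(
        outputs=scores,
        inputs=interpolates,
        grad_outputs=torch.ones_like(scores),
        create_graph=True,  # keep the graph so the penalty can be backpropagated
        retain_graph=True,
    )[0]
    grads = grads.view(grads.size(0), -1)
    # sqrt(sum(g^2) + eps) keeps the derivative finite when the gradient is zero.
    norms = torch.sqrt(torch.sum(grads ** 2, dim=1) + eps)
    return lambda_ * ((norms - 1) ** 2).mean()

# Toy critic and random data, for demonstration only.
critic = torch.nn.Linear(4, 1)
real, fake = torch.randn(8, 4), torch.randn(8, 4)
penalty = gradient_penalty(critic, real, fake)
penalty.backward()
print(torch.isfinite(penalty).item(), torch.isfinite(critic.weight.grad).all().item())
```

Without the epsilon, the derivative of the norm is the gradient divided by its own norm, which is 0/0 when the gradient vanishes; that is one way a NaN can enter the critic's parameters and then spread to the generator.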

Best,

Zilong

anderdnavarro commented 1 year ago

Hi @zhao-zilong,

I updated my environment but I still have the same issue, although it's true that I can now train for slightly more epochs before it appears.

I followed both suggestions made in "Gradient of gradient explodes(nan) when training WGAN-GP on Mnist #2534":

But it didn't solve the problem. I don't really know what is happening, because I trained the model with the whole database (10 variables, including this one) and it worked.

Now that you have discovered that this is due to an error in the calculation of the gradient penalty, I think I will continue with your previous model "CTAB-GAN" just for this case, since this step is not implemented there.

Thank you very much for your help!! Ander