fjxmlzn / DoppelGANger

[IMC 2020 (Best Paper Finalist)] Using GANs for Sharing Networked Time Series Data: Challenges, Initial Promise, and Open Questions
http://arxiv.org/abs/1909.13403
BSD 3-Clause Clear License

Generating NaN values #14

Closed fahadali127 closed 3 years ago

fahadali127 commented 3 years ago

The model is generating NaN values in the features and attributes. My data has the following shapes:

    print(data_feature.shape)    # (500, 1, 3)
    print(data_attribute.shape)  # (500, 4)
    print(data_gen_flag.shape)   # (500, 1)

fjxmlzn commented 3 years ago

Would you mind providing more details, e.g., the config.py file you are using?

fahadali127 commented 3 years ago

I have data like this:

| a | b | c | d | e0 | e1 | f |
| --- | --- | --- | --- | --- | --- | --- |
| 1618.0 | 2689.0 | 4615.0 | 0 | 0 | 1 | 1 |

d only contains 0 and f only contains 1.

    data_feature_outputs = [
        output.Output(type_=OutputType.CONTINUOUS, dim=1,
                      normalization=Normalization.ZERO_ONE, is_gen_flag=False),
        output.Output(type_=OutputType.CONTINUOUS, dim=1,
                      normalization=Normalization.ZERO_ONE, is_gen_flag=False),
        output.Output(type_=OutputType.CONTINUOUS, dim=1,
                      normalization=Normalization.ZERO_ONE, is_gen_flag=False),
    ]

    data_attribute_outputs = [
        output.Output(type_=OutputType.DISCRETE, dim=1,
                      normalization=None, is_gen_flag=False),
        output.Output(type_=OutputType.DISCRETE, dim=2,
                      normalization=None, is_gen_flag=False),
        output.Output(type_=OutputType.DISCRETE, dim=1,
                      normalization=None, is_gen_flag=False),
    ]

I used this format to prepare the data.

The data gen flag was created like this (the sequence length is 1 in my case):

    data_gen_flag = np.ones((data_feature.shape[0], SEQ_LEN), dtype="float32")

Shapes of all three arrays:

    print(data_feature.shape)    # (500, 1, 3)
    print(data_attribute.shape)  # (500, 4)
    print(data_gen_flag.shape)   # (500, 1)

Now when I run the training notebook, the data looks good before normalizing the samples. After normalization, the data looks like this:

Data feature, shape (500, 1, 5):

    array([[[ 0., nan, nan,  0.,  1.]],

           [[ 0., nan, nan,  0.,  1.]],

           [[ 0., nan, nan,  0.,  1.]],

           ...,

           [[ 0., nan, nan,  0.,  1.]],

           [[ 0., nan, nan,  0.,  1.]],

           [[ 0., nan, nan,  0.,  1.]]])

Data attribute, shape (500, 10) (showing only one sample):

    array([0.0000000e+00, 0.0000000e+00, 1.0000000e+00, 1.0000000e+00,
           1.6180000e+03, 1.2207031e-04, 2.6890000e+03, 0.0000000e+00,
           4.6150000e+03, 0.0000000e+00], dtype=float32)

I think the issue is in the per-sample normalization part. Please let me know what you think.

fjxmlzn commented 3 years ago

Thanks for the details! I can see two potential issues here.

About the feature part, may I know which "training notebook" you are using? There was a bug in a very early version of the normalization code, which could cause this issue. But it was already fixed in https://github.com/fjxmlzn/DoppelGANger/commit/34efa735b6e7179f6cd4a6c1acaf8c9f2e49308a.

About the attribute part, the dim parameter in output.Output means the number of possibilities for that field, and the corresponding dimensions in the data should use one-hot encoding. So if a discrete field has only 1 possibility (i.e., "d only contains 0 and f only contains 1", as you said), then the corresponding dimension in the data should be 1 (the one-hot encoding of a single-category variable is just 1). (Modeling dimensions with fixed values with GANs is not that useful, but I assume you are just doing an initial test of the data and code.)
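To make the encoding concrete, here is a minimal sketch (not code from the repo) of how the attribute part could be built for the columns in the table above; the tiny DataFrame and the column names d/e0/e1/f are assumptions taken from that example:

    import numpy as np
    import pandas as pd

    # Hypothetical DataFrame mirroring the example table above (illustrative only).
    df = pd.DataFrame({"d": [0], "e0": [0], "e1": [1], "f": [1]})
    n_samples = len(df)

    # d has a single category (always 0): its one-hot encoding is a single
    # column of ones, matching dim=1 in data_attribute_outputs.
    d_onehot = np.ones((n_samples, 1), dtype="float32")

    # e has two categories, already split into e0/e1 indicator columns (dim=2).
    e_onehot = df[["e0", "e1"]].to_numpy(dtype="float32")

    # f also has a single category (always 1): again a column of ones (dim=1).
    f_onehot = np.ones((n_samples, 1), dtype="float32")

    # data_attribute then has 1 + 2 + 1 = 4 columns, matching shape (n_samples, 4).
    data_attribute = np.concatenate([d_onehot, e_onehot, f_onehot], axis=1)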

fahadali127 commented 3 years ago

I am using your training code without a GPU. I haven't one-hot encoded the d and f columns. I have only one-hot encoded the e column into e0 and e1, and I set the dimension to 2 in the output.

Yes, I am using this normalization code. I am unable to understand this issue

fjxmlzn commented 3 years ago

About the attribute part, you should use one-hot encoding for all discrete fields. To fix this, it is as simple as changing the d column from zeros to ones (the one-hot encoding of a single-category field is a column of ones).

For the feature part, I just realized another issue. If you set normalization=Normalization.ZERO_ONE, you should make sure that the corresponding dimensions have values between zero and one (by normalizing them before feeding them to DoppelGANger), but it looks like your values are outside that range. That seems to be an issue independent of the nan problem, though, and I cannot see why the nan happens if you are indeed using the latest code. Could you please share the code and data (a sub-sample that reproduces the problem is sufficient) somewhere (e.g., on Google Drive) so that I can take a look?
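One way to satisfy the "values between zero and one" requirement is a global min-max rescale of each feature column before passing the array to DoppelGANger. The sketch below assumes the three continuous columns a/b/c from the table above are already stacked into data_feature with shape (n_samples, seq_len, 3); it is an illustration, not code from the repo:

    import numpy as np

    eps = 1e-8  # guard against a constant column (max == min)

    # Per-column min/max over all samples and time steps; keepdims for broadcasting.
    col_min = data_feature.min(axis=(0, 1), keepdims=True)
    col_max = data_feature.max(axis=(0, 1), keepdims=True)

    # Rescale every value into [0, 1], consistent with
    # normalization=Normalization.ZERO_ONE in data_feature_outputs.
    data_feature = (data_feature - col_min) / (col_max - col_min + eps)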

fahadali127 commented 3 years ago

Yes sure, here is the link to a folder containing a sample of the data and the notebooks I am using to prepare the data and for training. https://drive.google.com/drive/folders/1NGYyAIphe2MmowL5T32sDIM0OKVFeOO0?usp=sharing

fahadali127 commented 3 years ago

It is time-series data.

fjxmlzn commented 3 years ago

Thanks for sharing! I just checked the training code. The normalization part is NOT from the latest version in this repo. Please incorporate https://github.com/fjxmlzn/DoppelGANger/commit/34efa735b6e7179f6cd4a6c1acaf8c9f2e49308a into it.

fahadali127 commented 3 years ago

I am using the same code.

def normalize_per_sample(data_feature, data_attribute, data_feature_outputs,
                         data_attribute_outputs, eps=1e-4):
    # assume all samples have maximum length
    data_feature_min = np.amin(data_feature, axis=1)
    data_feature_max = np.amax(data_feature, axis=1)

    additional_attribute = []
    additional_attribute_outputs = []
    dim = 0
    for output in data_feature_outputs:
        if output.type_ == OutputType.CONTINUOUS:
            for _ in range(output.dim):
                max_ = data_feature_max[:, dim] + eps
                min_ = data_feature_min[:, dim] - eps

                additional_attribute.append((max_ + min_) / 2.0)
                additional_attribute.append((max_ - min_) / 2.0)

                additional_attribute_outputs.append(Output(
                    type_=OutputType.CONTINUOUS,
                    dim=1,
                    normalization=output.normalization,
                    is_gen_flag=False))
                additional_attribute_outputs.append(Output(
                    type_=OutputType.CONTINUOUS,
                    dim=1,
                    normalization=Normalization.ZERO_ONE,
                    is_gen_flag=False))

                max_ = np.expand_dims(max_, axis=1)
                min_ = np.expand_dims(min_, axis=1)

                data_feature[:, :, dim] = \
                    (data_feature[:, :, dim] - min_) / (max_ - min_)

                if output.normalization == Normalization.MINUSONE_ONE:
                    data_feature[:, :, dim] = \
                        data_feature[:, :, dim] * 2.0 - 1.0

                dim += 1
        else:
            dim += output.dim

    real_attribute_mask = ([True] * len(data_attribute_outputs) +
                           [False] * len(additional_attribute_outputs))

    additional_attribute = np.stack(additional_attribute, axis=1)

    data_attribute = np.concatenate(
        [data_attribute, additional_attribute], axis=1)

    data_attribute_outputs.extend(additional_attribute_outputs)

    return data_feature, data_attribute, data_attribute_outputs, \
        real_attribute_mask


fahadali127 commented 3 years ago

I think the issue is in this part:

    data_feature[:, :, dim] = (data_feature[:, :, dim] - min_) / (max_ - min_)

Please have a look at the normalize_per_sample and renormalize_per_sample functions.

fjxmlzn commented 3 years ago

Your training_notebook.ipynb did NOT call gan.util.normalize_per_sample at all. It normalizes the samples directly in training_notebook.ipynb, and it does so the old way, without https://github.com/fjxmlzn/DoppelGANger/commit/34efa735b6e7179f6cd4a6c1acaf8c9f2e49308a incorporated:

data_feature_min = np.amin(data_feature, axis=1)
data_feature_max = np.amax(data_feature, axis=1)

additional_attribute = []
additional_attribute_outputs = []

dim = 0
for output in data_feature_outputs:
    if output.type_ == OutputType.CONTINUOUS:
        for _ in range(output.dim):
            max_ = data_feature_max[:, dim]
            min_ = data_feature_min[:, dim]

            additional_attribute.append((max_ + min_) / 2.0)
            additional_attribute.append((max_ - min_) / 2.0)
            additional_attribute_outputs.append(Output(
                type_=OutputType.CONTINUOUS,
                dim=1,
                normalization=output.normalization,
                is_gen_flag=False))
            additional_attribute_outputs.append(Output(
                type_=OutputType.CONTINUOUS,
                dim=1,
                normalization=Normalization.ZERO_ONE,
                is_gen_flag=False))

            max_ = np.expand_dims(max_, axis=1)
            min_ = np.expand_dims(min_, axis=1)

            data_feature[:, :, dim] = \
                (data_feature[:, :, dim] - min_) / (max_ - min_)
            if output.normalization == Normalization.MINUSONE_ONE:
                data_feature[:, :, dim] = \
                    data_feature[:, :, dim] * 2.0 - 1.0

            dim += 1
    else:
        dim += output.dim

Moreover, I think you only uploaded the main script (training_notebook.ipynb), without the other Python files it depends on, so I have no idea what versions of those files you are using. (But that doesn't matter, as the script does not call gan.util.normalize_per_sample at all.)
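For concreteness, here is a small self-contained sketch (with made-up values, not the user's data) of why the missing change matters here: with a sequence length of 1, the per-sample max equals the per-sample min, so the old normalization divides by zero and produces nan, while the eps added in 34efa73 (visible in the util.py version quoted earlier in this thread) keeps the denominator positive.

    import numpy as np

    feature = np.array([[[1618.0]]])   # shape (1, 1, 1): one sample, seq_len 1
    f_min = np.amin(feature, axis=1)   # equals f_max because seq_len is 1
    f_max = np.amax(feature, axis=1)

    # Old notebook behaviour: (x - min) / (max - min) = 0 / 0 -> nan
    old = (feature[:, :, 0] - f_min) / (f_max - f_min)
    print(old)                         # [[nan]]

    # With the eps fix, the denominator is never zero
    eps = 1e-4
    new = (feature[:, :, 0] - (f_min - eps)) / ((f_max + eps) - (f_min - eps))
    print(new)                         # [[0.5]]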

fjxmlzn commented 3 years ago

I'll close the issue for now. Feel free to reopen it if you still experience issues :)