fjxmlzn / DoppelGANger

[IMC 2020 (Best Paper Finalist)] Using GANs for Sharing Networked Time Series Data: Challenges, Initial Promise, and Open Questions
http://arxiv.org/abs/1909.13403
BSD 3-Clause Clear License

Attribute problematic result #41

Closed tzimbolis closed 1 year ago

tzimbolis commented 1 year ago

Hello! I have successfully run the DoppelGANger code on autonomous vehicle-pedestrian interaction data. The problem is that the resulting samples are somewhat illogical. For example, a discrete attribute that ranged from 0 to around 10 in the input data has value 1 for every single generated sample. Do you have any input on what I could do (if anything) to fix that? Thanks!

fjxmlzn commented 1 year ago

Could you share more details on the hyper-parameters that you are using, and examples of some rows in data_feature, data_attribute, and data_gen_flag in data_train.npz?

dgtriantis commented 1 year ago

An additional problem I found is that the resulting data samples contain values greater than 1 and less than -1, and the values do not conform adequately to the input data.

The hyperparameters are:

    "batch_size": 100,
    "vis_freq": 200,
    "vis_num_sample": 5,
    "d_rounds": 1,
    "g_rounds": 1,
    "num_packing": 1,
    "noise": True,
    "feed_back": False,
    "g_lr": 0.001,
    "d_lr": 0.001,
    "d_gp_coe": 10.0,
    "gen_feature_num_layers": 1,
    "gen_feature_num_units": 100,
    "gen_attribute_num_layers": 3,
    "gen_attribute_num_units": 100,
    "disc_num_layers": 5,
    "disc_num_units": 200,
    "initial_state": "random",
    "attr_d_lr": 0.001,
    "attr_d_gp_coe": 10.0,
    "g_attr_d_coe": 1.0,
    "attr_disc_num_layers": 5,
    "attr_disc_num_units": 200

    "epoch": [600],
    "run": [0],
    "sample_len": [1, 5, 10, 20],
    "extra_checkpoint_freq": [5],
    "epoch_checkpoint_freq": [1],
    "aux_disc": [True],
    "self_norm": [True]

data_feature:

    [[[ 0.40238473,  0.54027295, -0.06659341,  0.13676707],
    [ 0.40435192,  0.5429528 , -0.05841919,  0.1444367 ],
    [ 0.40502715,  0.54378265, -0.05112771,  0.15125336],
    ...,
    [ 0.        ,  0.        ,  0.        ,  0.        ],
    [ 0.        ,  0.        ,  0.        ,  0.        ],
    [ 0.        ,  0.        ,  0.        ,  0.        ]],

   [[ 0.4190874 ,  0.22841692, -0.06659341,  0.13676707],
    [ 0.41735172,  0.2271396 , -0.05560642,  0.14006932],
    [ 0.41562375,  0.22764418, -0.04452052,  0.14354612],
    ...,
    [ 0.        ,  0.        ,  0.        ,  0.        ],
    [ 0.        ,  0.        ,  0.        ,  0.        ],
    [ 0.        ,  0.        ,  0.        ,  0.        ]],

   ...,

   [[ 0.375146  ,  0.56082535, -0.06659341,  0.13676707],
    [ 0.3768556 ,  0.5613334 , -0.05965237,  0.14395174],
    [ 0.3786404 ,  0.5616295 , -0.05256618,  0.15087192],
    ...,
    [ 0.        ,  0.        ,  0.        ,  0.        ],
    [ 0.        ,  0.        ,  0.        ,  0.        ],
    [ 0.        ,  0.        ,  0.        ,  0.        ]],

   [[ 0.16939865,  0.5071855 , -0.06659341,  0.13676707],
    [ 0.17080542,  0.5078957 , -0.05965237,  0.14395174],
    [ 0.17278804,  0.5082092 , -0.05256618,  0.15087192],
    ...,
    [ 0.        ,  0.        ,  0.        ,  0.        ],
    [ 0.        ,  0.        ,  0.        ,  0.        ],
    [ 0.        ,  0.        ,  0.        ,  0.        ]],

   [[ 0.4324977 ,  0.5378058 , -0.06659341,  0.13676707],
    [ 0.4324977 ,  0.5378058 , -0.05965237,  0.14395174],
    [ 0.4324977 ,  0.5378058 , -0.05256618,  0.15087192],
    ...,
    [ 0.        ,  0.        ,  0.        ,  0.        ],
    [ 0.        ,  0.        ,  0.        ,  0.        ],
    [ 0.        ,  0.        ,  0.        ,  0.        ]]],
  dtype=float32)

data_attribute (max value is 63, min value is 0):

   [[ 1.],
   [ 1.],
   [13.],
   ...,
   [17.],
   [17.],
   [17.]], dtype=float32)

data_gen_flag:

   [[1., 1., 1., ..., 0., 0., 0.],
   [1., 1., 1., ..., 0., 0., 0.],
   [1., 1., 1., ..., 0., 0., 0.],
   ...,
   [1., 1., 1., ..., 0., 0., 0.],
   [1., 1., 1., ..., 0., 0., 0.],
   [1., 1., 1., ..., 0., 0., 0.]], dtype=float32)
dgtriantis commented 1 year ago

I posted this question from a different account by mistake; I am the same person as @tzimbolis.

fjxmlzn commented 1 year ago

Thank you! Could you please also provide the content inside "data_feature_output.pkl" and "data_attribute_output.pkl"?

dgtriantis commented 1 year ago
    data_feature_output = [Output(type_=OutputType.CONTINUOUS, dim=4,
        normalization=Normalization.MINUSONE_ONE, is_gen_flag=False)]

    data_attribute_output = [Output(type_=OutputType.DISCRETE, dim=1,
        normalization=None, is_gen_flag=False)]

If you'd like I can also send you an email with the data, in case you want to check something yourself. Thanks

fjxmlzn commented 1 year ago

Thanks for the information.

One issue is that the data attribute needs to be stored in one-hot encoding, i.e.:

- if the value is 0, that row should be [1, 0, 0, ...] (63 zeros after the 1)
- if the value is 1, that row should be [0, 1, 0, 0, ...] (62 zeros after the 1)
- ...
- if the value is 63, that row should be [0, 0, ..., 0, 1] (63 zeros before the 1)

The shape of data_attribute should be [number of samples, 64].

In addition, data_attribute_output should be

data_attribute_output = [Output(type_=OutputType.DISCRETE, dim=64, normalization=None, is_gen_flag=False)]

Please let me know if the results are still not as expected after fixing this.
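A minimal sketch of this one-hot conversion (the raw values below are hypothetical, mirroring the `data_attribute` excerpt above; only numpy is assumed):

```python
import numpy as np

# Hypothetical raw attribute values in [0, 63], shape [num_samples, 1]
raw_attribute = np.array([[1.], [1.], [13.], [17.]], dtype=np.float32)

num_categories = 64
indices = raw_attribute.astype(np.int64).flatten()

# One-hot encode: row i gets a 1 at the column given by the raw value
data_attribute = np.zeros((len(indices), num_categories), dtype=np.float32)
data_attribute[np.arange(len(indices)), indices] = 1.0

print(data_attribute.shape)  # (4, 64)
```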

dgtriantis commented 1 year ago

After making the proposed changes (data_attribute_output = [Output(type_=OutputType.DISCRETE, dim=max_len, normalization=None, is_gen_flag=False)], where max_len = 64), the generated attributes are not discrete:

    array([[5.0039098e-08, 6.6108881e-03, 2.2625262e-01, ..., 4.9293556e-09,
    2.1883304e-10, 1.4889185e-08],
   [1.7649402e-17, 4.5475384e-04, 1.2486913e-19, ..., 9.4630347e-23,
    4.4634356e-19, 2.8835797e-20],
   [1.9636389e-18, 1.1713755e-03, 8.3317159e-04, ..., 6.3824905e-21,
    8.5015989e-24, 4.3212665e-19],
   ...,
   [4.0655615e-07, 3.0217332e-01, 1.0044121e-05, ..., 2.7380713e-09,
    3.3998028e-08, 4.3662823e-08],
   [1.7798087e-16, 1.7455160e-06, 7.8983231e-20, ..., 1.3778067e-18,
    8.6902809e-19, 1.6815877e-18],
   [1.6348466e-08, 6.6678558e-06, 1.1912930e-09, ..., 1.1338585e-11,
    1.4493015e-10, 2.1054688e-11]], dtype=float32)

Do you have any idea why?

dgtriantis commented 1 year ago

Also, the generated sample features don't make sense. What could I do about that?

fjxmlzn commented 1 year ago

The code does not discretize the generated attributes yet; they are the raw outputs from softmax. You will need to manually use argmax to get the discrete version of the generated attributes.
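As a sketch of that post-processing step (the softmax outputs below are made up, and dim=3 is used instead of 64 just for brevity):

```python
import numpy as np

# Hypothetical softmax outputs from the generator (one row per sample)
generated_attribute = np.array(
    [[0.01, 0.90, 0.09],
     [0.70, 0.20, 0.10]], dtype=np.float32)

# Recover the discrete attribute value per sample via argmax
discrete_attribute = generated_attribute.argmax(axis=1)
print(discrete_attribute)  # [1 0]
```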

Can you share more details on how "the generated sample features don't make sense"? E.g., in what metrics?

dgtriantis commented 1 year ago

So the data that I'm looking to synthesize are pedestrian-autonomous vehicle interactions. As you can understand, the trajectories are approximately linear in almost all cases (see an example in the first picture below). The generated data, on the other hand, consist of convoluted trajectories, which are not realistic for such interactions; in most cases they also intersect, something that does not happen in any of the approx. 6000 interactions that I input to DoppelGANger (you can see an example in the second picture).

It could be said that there are certain constraints (e.g., a specific limit on the difference between coordinates at two consecutive timesteps) that the generated trajectories need to satisfy. Could something like that be integrated into the DoppelGANger code (by myself)? If not, is there something else I could try to "rationalize" the generated samples?

fjxmlzn commented 1 year ago

Regarding "there are values over 1 and less than -1": this can happen when self_norm is turned on. Could you please try setting both self_norm and aux_disc to False and see what the results look like?

Regarding 'generated trajectories that need to be in order': a simpler version of that can be achieved with a data preprocessing trick. For example, if we want the x coordinate always to increase, we can preprocess a trajectory to be 'delta x' instead of x. More specifically, assume that the original trajectory is [x_0, x_1, ..., x_t]; we can add another metadata x_0, and change the time series to [x_1', ..., x_t'] where x_i' = x_i - x_{i-1}. The real x_i' will always be > 0, and so will the generated data. We can then transform the generated x_0 and x_i' back to the original x_i, which will always increase.

This trick has been used in our follow-up paper for learning strictly increasing timestamps: https://dl.acm.org/doi/pdf/10.1145/3544216.3544251
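A minimal sketch of this delta transformation and its inverse (the trajectory values are made up; only numpy is assumed):

```python
import numpy as np

# Hypothetical increasing trajectory [x_0, ..., x_t]
x = np.array([0.0, 0.3, 0.7, 1.2], dtype=np.float32)

# Attribute (metadata): the starting position x_0
x0 = x[0]

# Feature (time series): deltas x_i' = x_i - x_{i-1}; all > 0
# because the original trajectory is increasing
deltas = np.diff(x)

# Recover the original trajectory from x0 and the deltas
recovered = np.concatenate([[x0], x0 + np.cumsum(deltas)])
```

After generation, the same cumulative-sum step recovers trajectories from the generated x_0 attribute and delta features.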

dgtriantis commented 1 year ago

The results shared in my previous comment came from input data that were preprocessed in a similar way (the AV coordinates at timestep zero were taken as the origin of the coordinate system for each time series). Regarding "there are values over 1 and less than -1": that is no longer a problem, but the results are still not logical (in the same way as before).

Is there a way to add specific conditions to DoppelGANger by adding some code to the existing one, or is it incompatible with such a transformation?

Thanks

fjxmlzn commented 1 year ago

Sorry, I don't fully understand. Would you mind explaining more about "the zero timestep AV coordinates for each timeseries were transformed into the origin of the coordinate system for each timeseries", and what "specific conditions" or "transformation" you need?

dgtriantis commented 1 year ago

I did something similar to what you proposed: I changed the origin of the coordinate system for each sequence so that it equals the coordinates of one of the two agents at the first timestep (the agents are one pedestrian and one autonomous vehicle per sequence/interaction).

When it comes to the conditions, as you can see in the examples I posted previously, the real trajectories are approximately linear, whereas the generated ones are mostly convoluted. The conditions I am thinking of would regulate the generation so that the generated samples are not illogical (e.g., no sudden changes in the inclination of a trajectory, and no crossings between the two agents). In short, I am talking about numerical conditions between the features at each timestep of a sequence. For example, if x[i] is the x coordinate at timestep i, then: x[i] - x[i-1] < 0.1.

fjxmlzn commented 1 year ago

Got it. Thanks for the explanation. The transformation I mentioned should work.

Assume that the original trajectory is [x_0, x_1, ..., x_t]; we can add another metadata x_0, and change the time series to [x_1', ..., x_t'] where x_i' = x_i - x_{i-1}.

(Note that, here, x_i' = x_i - x_{i-1}. The approach you mentioned is instead x_i' = x_i - x_0.)

After applying the above transformation to the original data, you will still need to normalize the time series to [0, 1] (or [-1, 1]) before using DoppelGANger. Therefore, if x[i] - x[i-1] < 0.1 is always satisfied for the real data, the generated data will also satisfy it (when self_norm=False), as values in [0, 1] (or [-1, 1]) will be mapped back to values < 0.1.
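A small sketch of why the bound is preserved by min-max normalization (the delta values are hypothetical; only numpy is assumed):

```python
import numpy as np

# Hypothetical real deltas, all satisfying x[i] - x[i-1] < 0.1
real_deltas = np.array([0.02, 0.05, 0.09, 0.01], dtype=np.float32)

d_min, d_max = real_deltas.min(), real_deltas.max()

# Normalize to [0, 1] before feeding to DoppelGANger
normalized = (real_deltas - d_min) / (d_max - d_min)

# Any generated value in [0, 1] maps back into [d_min, d_max],
# so the recovered deltas stay below 0.1
generated = np.array([0.0, 0.37, 1.0], dtype=np.float32)
denormalized = generated * (d_max - d_min) + d_min
```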

dgtriantis commented 1 year ago

When you say "add another metadata x_0", do you mean adding it separately in "data_feature_output.pkl"?

fjxmlzn commented 1 year ago

Sorry for the confusion--there is some discrepancy between the wording in the code and what I said. When I say metadata in the paper or here, I mean attribute in the code; when I say measurement or time series in the paper or here, I mean feature in the code.

To get back to your original question: I meant adding it as another dimension in data_attribute, and another Output in data_attribute_output.pkl. The reason to add it is to ensure that, after generation, we can use it together with the generated x_i - x_{i-1} in the time series part to recover all x_i.

fjxmlzn commented 1 year ago

Feel free to reopen the issue if the problem persists.