fjxmlzn / DoppelGANger

[IMC 2020 (Best Paper Finalist)] Using GANs for Sharing Networked Time Series Data: Challenges, Initial Promise, and Open Questions
http://arxiv.org/abs/1909.13403
BSD 3-Clause Clear License
296 stars, 75 forks

Timeseries with missing data #18

Closed (HarisNaveed17 closed 3 years ago)

HarisNaveed17 commented 3 years ago

Hello, I hope you are well.

First of all, amazing work. I particularly enjoyed reading the paper. It was very well-written and easily understandable.

I'm working with an open-source internet activity data set. It's fairly small, with only hourly recordings over 5 weeks. Based on my understanding of the data format, I used the 'week of the year' as the attribute and the actual measurements as the feature. The results were pretty impressive, and I've attached a simple comparison plot of the real (orange) and generated (blue) sequences below.

[Image: intactivity_dpg4259]

My data format looked something like this:

Week of year   0      1      2      3      .....  168
0              123    456    678    567    .....  234
1              345    890    787    122    .....  345
...            .....  .....  ....   .....  .....  ....

For now, there are 168 hours in each week, so the series length is constant and active on every step. This made for fairly simple data pre-processing. Now suppose I randomly removed some hourly values from each week, and then trained the DG on that new, partial timeseries data.
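For illustration, the complete (no gaps) case described above could be laid out as arrays roughly as follows. This is only a sketch: the array names `data_feature` and `data_attribute` and the shapes are assumptions based on a reading of the repository README, the measurements are random stand-ins rather than the real data set, and normalization is omitted.

```python
import numpy as np

# Hypothetical stand-in for the real measurements: 5 weeks x 168 hourly values.
num_weeks, hours_per_week = 5, 168
rng = np.random.default_rng(0)
measurements = rng.integers(100, 1000, size=(num_weeks, hours_per_week))

# One feature per time step: [samples, max length, feature dim] (assumed layout).
data_feature = measurements.astype(np.float32)[:, :, np.newaxis]

# 'Week of the year' as a one-hot attribute: [samples, attribute dim] (assumed layout).
data_attribute = np.eye(num_weeks, dtype=np.float32)

# Every hour is present, so every step is marked active: [samples, max length].
data_gen_flag = np.ones((num_weeks, hours_per_week), dtype=np.float32)

print(data_feature.shape, data_attribute.shape, data_gen_flag.shape)
```

With a constant series length, the flag array is all ones, which is why the pre-processing is simple in this case.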

Can the DG produce unique values for all hours that I could then use to fill in the gaps in the original input? Put another way: if each week had a different series length, could the DG produce the full 168 hours for each week based on the hourly values it gets?

If yes, then what would the preprocessing of such a data set look like?

My idea is to add the hour as an additional feature, but I'm not sure whether I should drop the hours with missing values and let DG pad the end of the time series as it normally does, or do something else. I'm also not sure how this would be reflected in the data_gen_flag. Would I just mark the time series as 'off' for those values?
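To make the idea concrete, here is a minimal sketch of the "drop missing hours, then pad the end" option. The helper `compact_and_pad`, the toy values, and the `data_feature` layout are hypothetical, and the sketch assumes that `data_gen_flag` is a run of ones followed by zeros. It only shows what the pre-processing would look like; it does not establish that DoppelGANger can fill arbitrary gaps from such input.

```python
import numpy as np

def compact_and_pad(values, observed):
    """Hypothetical helper: move observed hours to the front and zero-pad the rest.

    values:   [weeks, hours] float array of measurements
    observed: [weeks, hours] boolean mask, True where a value is present
    """
    num_weeks, max_len = values.shape
    # Feature 0 is the measurement, feature 1 is the hour index scaled to [0, 1].
    data_feature = np.zeros((num_weeks, max_len, 2), dtype=np.float32)
    data_gen_flag = np.zeros((num_weeks, max_len), dtype=np.float32)
    for w in range(num_weeks):
        hours = np.flatnonzero(observed[w])  # indices of the hours that were kept
        n = len(hours)
        data_feature[w, :n, 0] = values[w, hours]
        data_feature[w, :n, 1] = hours / (max_len - 1)
        data_gen_flag[w, :n] = 1.0  # active for the first n steps, 'off' afterwards
    return data_feature, data_gen_flag

# Toy example: 2 weeks of 4 "hours" each, with some hours removed.
values = np.array([[10.0, 20.0, 30.0, 40.0], [50.0, 60.0, 70.0, 80.0]])
observed = np.array([[True, False, True, True], [False, True, True, False]])
data_feature, data_gen_flag = compact_and_pad(values, observed)
print(data_gen_flag)
```

The hour feature is what keeps track of which positions survived. Without it, the compacted series would lose the information about where the gaps were.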

I hope I make sense. I'd love to hear your opinion about whether this sort of generation is possible and a rough idea of what time attributes/features should be included to improve its results. Thank you!

fjxmlzn commented 3 years ago

Thank you! I am excited to hear that it works for your data!

If I understand it correctly, the problem is that some entries in the data table you show are missing, and the positions of the missing entries could be arbitrary, and you want to predict them.

This is a very interesting problem. Unfortunately, DoppelGANger cannot handle this. In fact, we have been thinking of a similar problem (extending DoppelGANger for time series prediction like your case), which would require additional designs/approaches. I will update here if we get to that point.