Arturus / kaggle-web-traffic

1st place solution
MIT License

Dealing with sparsity #20

Closed wasd12345 closed 5 years ago

wasd12345 commented 6 years ago

Hi, question about how you dealt with sparsity.

In input_pipe.py, there is a parameter `train_completeness_threshold` which determines how many 0's are allowed; its default value is 1. Further down in the code, there is:

`self.max_train_empty = int(round(train_window * (1 - train_completeness_threshold)))`

So with the default value of 1, `max_train_empty` defaults to 0, i.e. a randomly cropped time series must be completely filled (no missing values) in order to be used in training.
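A minimal sketch of that arithmetic, just to make the relationship concrete (the `train_window` value of 283 below is only an illustrative number, not necessarily the repo's actual setting):

```python
# Sketch of how max_train_empty follows from train_completeness_threshold,
# mirroring the formula quoted from input_pipe.py.
def max_train_empty(train_window, train_completeness_threshold):
    # Fraction (1 - threshold) of the window may be empty, rounded to an int.
    return int(round(train_window * (1 - train_completeness_threshold)))

# With the default threshold of 1.0, zero empty points are allowed:
print(max_train_empty(283, 1.0))   # → 0
# With a threshold of 0.01, almost the whole window may be empty:
print(max_train_empty(283, 0.01))  # → 280
```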

So is this what you did to get your best results, you discarded any time series crop which had holes in it?

Of the ~145 thousand time series in train_1.csv, it looks like about 2/3 are dense (no missing values). Any random crop of a dense series remains dense, and a random crop of a series with holes may land on a dense portion, so I guess that even with max_train_empty = 0 you still get to use most of the data, right?

Arturus commented 5 years ago

The final version has train_completeness_threshold=0.01, which filters out only almost-empty series. I found that sparseness has a regularizing effect on training, so there is no reason to filter out sparse series.
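With threshold=0.01 the filter only rejects windows that are (nearly) all empty. A hypothetical sketch of that check, assuming missing values are encoded as 0 in the cropped window (the function name `passes_completeness` is my own, not from the repo):

```python
import numpy as np

def passes_completeness(window, threshold=0.01):
    # Maximum number of empty (zero) points the crop may contain.
    max_empty = int(round(len(window) * (1 - threshold)))
    n_empty = int(np.sum(window == 0))
    return n_empty <= max_empty

dense = np.ones(100)        # fully filled crop
empty = np.zeros(100)       # fully empty crop
print(passes_completeness(dense))  # → True
print(passes_completeness(empty))  # → False
```

So at threshold=0.01 a very sparse crop still trains, and only the degenerate all-zero case is discarded.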