ermongroup / CSDI

Code for "CSDI: Conditional Score-based Diffusion Models for Probabilistic Time Series Imputation"
MIT License

DEformer-like model achieves 0.216 MAE on 10% missing healthcare dataset #2

Closed: airalcorn2 closed this issue 2 years ago

airalcorn2 commented 2 years ago

Congratulations on your paper being accepted to NeurIPS, and thank you for sharing your code! The task as described seemed like a good fit for a DEformer-like model (hereafter "DEformer-CSDI"), so I ran an experiment on the 10% missing healthcare dataset and thought you might be interested in the results (code here). While my test set is identical to yours, I changed the training/validation split to 95%/5% and used an online strategy to generate missing values for each training sample: every time a training sample was encountered, I randomly selected 10% of its observed values to serve as the missing values.
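For concreteness, here is a minimal sketch of what such an online masking step could look like; the function name, tensor shapes, and the PyTorch details are my own assumptions for illustration, not necessarily what the linked code does.

```python
import torch

def sample_online_mask(observed_mask: torch.Tensor, missing_ratio: float = 0.1) -> torch.Tensor:
    """Randomly mark `missing_ratio` of the observed entries as imputation targets.

    observed_mask: (T, K) binary tensor, 1 where a value was actually observed.
    Returns a (T, K) binary target mask, 1 for entries the model must impute.
    """
    observed_indices = observed_mask.nonzero(as_tuple=False)            # (N, 2) coordinates
    n_targets = max(1, int(missing_ratio * observed_indices.shape[0]))  # ~10% of observed values
    chosen = observed_indices[torch.randperm(observed_indices.shape[0])[:n_targets]]
    target_mask = torch.zeros_like(observed_mask)
    target_mask[chosen[:, 0], chosen[:, 1]] = 1
    return target_mask
```

Because the mask is redrawn every time a sample is loaded, the model sees a different 10% held out on every pass through the training set.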

Like the DEformer, the input for DEformer-CSDI consists of a mix of identity feature vectors and identity/value feature vectors. The difference here is that DEformer-CSDI is not learning the joint distribution, so only identity feature vectors are included for the missing values, and the attention mask is full rather than lower triangular (i.e., every input can attend to every other input). Identity was encoded as f(t, k) = [t, embed(k)], where t and k are the time and feature indices, respectively, for a data point. One interesting difference between DEformer-CSDI and CSDI is that DEformer-CSDI simply ignores missing values that are not being predicted.
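Roughly, the token construction could look like the sketch below; the module names (`feature_embed`, `identity_proj`, `observed_proj`) and the exact concatenation order are assumptions for illustration, not the linked implementation.

```python
import torch
import torch.nn as nn

class DEformerCSDIInputs(nn.Module):
    """Builds the mixed identity / identity-plus-value token sequence."""

    def __init__(self, n_features: int, embed_dim: int, d_model: int):
        super().__init__()
        self.feature_embed = nn.Embedding(n_features, embed_dim)   # embed(k)
        # Identity-only tokens carry [t, embed(k)]; observed tokens also carry the value.
        self.identity_proj = nn.Linear(1 + embed_dim, d_model)
        self.observed_proj = nn.Linear(1 + embed_dim + 1, d_model)

    def forward(self, times, feats, values, is_target):
        # times: (N,) time index t, feats: (N,) feature index k,
        # values: (N,) observed value (ignored for targets), is_target: (N,) bool.
        identity = torch.cat([times.unsqueeze(-1).float(), self.feature_embed(feats)], dim=-1)
        obs_tokens = self.observed_proj(torch.cat([identity, values.unsqueeze(-1)], dim=-1))
        tgt_tokens = self.identity_proj(identity)
        # Targets get identity-only tokens; observed points get identity+value tokens.
        return torch.where(is_target.unsqueeze(-1), tgt_tokens, obs_tokens)
```

With tokens built this way, the encoder is run with a full attention mask (every token attends to every other token), and missing values that are not imputation targets simply never become tokens.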

With no hyperparameter tuning, DEformer-CSDI achieves a mean absolute error (MAE) of 0.219 on the 10% missing healthcare dataset. I thought it was notable that DEformer-CSDI outperformed the flattened Transformer baseline from Table 7 by a wide margin. That said, DEformer-CSDI is much larger than CSDI (19,250,493 parameters), so it would be interesting to see whether CSDI's performance could be improved further using this online sampling strategy.

airalcorn2 commented 2 years ago

I just realized I had a bug in how I was calculating the MAE: I was weighting each sample equally instead of in proportion to its number of missing values. Preliminary results suggest this won't change the final value much, but I'm re-running the experiment with the bug fixed.
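For anyone comparing numbers, the difference is just averaging per-sample MAEs versus pooling the absolute errors over all imputed values. A toy sketch with made-up numbers:

```python
import numpy as np

# Hypothetical absolute errors on the imputed values: one array per test
# sample, with different numbers of missing values per sample.
per_sample_errors = [np.array([0.1, 0.3]), np.array([0.2, 0.2, 0.2, 0.8])]

# Buggy version: each sample counts equally, regardless of how many
# missing values it contains.
buggy_mae = np.mean([e.mean() for e in per_sample_errors])    # 0.275

# Fixed version: every imputed value counts equally, so each sample is
# implicitly weighted by its number of missing values.
fixed_mae = np.concatenate(per_sample_errors).mean()          # 0.30
```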

airalcorn2 commented 2 years ago

Indeed, it didn't make much of a difference. I now get an MAE of 0.216.