dicook / merops

MIT License

Data quality checks #1

Open dicook opened 7 years ago

dicook commented 7 years ago
pfh commented 7 years ago

On the datathon Slack channel, sparker writes: "Hi All, looks like around 3% of the transactions are duplicates. Specifically 1958301 duplicates, leaving 57492484 unique records in total. ... Having done a quick scan it looks like most of the duplicates are from patients in Tasmania (first digit of post code = 7). All Tasmanian patients have all of their transactions duplicated. This only accounts for 87% of the patients with duplicates. I haven't found a common reason for the others."
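For anyone wanting to reproduce this kind of check, here is a minimal sketch in plain Python. The record layout `(patient_id, postcode, drug_code, date)` and the toy rows are made up for illustration; the actual field names in the dataset are not given in this thread, and "duplicate" is assumed here to mean an exact repeat of the whole row.

```python
from collections import Counter

# Hypothetical toy records: (patient_id, postcode, drug_code, date).
# The real field names and layout are assumptions, not taken from the dataset.
transactions = [
    (1, "7000", "A10", "2014-03-01"),
    (1, "7000", "A10", "2014-03-01"),  # exact duplicate, postcode starts with 7
    (2, "3000", "C09", "2014-03-02"),
    (3, "2000", "A10", "2014-05-07"),
    (3, "2000", "A10", "2014-05-07"),  # exact duplicate, postcode does not start with 7
    (4, "7250", "N02", "2015-01-15"),
    (4, "7250", "N02", "2015-01-15"),  # exact duplicate, postcode starts with 7
]


def duplicate_summary(rows):
    """Return (unique rows, duplicate rows, share of duplicated patients with postcode 7xxx)."""
    counts = Counter(rows)
    n_unique = len(counts)
    n_duplicates = len(rows) - n_unique
    # Patients with at least one repeated row, mapped to their postcode.
    dup_patients = {row[0]: row[1] for row, n in counts.items() if n > 1}
    n_tas = sum(1 for postcode in dup_patients.values() if postcode.startswith("7"))
    share_tas = n_tas / len(dup_patients) if dup_patients else 0.0
    return n_unique, n_duplicates, share_tas


n_unique, n_duplicates, share_tas = duplicate_summary(transactions)
print(n_unique, n_duplicates, round(share_tas, 2))
```

On the real data the same idea would need to run over the full transaction files, and whether every field should take part in the duplicate key is a judgment call.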

dicook commented 7 years ago

The 2016 data is half missing, and the data before 2011 seems spotty, so it is probably best to focus on 2011-2015 for analysis.

pfh commented 7 years ago

The 2016 data is missing from half the patients because the Kaggle competition is to predict diabetes drug usage in 2016 for these people (patient ids 279201:558352).
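A quick way to confirm this is to compare which years appear for the held-out patient ids against everyone else. The sketch below assumes the range 279201:558352 includes both endpoints and uses made-up `(patient_id, year)` rows; neither the row layout nor the helper name comes from the dataset.

```python
# Held-out patient ids from the comment above; both endpoints are assumed inclusive.
HOLDOUT_IDS = range(279201, 558352 + 1)

# Hypothetical toy rows: (patient_id, year of transaction).
rows = [
    (100, 2015),
    (100, 2016),
    (279201, 2014),
    (279201, 2015),
    (558352, 2013),
    (600000, 2016),
]


def years_by_group(records):
    """Collect the set of years seen for held-out patients and for all other patients."""
    seen = {"holdout": set(), "other": set()}
    for patient_id, year in records:
        group = "holdout" if patient_id in HOLDOUT_IDS else "other"
        seen[group].add(year)
    return seen


seen = years_by_group(rows)
print(sorted(seen["holdout"]), sorted(seen["other"]))
```

If the explanation holds, 2016 should be absent from the held-out group and present for the others.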