[x] Rethink sample selection and income selection (we currently require income > 5k in each year, including the last year a user is observed, for which data may be incomplete; adjust this). Work on branch sample_selection_update.
[x] Cleaning and selection operations with table using decorators (from mlbt)
[x] Eliminate duplicate accounts (not necessary for reasonable results, so didn't do this for now. Still ask MDB for list of duplicate accounts, though, and delete them once we have it).
Ensuring linked accounts are used for vital expenses
As a minimal check on whether a user has linked all their accounts, we want to check for a minimum level of monthly essential expenses.
gathergood2020coholding use a minimum of 5 monthly grocery transactions, which seems a neat and simple way to achieve this (though I'd use 4, at least one per week).
Applying this criterion to the MDB data removes 3/4 of users. On inspection, it turns out that at least in some cases this is because MDB's tagging misses some grocery transactions. Because of this, I use a less conservative criterion, requiring a user to spend at least £200 each month.
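A minimal sketch of the £200-per-month rule, in plain Python so it has no dependencies. The function name, the (user_id, month, amount) tuple layout, and the example data are all hypothetical and not taken from the actual pipeline; the only thing from the note above is the threshold itself.

```python
from collections import defaultdict

MIN_MONTHLY_SPEND = 200.0  # GBP, the threshold described in the note above


def users_meeting_min_spend(transactions, min_spend=MIN_MONTHLY_SPEND):
    """Return the users who spend at least `min_spend` in every month they appear.

    `transactions` is an iterable of (user_id, 'YYYY-MM', amount) tuples with
    positive amounts for spending (a hypothetical layout).

    Caveat: a month in which a user has no transactions at all does not show
    up here, so it cannot fail the check. Catching those would need an
    explicit list of the months each user is observed.
    """
    # Total spend per user-month.
    monthly = defaultdict(float)
    for user_id, month, amount in transactions:
        monthly[(user_id, month)] += amount

    # A user is dropped as soon as any one of their months is below threshold.
    seen, failed = set(), set()
    for (user_id, _month), total in monthly.items():
        seen.add(user_id)
        if total < min_spend:
            failed.add(user_id)
    return seen - failed


# Made-up example: user "a" clears £200 in both months, user "b" falls short in February.
txns = [
    ("a", "2020-01", 150.0), ("a", "2020-01", 80.0), ("a", "2020-02", 210.0),
    ("b", "2020-01", 300.0), ("b", "2020-02", 120.0),
]
selected = users_meeting_min_spend(txns)
print(selected)
```

The grocery-count variant (at least 4 or 5 tagged grocery transactions per month) would follow the same pattern, counting rows instead of summing amounts, but it inherits the tagging gaps noted above.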
done: sample_selection_update.