compare main competing recurrent architecture (e.g. word versus char level), cache models etc. Have these choices reviewed by colleagues beforehand to make sure that our choices are good, because we won't have the freedom to change them afterwards
Don't give users a button for model choice, but have n models generate suggestions simultaneously (in parallel), so that we could actually compare user preferences under the exact same conditions (e.g. context)