allenai / deep_qa

A deep NLP library, based on Keras / tf, focused on question answering (but useful for other NLP too)
Apache License 2.0

Produce training data from generators, as opposed to in a list #188

Closed · nelson-liu closed this issue 7 years ago

nelson-liu commented 7 years ago

I ran into an issue where training on a large dataset (Who Did What) with 50d character embeddings promptly caused the memory usage on my system to explode (60 GB of RAM exhausted). When you use the words-and-characters tokenizer, you essentially multiply all of your data arrays by the maximum word length (in characters). This can become problematic, and the fix is to have the training data come from a generator that produces it only as needed, eliminating the need to store one big array in memory.
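For a rough sense of the scale involved, here's a back-of-the-envelope sketch (the sizes are hypothetical, chosen for illustration, not figures from the issue):

```python
# Hypothetical sizes chosen for illustration; not actual deep_qa data.
num_instances = 100_000
sentence_length = 400   # padded tokens per instance
max_word_length = 20    # padded characters per word

# Word-level indices: one int32 per token slot.
word_bytes = num_instances * sentence_length * 4                    # ~160 MB

# Words-and-characters tokenizer: one int32 per character slot, so
# every array grows by a factor of max_word_length.
char_bytes = num_instances * sentence_length * max_word_length * 4  # ~3.2 GB
```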

I'm adding this as an issue here so I can track it (and make sure I don't forget about it). Assigning myself.

nelson-liu commented 7 years ago

So I've implemented this feature in my personal fork of deep_qa. It involves not using Dataset.pad_instances, but rather iterating over each instance in the dataset, padding it, and converting that individual instance to a numpy array before yielding it.
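A minimal sketch of that approach, where `dataset.instances`, `instance.pad`, and `instance.as_training_data` are assumed names standing in for the actual deep_qa API:

```python
import numpy as np

def instance_generator(dataset, padding_lengths, batch_size):
    # Yields (inputs, labels) batches one at a time instead of
    # materializing the entire padded dataset as a single array.
    while True:  # Keras's fit_generator expects a generator that loops forever.
        batch_inputs, batch_labels = [], []
        for instance in dataset.instances:
            # Pad just this one instance and convert it to arrays on the fly.
            instance.pad(padding_lengths)
            inputs, label = instance.as_training_data()
            batch_inputs.append(inputs)
            batch_labels.append(label)
            if len(batch_inputs) == batch_size:
                yield np.asarray(batch_inputs), np.asarray(batch_labels)
                batch_inputs, batch_labels = [], []
```

Something like this can then be handed to Keras 2's `model.fit_generator` with `steps_per_epoch=len(dataset.instances) // batch_size`, so only one batch of padded arrays lives in memory at a time.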

I'll leave this issue open for now, but the core pain point that made us want this was SQuAD not fitting in memory when training BiDAF; since that problem was largely solved by #254, perhaps this isn't as important now...

matt-gardner commented 7 years ago

Fixed by #295.