allenai / deep_qa

A deep NLP library, based on Keras / tf, focused on question answering (but useful for other NLP too)
Apache License 2.0
404 stars 132 forks

Allow loading already-indexed data #352

Open matt-gardner opened 7 years ago

matt-gardner commented 7 years ago

This would cut down pre-processing time, at the expense of having to make sure you're using the right vocabulary files and such. It would probably also make some of the sequence tagging stuff simpler.

This depends on #328, and you would basically have an option in each script to output a pre-indexed file, running the data indexing code and saving the results. Or maybe this would be a stand-alone script that just runs the pre-processing and saves the data indexer... The second option is probably cleaner, and doesn't depend on #328. You'd also have to add an option to TextTrainer telling it that it's loading a pre-indexed dataset, and add a way to save and load IndexedInstances (maybe just pickling them...)
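
To make the stand-alone-script idea concrete, here is a rough sketch of what saving and loading pre-indexed data by pickling could look like. It does not use the deep_qa API: `ToyIndexedInstance`, `index_sentences`, `save_indexed` and `load_indexed` are hypothetical names made up for illustration, and the real `IndexedInstance` classes and data indexer would replace them. The vocabulary is stored next to the instances so a loader can check that it is using the right one.

```python
import pickle
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, List, Optional, Tuple


@dataclass
class ToyIndexedInstance:
    """Hypothetical stand-in for an IndexedInstance: word ids plus a label."""
    word_indices: List[int]
    label: int


def index_sentences(sentences: List[str],
                    labels: List[int]) -> Tuple[List[ToyIndexedInstance], Dict[str, int]]:
    """Build a vocabulary and convert each sentence to a list of word ids."""
    vocab: Dict[str, int] = {"<pad>": 0, "<unk>": 1}
    instances = []
    for sentence, label in zip(sentences, labels):
        # setdefault assigns the next free id the first time a word is seen.
        ids = [vocab.setdefault(word, len(vocab)) for word in sentence.lower().split()]
        instances.append(ToyIndexedInstance(ids, label))
    return instances, vocab


def save_indexed(path: Path, instances: List[ToyIndexedInstance],
                 vocab: Dict[str, int]) -> None:
    """Pickle the indexed instances together with the vocabulary that produced them."""
    with open(path, "wb") as handle:
        pickle.dump({"vocab": vocab, "instances": instances}, handle)


def load_indexed(path: Path,
                 expected_vocab: Optional[Dict[str, int]] = None) -> List[ToyIndexedInstance]:
    """Load pickled instances, refusing them if the stored vocabulary does not match."""
    with open(path, "rb") as handle:
        payload = pickle.load(handle)
    if expected_vocab is not None and payload["vocab"] != expected_vocab:
        raise ValueError("pre-indexed data was built with a different vocabulary")
    return payload["instances"]


if __name__ == "__main__":
    instances, vocab = index_sentences(["the cat sat", "the dog sat"], [1, 0])
    with tempfile.TemporaryDirectory() as tmp_dir:
        indexed_file = Path(tmp_dir) / "train.indexed.pkl"
        save_indexed(indexed_file, instances, vocab)
        reloaded = load_indexed(indexed_file, expected_vocab=vocab)
        print(reloaded == instances)
```

Pickle is the quickest option but ties the file to the class definitions and should only be loaded from trusted sources; a plain-text format of integer ids would be more portable, at the cost of writing a parser per instance type.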