This would cut down pre-processing time, at the expense of having to make sure the pre-indexed data matches the vocabulary files you're using. It would probably also make some of the sequence tagging stuff simpler.
One option depends on #328: each script would get an option to output a pre-indexed file, running the data indexing code and saving the results. Or maybe this would be a stand-alone script that just runs the pre-processing and saves the data indexer. The second option is probably cleaner, and doesn't depend on #328. You'd also have to add an option to `TextTrainer` that tells it that it's loading a pre-indexed dataset, and add a way to save and load `IndexedInstances` (maybe just pickling them).
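A rough sketch of what the save/load piece could look like, assuming pickling is good enough. Everything here is hypothetical: `IndexedInstance` is a simplified stand-in for the real class, and `save_indexed_instances`, `load_indexed_instances`, and `vocab_fingerprint` are made-up names. The vocabulary hash is one possible way to catch the "wrong vocabulary files" problem, not something that exists today.

```python
import hashlib
import pickle


class IndexedInstance:
    """Hypothetical stand-in for the project's real IndexedInstance."""

    def __init__(self, word_indices, label=None):
        self.word_indices = word_indices
        self.label = label


def vocab_fingerprint(vocab):
    # Hash the sorted (word, index) pairs so a changed vocabulary is detectable.
    payload = repr(sorted(vocab.items())).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def save_indexed_instances(instances, vocab, path):
    # Store plain tuples rather than the objects themselves, so the file
    # does not depend on where the IndexedInstance class is defined.
    rows = [(inst.word_indices, inst.label) for inst in instances]
    with open(path, "wb") as f:
        pickle.dump({"vocab_hash": vocab_fingerprint(vocab), "rows": rows}, f)


def load_indexed_instances(vocab, path):
    with open(path, "rb") as f:
        data = pickle.load(f)
    # Refuse to load data that was indexed with a different vocabulary.
    if data["vocab_hash"] != vocab_fingerprint(vocab):
        raise ValueError("pre-indexed data was built with a different vocabulary")
    return [IndexedInstance(indices, label) for indices, label in data["rows"]]
```

The stand-alone script would call something like `save_indexed_instances` after running the data indexer, and the `TextTrainer` option would switch the loading path over to something like `load_indexed_instances`. Note that pickle files should only be loaded from sources you trust.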