Closed: ericinf closed this issue 5 years ago
I guess the best solution is to implement a generator which could flow_from_directory like Keras's ImageDataGenerator, but unfortunately we do not have one.
The idea is to process your data into some format, such as a big DataPack, store it locally, then implement a customized generator to feed 2^n instances per iteration on the fly. Since your data is too large to load into memory, the hard disk is the natural choice, and you then generate training instances from that directory.
Besides, I have no clue whether the MatchZoo preprocessor is fast enough to process such a huge amount of data in a reasonable time... please take that into consideration.
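For reference, a minimal sketch of such an on-the-fly generator, assuming the data has already been preprocessed and dumped to per-chunk .npz files on disk (the file layout, array keys, and batch size are all hypothetical, and the input dict must match whatever your model actually expects):

import numpy as np
from keras.utils import Sequence

class DiskChunkGenerator(Sequence):
    """Feeds batches read from preprocessed chunks stored on disk (hypothetical .npz layout)."""

    def __init__(self, chunk_paths, batch_size=64):
        self.chunk_paths = chunk_paths  # each file holds arrays 'text_left', 'text_right', 'y'
        self.batch_size = batch_size
        # Precompute a (chunk index, offset) pair for every batch.
        self.index = []
        for ci, path in enumerate(chunk_paths):
            n = np.load(path)['y'].shape[0]
            self.index.extend((ci, off) for off in range(0, n, batch_size))

    def __len__(self):
        return len(self.index)

    def __getitem__(self, i):
        ci, off = self.index[i]
        chunk = np.load(self.chunk_paths[ci])  # read only the needed chunk from disk
        sl = slice(off, off + self.batch_size)
        x = {'text_left': chunk['text_left'][sl], 'text_right': chunk['text_right'][sl]}
        return x, chunk['y'][sl]

A generator like this can be passed straight to model.fit_generator; reloading a whole chunk per batch is wasteful, so in practice you would cache the most recently opened chunk.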
Hi, thanks a lot. I think modifying the DataGenerator might take some time and might introduce some bugs :). We just want to quickly run this DRMM as a baseline on our data. I came up with an idea: is there a functionality for resuming training? I.e., I first load the first DataPack and train (setting epochs=1) and at the end save the model to /path_to_model; then I load the 2nd DataPack, load the trained model (trained on the 1st DataPack) from /path_to_model, and resume training (i.e., not re-initializing all the params from scratch). During the whole process I would maintain another /best_model which is saved whenever the validation score is the best so far.
In this way I can train iteratively through the DataPacks with high-level logic while still treating your model as a black box (not touching the generator, to avoid new problems). Thanks a lot.
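For what it's worth, a rough sketch of that checkpoint-and-resume loop, assuming MatchZoo's model.save and mz.load_model behave like ordinary Keras checkpointing (n_epochs, processed_packs, valid_pack, and the metric aggregation are all placeholders):

import shutil
import matchzoo as mz

best_score = float('-inf')
for epoch in range(n_epochs):
    for i, train_dp in enumerate(processed_packs):  # processed_packs: preprocessed training DataPacks
        if epoch > 0 or i > 0:
            model = mz.load_model('/path_to_model')  # resume from the latest checkpoint
        model.fit_generator(mz.DataGenerator(train_dp, mode='pair'), epochs=1)
        shutil.rmtree('/path_to_model', ignore_errors=True)  # some versions refuse to overwrite a save dir
        model.save('/path_to_model')
        metrics = model.evaluate(*valid_pack.unpack())  # dict of metric -> value on the validation pack
        score = max(metrics.values())  # illustrative only: pick the metric you actually care about
        if score > best_score:
            best_score = score
            shutil.rmtree('/best_model', ignore_errors=True)
            model.save('/best_model')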
@ericinf model.fit preserves model parameters, so as long as there's no global information needed for preprocessing, you don't need to pack everything into a single DataPack. You could train on the data packs one by one.
e.g.
for data_file in data_files:
    data_pack = mz.pack(data_file)  # build a DataPack from one chunk of the data
    model.fit(*preprocessor.transform(data_pack).unpack())  # weights persist across successive fit calls
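One caveat about the "no global information" condition: the preprocessor itself holds global state such as the vocabulary, so a common workaround is to fit it once on a representative sample and then only transform each pack. A rough sketch, with sample_df and chunked_dataframes as placeholder names and BasicPreprocessor standing in for whatever preprocessor your model needs:

import matchzoo as mz

preprocessor = mz.preprocessors.BasicPreprocessor()
preprocessor.fit(mz.pack(sample_df))  # fit vocabulary/statistics once on a representative sample

for df in chunked_dataframes:  # iterate over the data one manageable chunk at a time
    data_pack = mz.pack(df)
    model.fit(*preprocessor.transform(data_pack).unpack())  # weights persist across fit calls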
I hope things are working well for you now. I’ll go ahead and close this issue, but I’m happy to continue further discussion whenever needed.
Describe the Question
I have pretty large training data (packs of several GB each, about 100 GB in total) which cannot be combined into one DataPack and loaded into memory. Is it possible to pack N small DataPacks for training and load and feed them iteratively to the model during training? Can I just write a for loop to load them iteratively, like:

for epoch in range(N_epochs):
    for i in range(N_packs):
        train_generator_i = mz.DataGenerator(train_dp_processed_i, mode='pair', ...)
        history_i = model.fit_generator(train_generator_i, epochs=1)

Is there a better way to do that? Thanks
Describe your attempts
You may also provide a Minimal, Complete, and Verifiable example you tried as a workaround, or StackOverflow solution that you have walked through. (e.g. cosmic radiation).
In addition, figure out your MatchZoo version by running import matchzoo; matchzoo.__version__. If this gives you an error, then you're probably using 1.0, and 1.0 is no longer supported. Then attach the corresponding label to the issue.