NTMC-Community / MatchZoo

Facilitating the design, comparison and sharing of deep text matching models.
Apache License 2.0

How to load several training data packs iteratively (not all combined in one data pack) #739

Closed ericinf closed 5 years ago

ericinf commented 5 years ago

Describe the Question

I have pretty large training data packages (several GB each, ~100 GB in total), which cannot be combined into one DataPack and loaded into memory. Is it possible to split the training data into N smaller DataPacks and load and feed them to the model iteratively during training? Can I just write a for loop to load them one by one, like:

for epoch in range(N_epochs):
    for i in range(N_packs):
        train_generator_i = mz.DataGenerator(train_dp_processed_i, mode='pair', ...)
        history_i = model.fit_generator(train_generator_i, epochs=1)

Is there a better way to do that? Thanks


bwanglzu commented 5 years ago

I guess the best solution would be to implement a generator that can flow_from_directory like Keras's ImageDataGenerator, but unfortunately we do not have one.

The idea is to process your data into some format, such as a big DataPack, store it locally, and then implement a customized generator that feeds 2^n instances per iteration on the fly. Since your data is too large to fit into memory, the hard disk is the natural choice, so you need to generate training instances from a directory.
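For illustration, a minimal sketch of what such a from-disk generator could look like, assuming MatchZoo 2.x, where each chunk has already been preprocessed and saved with DataPack.save() and can be read back with mz.load_data_pack(); the function name, paths, and batch size are placeholders, not an official MatchZoo feature:

import matchzoo as mz

def stream_batches(chunk_dirs, batch_size=64):
    """Yield (x, y) batches, keeping only one preprocessed chunk in memory at a time."""
    for chunk_dir in chunk_dirs:
        data_pack = mz.load_data_pack(chunk_dir)   # read one saved chunk from disk
        x, y = data_pack.unpack()                  # x: dict of numpy arrays, y: labels
        for start in range(0, len(y), batch_size):
            end = start + batch_size
            yield {name: arr[start:end] for name, arr in x.items()}, y[start:end]

A generator like this could then be passed to model.fit_generator together with an explicit steps_per_epoch.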

Besides, I have no clue whether the MatchZoo preprocessor is fast enough to process such a huge amount of data in a reasonable time... please take that into consideration.

bwanglzu commented 5 years ago

take a look at Dask
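If the raw data lives in many CSV shards, Dask can reference them lazily and hand over one pandas partition at a time; a rough illustration (the file pattern and the text_left / text_right / label column layout are assumptions about your data):

import dask.dataframe as dd
import matchzoo as mz

# Lazily reference all raw training shards without loading them into memory.
ddf = dd.read_csv('train_part_*.csv')

# Materialize one partition (a regular pandas DataFrame) at a time.
for delayed_part in ddf.to_delayed():
    df = delayed_part.compute()
    data_pack = mz.pack(df)  # pack one chunk; preprocess and feed it to the model here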

ericinf commented 5 years ago

Hi, thanks a lot. I think modifying the DataGenerator might take some time and could introduce new bugs :). We just want to quickly run DRMM as a baseline on our data. I came up with another idea: is there a way to resume training? I.e., I first load the first DataPack and train (setting epochs=1), save the model to /path_to_model at the end, then load the 2nd DataPack together with the trained model (trained on the 1st DataPack) from /path_to_model and resume training (i.e., without re-initializing all the parameters from scratch)? Throughout the whole process I would maintain another /best_model, saved whenever the validation result is the best so far.

This way I can train iteratively over the DataPacks with high-level logic while still treating your model as a black box (not touching the generator, to avoid new problems). Thanks a lot.
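A rough sketch of that loop under MatchZoo 2.x, assuming the chunks were preprocessed and saved beforehand and that model.save() / mz.load_model() round-trip the weights; the paths, DataGenerator arguments, and the validate() helper are placeholders:

import shutil
import matchzoo as mz

best_score = float('-inf')
for epoch in range(n_epochs):
    for pack_path in pack_paths:
        train_dp = mz.load_data_pack(pack_path)                # one preprocessed chunk
        gen = mz.DataGenerator(train_dp, mode='pair', num_dup=2, num_neg=1)
        model.fit_generator(gen, epochs=1)
        shutil.rmtree('/path_to_model', ignore_errors=True)    # some versions refuse to overwrite
        model.save('/path_to_model')                           # checkpoint after this chunk
        model = mz.load_model('/path_to_model')                # resume from the checkpoint
        score = validate(model)                                # hypothetical validation helper
        if score > best_score:
            best_score = score
            shutil.rmtree('/best_model', ignore_errors=True)
            model.save('/best_model')                          # keep the best checkpoint so far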

uduse commented 5 years ago

@ericinf model.fit preserves model parameters, so as long as there's no global information needed for preprocessing, you don't need to pack everything into a single data pack. You could train on data packs one by one.

e.g.

# each item in data_files is assumed to be a pandas DataFrame in the format mz.pack expects
for data_file in data_files:
    data_pack = mz.pack(data_file)                            # wrap one chunk in a DataPack
    model.fit(*preprocessor.transform(data_pack).unpack())    # fit() keeps the learned weights

uduse commented 5 years ago

I hope things are working well for you now. I’ll go ahead and close this issue, but I’m happy to continue further discussion whenever needed.