asyml / texar

Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow. This is part of the CASL project: http://casl-project.ai/
https://asyml.io
Apache License 2.0
2.39k stars 372 forks

Is there any way to train "big data" using the transformer? #275

Open zqp2009happy opened 4 years ago

zqp2009happy commented 4 years ago

It seems that the Transformer reads all training data into memory, so it easily hits an OOM error with "big" training data like 10 GB (about 50 million text pairs). Is there a solution for this problem?

ZhitingHu commented 4 years ago

By "Transformer" do you mean the example code under examples/ or the transformer modules in the library?

The transformer modules are independent of how you manage training data (in memory or on disk), as long as you pass them a data minibatch each iteration.

The transformer example code does load the whole training set into memory beforehand (code here). To avoid this, you may want to use the Texar data module, which loads data sequentially. Here is an example use of the Texar data module.
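The general pattern behind "loading data sequentially" can be sketched without the Texar API at all. The snippet below is a minimal, framework-agnostic illustration (plain Python, not Texar's actual data module): a generator reads text pairs from disk one line at a time and yields fixed-size minibatches, so memory use stays bounded by one batch rather than the whole corpus. The file layout (tab-separated source/target pairs) and the batch size are assumptions for illustration.

```python
def stream_minibatches(path, batch_size=64):
    """Yield minibatches of (source, target) text pairs lazily.

    The file is read line by line, so memory use is bounded by a
    single batch regardless of the total corpus size. Assumes each
    line holds one tab-separated source/target pair (illustrative).
    """
    batch = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            src, _, tgt = line.rstrip("\n").partition("\t")
            batch.append((src, tgt))
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:  # final partial batch
        yield batch


# A training loop would then consume one batch per iteration, e.g.:
# for batch in stream_minibatches("train.tsv", batch_size=64):
#     run_training_step(batch)   # hypothetical step function
```

Texar's data modules (and `tf.data` pipelines generally) implement the same idea with prefetching and shuffling buffers on top, but the key point is the same: the training loop only ever holds one minibatch in memory at a time.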