allenai / bilm-tf

Tensorflow implementation of contextualized word representations from bi-directional language models
Apache License 2.0

Data repeated loading in a dead loop #88

Closed leon-cas closed 5 years ago

leon-cas commented 6 years ago

In bilm/training.py, the train() function (around lines 837~838) appears to enter a dead loop:

    data_gen = data.iter_batches(batch_size * n_gpus, unroll_steps)
    for batch_no, batch in enumerate(data_gen, start=1):

And iter_batches calls the get_sentence() function in bilm/data.py:

def get_sentence(self):
    while True:
        if self._i == self._nids:  # already return all data in a shard, reload a new one
            self._ids = self._load_random_shard()
        ret = self._ids[self._i]
        self._i += 1
        yield ret

This function seems to be a dead loop: after returning all data in a shard, it reloads a new one and repeats again and again until OOM.
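For context, get_sentence() is written as an endless generator (the `while True` never raises StopIteration), so any consumer has to bound the stream itself. A minimal standalone sketch of that pattern, with hypothetical names rather than the repo's actual code:

```python
from itertools import islice

def sentences(shard):
    # Mimics get_sentence(): cycle through a shard forever,
    # "reloading" it each time it is exhausted.
    while True:
        for s in shard:
            yield s  # never raises StopIteration; the caller must stop

shard = ["a", "b", "c"]
stream = sentences(shard)

# The consumer must bound the infinite stream explicitly, e.g. with islice:
first_seven = list(islice(stream, 7))
print(first_seven)  # ['a', 'b', 'c', 'a', 'b', 'c', 'a']
```

So the generator itself looping forever is by design; the question is whether the training loop that consumes it actually stops.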

The detailed log is:

    2018-08-24 13:03:45.952429: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
    2018-08-24 13:04:08,779 : WARNING : Error encountered when serializing lstm_output_embeddings. Type is unsupported, or the types of the items don't match field type in CollectionDef. 'list' object has no attribute 'name'
    Training for 5 epochs and 705905 batches
    2018-08-24 13:04:12,506 : INFO : iteration batches: batch_size=256, n_gpus=3, unroll_steps=20...
    2018-08-24 13:05:04,869 : INFO : # of left shards: 0
    2018-08-24 13:05:04,869 : INFO : Loading data from: /data/qiwen/deeptext/elmo/data/qiyi_token.corpus.train-02-of-95, reverse=False
    2018-08-24 13:08:27,316 : INFO : Loaded 2241113 sentences.
    2018-08-24 13:08:31,843 : INFO : # of left shards: 0
    2018-08-24 13:08:31,843 : INFO : Loading data from: /data/qiwen/deeptext/elmo/data/qiyi_token.corpus.train-01-of-95, reverse=True
    2018-08-24 13:12:04,657 : INFO : Loaded 2241113 sentences.
    2018-08-24 13:13:02,833 : INFO : set shards and shuffle...
    2018-08-24 13:13:02,833 : INFO : # of left shards: 1
    2018-08-24 13:13:02,833 : INFO : Loading data from: /data/qiwen/deeptext/elmo/data/qiyi_token.corpus.train-02-of-95, reverse=False
    2018-08-24 13:16:27,632 : INFO : Loaded 2241113 sentences.
    2018-08-24 13:16:31,830 : INFO : set shards and shuffle...
    2018-08-24 13:16:31,830 : INFO : # of left shards: 1
    2018-08-24 13:16:31,831 : INFO : Loading data from: /data/qiwen/deeptext/elmo/data/qiyi_token.corpus.train-01-of-95, reverse=True
    2018-08-24 13:20:00,729 : INFO : Loaded 2241113 sentences.
    2018-08-24 13:21:00,800 : INFO : # of left shards: 0
    2018-08-24 13:21:00,800 : INFO : Loading data from: /data/qiwen/deeptext/elmo/data/qiyi_token.corpus.train-02-of-95, reverse=True
    2018-08-24 13:24:33,750 : INFO : Loaded 2241113 sentences.
    2018-08-24 13:24:38,841 : INFO : # of left shards: 0
    2018-08-24 13:24:38,842 : INFO : Loading data from: /data/qiwen/deeptext/elmo/data/qiyi_token.corpus.train-01-of-95, reverse=False
    2018-08-24 13:28:03,307 : INFO : Loaded 2241113 sentences.

ghost commented 5 years ago

@leon-cas I have the same issue. I tried to limit the number of iterations with n_batch_total, and the "for batch_no, batch in enumerate(data_gen, start=1):" loop does have a stop point. But if you run the code in Python 2, enumerate behaves differently than in Python 3, and you still get the dead loop.
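The stop-point pattern described above can be sketched as follows (all names here are hypothetical stand-ins, not the actual bilm-tf code). Note that enumerate itself is lazy in both Python 2 and 3; the Python 2 hang is attributed to zip in the comment below:

```python
def infinite_batches():
    # Stand-in for data.iter_batches(), which never terminates on its own.
    batch = 0
    while True:
        batch += 1
        yield batch

n_batches_total = 5  # hypothetical cap, analogous to n_batch_total above

seen = []
for batch_no, batch in enumerate(infinite_batches(), start=1):
    seen.append(batch)
    if batch_no >= n_batches_total:
        break  # the explicit stop point; without it the loop never ends

print(seen)  # [1, 2, 3, 4, 5]
```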

matt-peters commented 5 years ago

Python 2 isn't supported, and behavior may be incorrect or undefined.

xixiaoyao commented 5 years ago

This is because of the inconsistent behavior of zip(...) in Python 2 and Python 3. In Python 3, zip(...) returns a lazy iterator, while in Python 2 it returns a list. So the zip(...) in class BidirectionalLMDataset (located in data.py) can end up walking over the entire dataset again and again.

To verify this, you could train a single-directional LM, since there is no zip(...) in class LMDataset.
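The zip difference can be demonstrated with two infinite iterators (a simplified stand-in for the forward and reversed sentence streams, not the repo's code). Under Python 3 this runs fine because zip is lazy; under Python 2, the same zip call would try to build a complete list from two endless streams and never return:

```python
from itertools import count, islice

forward = count(0)     # infinite stream, like the forward-direction sentences
backward = count(100)  # infinite stream, like the reverse-direction sentences

pairs = zip(forward, backward)  # Python 3: lazy iterator, returns immediately
first_three = list(islice(pairs, 3))
print(first_three)  # [(0, 100), (1, 101), (2, 102)]
# Under Python 2, zip() here would attempt to materialize a full list from
# two infinite iterators and never return; itertools.izip was the lazy
# Python 2 equivalent of Python 3's zip.
```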

shaneding commented 4 years ago

Currently experiencing this problem, is there a fix for this?