Lyken17 / Efficient-PyTorch

My best practices for training on large datasets with PyTorch.

Why open transactions (txn) repeatedly? #22


cnlinxi commented 4 years ago
    txn = db.begin(write=True)
    for idx, data in enumerate(data_loader):
        image, label = data[0]
        # Key is the sample index; value is the pyarrow-serialized pair.
        txn.put(u'{}'.format(idx).encode('ascii'), dumps_pyarrow((image, label)))
        if idx % write_frequency == 0:
            print("[%d/%d]" % (idx, len(data_loader)))
            # Commit the pending writes, then start a fresh transaction.
            txn.commit()
            txn = db.begin(write=True)

Here you repeatedly commit the data and re-open the transaction. Is that to prevent the transaction from growing too large, and is it necessary? In practice I have not seen LMDB crash from using too much memory, though that may be because the dataset I used was too small. I just find it very strange; the code here looks too weird.

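For what it's worth, as far as I can tell LMDB keeps all dirty pages of a write transaction in memory until it commits, and a single transaction can only hold a bounded number of dirty pages (exceeding it fails with MDB_TXN_FULL), so committing every `write_frequency` entries bounds that buffer. Here is a minimal self-contained sketch of the pattern, using `pickle` as a stand-in for the repo's `dumps_pyarrow` and a hypothetical `db_path` / `samples` iterable:

    import pickle
    import lmdb

    def write_samples(db_path, samples, write_frequency=5000):
        # map_size caps how large the database file may grow (here 1 TiB).
        db = lmdb.open(db_path, map_size=2 ** 40)
        txn = db.begin(write=True)
        for idx, sample in enumerate(samples):
            txn.put(str(idx).encode('ascii'), pickle.dumps(sample))
            if idx % write_frequency == 0:
                # Flush the dirty pages accumulated so far; a write
                # transaction keeps them in memory until it commits.
                txn.commit()
                txn = db.begin(write=True)
        # Commit whatever remains after the last full batch.
        txn.commit()
        db.close()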

Lyken17 commented 4 years ago

I remember this is a snippet I found somewhere on Stack Overflow; committing everything in a single transaction led to crashes on some versions.

If you can help test on a large-scale dataset (e.g., ImageNet) to ensure the current LMDB code works smoothly, I think we can safely remove the last line.
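For reference, the single-transaction variant to test would look roughly like this (a sketch under the same hypothetical `db_path` / `samples` setup as above, not the repo's code). If it completes on an ImageNet-scale dataset without errors such as MDB_TXN_FULL, the periodic commit-and-reopen could be dropped:

    import pickle
    import lmdb

    def write_samples_single_txn(db_path, samples):
        db = lmdb.open(db_path, map_size=2 ** 40)
        # One transaction for the whole dataset: it commits on a clean
        # exit from the with-block and aborts on an exception.
        with db.begin(write=True) as txn:
            for idx, sample in enumerate(samples):
                txn.put(str(idx).encode('ascii'), pickle.dumps(sample))
        db.close()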