Lyken17 / Efficient-PyTorch

My best practice of training large dataset using PyTorch.

Large memory occupation #17

Open gathierry opened 4 years ago

gathierry commented 4 years ago

Hi, I'm training Faster R-CNN on 4 GPUs with the COCO dataset converted to LMDB. I set num_workers=4 for the dataloader, and I found that the memory occupation is almost 60 GB. I suspect that the whole dataset is being read into memory. But per your description in the README,

Here I choose lmdb because

  1. hdf5, pth, and n5, though offering a straightforward JSON-like API, require loading the whole file into memory. This is not practical when you work with a large dataset like ImageNet.

LMDB shouldn't behave like this. Any thoughts on what might be going on? I can share part of my dataset code:

import os

import lmdb
import numpy as np
import pyarrow as pa
import six
from PIL import Image
from torch.utils.data import Dataset


class LMDBWrapper(object):
    def __init__(self, lmdb_path):
        # Open the environment read-only; readahead=False tells the OS not to
        # prefetch pages, meminit=False skips zero-initialising buffers.
        self.env = lmdb.open(lmdb_path, max_readers=1,
                             subdir=os.path.isdir(lmdb_path),
                             readonly=True, lock=False,
                             readahead=False, meminit=False)
        with self.env.begin(write=False) as txn:
            self.length = pa.deserialize(txn.get(b'__len__'))
            self.keys = pa.deserialize(txn.get(b'__keys__'))

    def get_image(self, image_key):
        env = self.env
        with env.begin(write=False) as txn:
            byteflow = txn.get(u'{}'.format(image_key).encode('ascii'))
        imgbuf = pa.deserialize(byteflow)
        buf = six.BytesIO()
        buf.write(imgbuf)
        buf.seek(0)
        image = Image.open(buf).convert('RGB')
        return np.asarray(image)

class LMDBDataset(Dataset):
    def __init__(self, lmdb_path):
        # The LMDB env is opened lazily inside each worker process instead of
        # in the parent, so the handle is never shared across processes.
        self.lmdb = None
        self.lmdb_path = lmdb_path

    def init_lmdb(self):
        self.lmdb = LMDBWrapper(self.lmdb_path)

    def __getitem__(self, idx):
        if self.lmdb is None:
            self.init_lmdb()


class CocoInstanceLMDBDataset(LMDBDataset):
    def __init__(self, lmdb_path):
        super().__init__(lmdb_path=lmdb_path)

    def __getitem__(self, idx):
        super().__getitem__(idx)  # make sure the LMDB env is open in this worker
        ann = self.filtered_anns[idx]  # annotation list built elsewhere (omitted)
        data = dict()
        # transforms
        return data
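For reference, this is a quick way to check whether it is the resident memory of the training/worker processes that grows (just a sketch, not part of the code above; it assumes psutil is available):

import os

import psutil


def log_rss(tag=""):
    # Resident set size of the current process, in GB.
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3
    print("[pid %d] %s RSS: %.2f GB" % (os.getpid(), tag, rss_gb))

# Called every few hundred iterations inside the training loop.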
xieydd commented 4 years ago

Same problem here @Lyken17

Lyken17 commented 4 years ago

@xieydd @gathierry Can you share the versions of your torch and py-lmdb?

gathierry commented 4 years ago

In my case, torch==1.4.0+cu92 and lmdb==0.98

guhyunkim commented 4 years ago

I have a similar problem. I used the ImageFolderLMDB class in folder2lmdb.py, and during iteration of the dataloader the RAM usage continuously increases. The problem may be caused by txn.get(self.keys[index]), but I don't know how to fix it.
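A stripped-down loop like the one below might help isolate whether txn.get alone causes the growth (just a sketch; the path is a placeholder and it assumes the LMDB was written by folder2lmdb.py with pyarrow-serialized __keys__):

import lmdb
import pyarrow as pa

# Open the same LMDB the dataloader uses, with the same flags.
env = lmdb.open("train.lmdb", subdir=False, readonly=True,
                lock=False, readahead=False, meminit=False)
with env.begin(write=False) as txn:
    keys = pa.deserialize(txn.get(b'__keys__'))

# Fetch every record with plain py-lmdb, no PyTorch involved,
# and watch whether RAM grows during this loop.
with env.begin(write=False) as txn:
    for i, key in enumerate(keys):
        byteflow = txn.get(key)
        if i % 1000 == 0:
            print(i, len(byteflow))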

Lyken17 commented 4 years ago

I did a simple test using the ImageNet dataset; however, I failed to observe any memory leak:

import torch
from torchvision import transforms

from folder2lmdb import ImageFolderLMDB

dst = ImageFolderLMDB(
    "/ImageNet/train.lmdb",
    transform=transforms.Compose([
        transforms.CenterCrop(224),
        transforms.ToTensor(),
    ]))
train_loader = torch.utils.data.DataLoader(
    dst, batch_size=64, num_workers=40, pin_memory=True)

for i, _ in enumerate(train_loader):
    if i % 10 == 0:
        print("[%d/%d]" % (i, len(train_loader)))

The memory usage shown in htop does not increase over time (htop screenshot attached).

Though I noticed some issues mentioning this (https://github.com/pytorch/vision/issues/619), could you provide a more detailed setup (e.g., a sample snippet that leads to the memory leak)?
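If it turns out to be the copy-on-write growth discussed in that vision issue (every worker touching a large Python list of keys/labels duplicates those pages), one mitigation reported there is to keep such lists in numpy arrays rather than Python objects. A rough, untested sketch:

import numpy as np

# Stand-in for the key list deserialized from b'__keys__' in the dataset.
keys_list = [b'0', b'1', b'2']

# A fixed-width byte array lives outside the Python object heap, so worker
# processes reading it do not bump refcounts and trigger copy-on-write.
keys = np.array(keys_list)

# In __getitem__, convert back to bytes before handing the key to txn.get().
idx = 1
key = bytes(keys[idx])
print(key)  # b'1'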

Lyken17 commented 4 years ago

Maybe you need to remove the max_readers=1 parameter?

gathierry commented 4 years ago

I tried without max_readers=1 but it doesn't change anything. Do you think it's because I started the program with mp.spawn, so that it runs in a multiprocessing context?
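For context, the entry point looks roughly like this (simplified; train_worker and the LMDB path below are placeholders):

import torch.multiprocessing as mp


def train_worker(rank, lmdb_path):
    # Each spawned process builds its own CocoInstanceLMDBDataset and a
    # DataLoader with num_workers=4, so 4 GPUs x 4 workers = 16 loader processes.
    ...


if __name__ == "__main__":
    mp.spawn(train_worker, args=("coco_train.lmdb",), nprocs=4)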