Lyken17 / Efficient-PyTorch

My best practice of training large dataset using PyTorch.

Without DDP, using only LMDB is very slow, even slower than plain imread #20

Open Edwardmark opened 4 years ago

Edwardmark commented 4 years ago
import os
import os.path as osp

import lmdb

# raw_reader and dumps_pyarrow are helpers from this repo's tools/folder2lmdb.py:
# raw_reader returns a file's raw bytes; dumps_pyarrow serializes with pyarrow.

def folder2lmdb(anno_file, name="train", write_frequency=5000, num_workers=16):
    ids = []
    annotation = []
    for line in open(anno_file, 'r'):
        fields = line.strip().split()
        ids.append(fields[0])
        annotation.append(fields[1:])
    lmdb_path = "app_%s.lmdb" % name
    isdir = os.path.isdir(lmdb_path)

    print("Generate LMDB to %s" % lmdb_path)
    db = lmdb.open(lmdb_path, subdir=isdir,
                   map_size=1099511627776 * 2, readonly=False,
                   meminit=False, map_async=True)

    print(len(ids), len(annotation))
    txn = db.begin(write=True)
    idx = 0
    for filename, label in zip(ids, annotation):
        print(filename, label)
        image = raw_reader(filename)
        txn.put(u'{}'.format(idx).encode('ascii'), dumps_pyarrow((image, label)))
        if idx % write_frequency == 0:
            print("[%d/%d]" % (idx, len(annotation)))
            txn.commit()
            txn = db.begin(write=True)
        idx += 1

    # finish iterating through dataset
    txn.commit()
    # idx now equals the number of records written; range(idx + 1) would
    # generate one key too many and inflate __len__ by one.
    keys = [u'{}'.format(k).encode('ascii') for k in range(idx)]
    with db.begin(write=True) as txn:
        txn.put(b'__keys__', dumps_pyarrow(keys))
        txn.put(b'__len__', dumps_pyarrow(len(keys)))

    print("Flushing database ...")
    db.sync()
    db.close()

import os.path as osp

import cv2
import lmdb
import numpy as np
import pyarrow as pa
import six
import torch
import torch.utils.data as data
from PIL import Image

class DetectionLMDB(data.Dataset):
    def __init__(self, db_path, transform=None, target_transform=None, dataset_name='WiderFace'):
        self.db_path = db_path
        self.env = lmdb.open(db_path, subdir=osp.isdir(db_path),
                             readonly=True, lock=False,
                             readahead=False, meminit=False)
        with self.env.begin(write=False) as txn:
            # self.length = txn.stat()['entries'] - 1
            self.length = pa.deserialize(txn.get(b'__len__'))
            self.keys = pa.deserialize(txn.get(b'__keys__'))

        self.transform = transform
        self.target_transform = target_transform

        self.name = dataset_name
        self.annotation = list()
        self.counter = 0

    def __getitem__(self, index):
        im, gt, h, w = self.pull_item(index)
        return im, gt

    def pull_item(self, index):
        img, target = None, None
        env = self.env
        with env.begin(write=False) as txn:
            byteflow = txn.get(self.keys[index])
        unpacked = pa.deserialize(byteflow)

        # load image
        imgbuf = unpacked[0]
        buf = six.BytesIO()
        buf.write(imgbuf)
        buf.seek(0)
        img = Image.open(buf).convert('RGB')
        img = cv2.cvtColor(np.asarray(img), cv2.COLOR_RGB2BGR)
        height, width, channels = img.shape
        # load label
        target = unpacked[1]

        if self.target_transform is not None:
            target = self.target_transform(target, width, height)

        if self.transform is not None:
            target = np.array(target)
            img, boxes, labels, poses, angles = self.transform(img, target[:, :4], target[:, 4], target[:,5], target[:,6])
            target = np.hstack((boxes, np.expand_dims(labels, axis=1),
                                       np.expand_dims(poses, axis=1),
                                       np.expand_dims(angles, axis=1)))

        return torch.from_numpy(img).permute(2, 0, 1), target, height, width

    def __len__(self):
        return self.length

    def __repr__(self):
        return self.__class__.__name__ + ' (' + self.db_path + ')'

I generated the LMDB with the code above and used DetectionLMDB as the dataset, but it is very slow and I don't know why. Does it have to be used together with DDP?
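One possible cause (an assumption on my part, not something confirmed in this thread): the class above opens the LMDB environment in `__init__`, so with `num_workers > 0` the handle created before the `DataLoader` fork is shared by every worker process, which can hurt read performance. A common workaround is to open the environment lazily inside each worker; a minimal sketch of that pattern (the attribute names mirror the snippet above, the lazy-init structure is my assumption, and in a real project this would subclass `torch.utils.data.Dataset`):

```python
class LazyLMDB:
    """Defers lmdb.open() until the first __getitem__, i.e. until after
    the DataLoader has forked its worker processes."""

    def __init__(self, db_path, length):
        self.db_path = db_path
        self.length = length
        self.env = None          # deliberately NOT opened in __init__

    def _init_db(self):
        import lmdb              # imported lazily; __init__ stays fork-safe
        self.env = lmdb.open(self.db_path, subdir=False, readonly=True,
                             lock=False, readahead=False, meminit=False)

    def __getitem__(self, index):
        if self.env is None:     # first access inside this worker
            self._init_db()
        with self.env.begin(write=False) as txn:
            return txn.get(u'{}'.format(index).encode('ascii'))

    def __len__(self):
        return self.length
```

Each worker then ends up with its own environment handle after the fork, which is the behavior LMDB expects.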

codermckee commented 3 years ago

I have the same problem: it is slower than the built-in Dataset. Did you ever solve it?

Lyken17 commented 3 years ago

This script is a legacy from quite a while ago (~ torch 0.4), and I'm not sure whether the dataloader has changed since then. Let me run some tests.

leijuzi commented 3 years ago

At first I stored the preprocessed features in the LMDB; they were so large that loading was slow. After switching to storing just each image's raw buffer, it became fast.
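leijuzi's observation can be sanity-checked with rough numbers (illustrative assumptions, not measurements from this thread): a decoded uint8 pixel array is typically an order of magnitude larger than the encoded JPEG bytes, so storing decoded features multiplies the I/O per sample accordingly.

```python
# Rough storage comparison: decoded pixels vs. encoded JPEG bytes.
# Both figures are illustrative assumptions, not measurements.
width, height, channels = 1920, 1080, 3
decoded_bytes = width * height * channels   # uint8 array: ~6.2 MB
typical_jpeg_bytes = 400 * 1024             # assume ~400 KB on disk

ratio = decoded_bytes / typical_jpeg_bytes
print("decoded / encoded = %.1fx" % ratio)
```

This is why storing the raw image buffer (and decoding in `__getitem__`) is usually much faster than storing decoded or preprocessed tensors.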

lizhenstat commented 2 years ago

@Lyken17 Hi, I first tried pytorch-1.10 (cuda 10.2, python 3.8) on one GPU (1080 Ti), but it is too slow. The log is as follows:

Epoch: [0][0/10010]     Time 15.811 (15.811)    Data 14.544 (14.544)    Loss 7.0312 (7.0312)    Acc@1 0.000 (0.000)     Acc@5 0.781 (0.781)
Epoch: [0][10/10010]    Time 0.213 (4.024)      Data 0.000 (3.770)      Loss 7.3495 (7.1619)    Acc@1 0.000 (0.213)     Acc@5 0.000 (0.497)
Epoch: [0][20/10010]    Time 10.217 (4.017)     Data 10.129 (3.817)     Loss 7.2931 (7.2333)    Acc@1 0.000 (0.223)     Acc@5 0.000 (0.595)
Epoch: [0][30/10010]    Time 0.213 (3.740)      Data 0.000 (3.556)      Loss 7.0012 (7.1996)    Acc@1 0.000 (0.176)     Acc@5 0.000 (0.580)
Epoch: [0][40/10010]    Time 8.978 (3.729)      Data 8.890 (3.556)      Loss 7.0080 (7.1619)    Acc@1 0.781 (0.210)     Acc@5 0.781 (0.534)
Epoch: [0][50/10010]    Time 0.220 (3.661)      Data 0.000 (3.492)      Loss 6.9565 (7.1282)    Acc@1 0.000 (0.199)     Acc@5 0.000 (0.597)
Epoch: [0][60/10010]    Time 7.797 (3.635)      Data 7.710 (3.471)      Loss 6.9137 (7.0951)    Acc@1 0.000 (0.218)     Acc@5 0.000 (0.602)
Epoch: [0][70/10010]    Time 0.214 (3.665)      Data 0.000 (3.503)      Loss 6.9065 (7.0728)    Acc@1 0.000 (0.220)     Acc@5 0.000 (0.572)
Epoch: [0][80/10010]    Time 7.347 (3.636)      Data 7.260 (3.477)      Loss 6.8719 (7.0524)    Acc@1 0.000 (0.212)     Acc@5 0.781 (0.637)
Epoch: [0][90/10010]    Time 0.216 (3.590)      Data 0.000 (3.431)      Loss 6.9107 (7.0356)    Acc@1 0.000 (0.206)     Acc@5 0.781 (0.687)
Epoch: [0][100/10010]   Time 9.313 (3.629)      Data 9.219 (3.473)      Loss 6.9006 (7.0217)    Acc@1 0.000 (0.217)     Acc@5 0.000 (0.696)
Epoch: [0][110/10010]   Time 0.212 (3.577)      Data 0.000 (3.421)      Loss 6.8484 (7.0093)    Acc@1 0.000 (0.211)     Acc@5 2.344 (0.739)
Epoch: [0][120/10010]   Time 11.809 (3.600)     Data 11.722 (3.445)     Loss 6.8965 (6.9977)    Acc@1 0.781 (0.213)     Acc@5 1.562 (0.781)
Epoch: [0][130/10010]   Time 0.215 (3.534)      Data 0.000 (3.379)      Loss 6.8403 (6.9883)    Acc@1 0.000 (0.209)     Acc@5 0.000 (0.805)
Epoch: [0][140/10010]   Time 11.093 (3.551)     Data 11.000 (3.400)     Loss 6.9016 (6.9800)    Acc@1 0.000 (0.199)     Acc@5 0.000 (0.803)
Epoch: [0][150/10010]   Time 4.364 (3.523)      Data 4.276 (3.373)      Loss 6.8721 (6.9722)    Acc@1 0.000 (0.191)     Acc@5 0.000 (0.771)
Epoch: [0][160/10010]   Time 9.092 (3.525)      Data 9.004 (3.375)      Loss 6.8635 (6.9640)    Acc@1 0.000 (0.199)     Acc@5 0.781 (0.791)
Epoch: [0][170/10010]   Time 5.724 (3.507)      Data 5.637 (3.359)      Loss 6.8689 (6.9573)    Acc@1 0.000 (0.201)     Acc@5 0.781 (0.777)
Epoch: [0][180/10010]   Time 9.218 (3.506)      Data 9.124 (3.360)      Loss 6.7048 (6.9496)    Acc@1 0.781 (0.207)     Acc@5 3.125 (0.803)
Epoch: [0][190/10010]   Time 3.789 (3.481)      Data 3.700 (3.335)      Loss 6.8398 (6.9441)    Acc@1 0.000 (0.209)     Acc@5 0.000 (0.826)
Epoch: [0][200/10010]   Time 11.521 (3.492)     Data 11.433 (3.347)     Loss 6.8196 (6.9367)    Acc@1 0.000 (0.218)     Acc@5 0.000 (0.875)
Epoch: [0][210/10010]   Time 1.611 (3.465)      Data 1.523 (3.321)      Loss 6.7499 (6.9297)    Acc@1 2.344 (0.233)     Acc@5 2.344 (0.896)
Epoch: [0][220/10010]   Time 11.472 (3.480)     Data 11.383 (3.337)     Loss 6.7838 (6.9230)    Acc@1 0.781 (0.255)     Acc@5 1.562 (0.937)
Epoch: [0][230/10010]   Time 0.212 (3.443)      Data 0.000 (3.299)      Loss 6.8092 (6.9169)    Acc@1 0.000 (0.257)     Acc@5 0.781 (0.944)
Epoch: [0][240/10010]   Time 10.698 (3.472)     Data 10.610 (3.328)     Loss 6.8725 (6.9105)    Acc@1 0.000 (0.253)     Acc@5 0.000 (0.969)
Epoch: [0][250/10010]   Time 0.217 (3.451)      Data 0.000 (3.307)      Loss 6.8506 (6.9055)    Acc@1 0.000 (0.246)     Acc@5 0.000 (0.980)
Epoch: [0][260/10010]   Time 9.317 (3.456)      Data 9.229 (3.312)      Loss 6.7118 (6.9010)    Acc@1 0.000 (0.263)     Acc@5 1.562 (0.988)
Epoch: [0][270/10010]   Time 0.212 (3.439)      Data 0.000 (3.295)      Loss 6.7731 (6.8963)    Acc@1 0.781 (0.277)     Acc@5 1.562 (1.038)
Epoch: [0][280/10010]   Time 11.279 (3.458)     Data 11.191 (3.314)     Loss 6.8488 (6.8909)    Acc@1 0.000 (0.286)     Acc@5 0.781 (1.054)
Epoch: [0][290/10010]   Time 0.214 (3.436)      Data 0.000 (3.292)      Loss 6.7565 (6.8860)    Acc@1 0.000 (0.290)     Acc@5 0.781 (1.079)
Epoch: [0][300/10010]   Time 12.405 (3.458)     Data 12.317 (3.313)     Loss 6.7233 (6.8805)    Acc@1 0.000 (0.298)     Acc@5 1.562 (1.121)
Epoch: [0][310/10010]   Time 0.213 (3.426)      Data 0.000 (3.282)      Loss 6.7484 (6.8755)    Acc@1 0.000 (0.306)     Acc@5 2.344 (1.156)
Epoch: [0][320/10010]   Time 13.653 (3.442)     Data 13.559 (3.298)     Loss 6.7439 (6.8712)    Acc@1 0.000 (0.309)     Acc@5 1.562 (1.173)
Epoch: [0][330/10010]   Time 0.212 (3.418)      Data 0.000 (3.273)      Loss 6.7267 (6.8670)    Acc@1 0.781 (0.314)     Acc@5 2.344 (1.204)
Epoch: [0][340/10010]   Time 13.209 (3.435)     Data 13.121 (3.289)     Loss 6.7553 (6.8636)    Acc@1 1.562 (0.314)     Acc@5 1.562 (1.210)
Epoch: [0][350/10010]   Time 0.218 (3.412)      Data 0.000 (3.265)      Loss 6.6885 (6.8588)    Acc@1 0.000 (0.318)     Acc@5 3.125 (1.249)
Epoch: [0][360/10010]   Time 12.825 (3.427)     Data 12.731 (3.280)     Loss 6.6241 (6.8540)    Acc@1 0.000 (0.316)     Acc@5 2.344 (1.260)
Epoch: [0][370/10010]   Time 0.213 (3.408)      Data 0.000 (3.260)      Loss 6.8046 (6.8504)    Acc@1 0.000 (0.322)     Acc@5 1.562 (1.289)
Epoch: [0][380/10010]   Time 11.702 (3.415)     Data 11.615 (3.267)     Loss 6.7234 (6.8459)    Acc@1 1.562 (0.334)     Acc@5 2.344 (1.312)
Epoch: [0][390/10010]   Time 0.218 (3.399)      Data 0.000 (3.249)      Loss 6.7012 (6.8410)    Acc@1 0.000 (0.342)     Acc@5 2.344 (1.343)
Epoch: [0][400/10010]   Time 12.231 (3.413)     Data 12.144 (3.264)     Loss 6.7159 (6.8370)    Acc@1 0.000 (0.343)     Acc@5 1.562 (1.356)
Epoch: [0][410/10010]   Time 0.213 (3.396)      Data 0.000 (3.245)      Loss 6.5088 (6.8320)    Acc@1 0.000 (0.348)     Acc@5 3.125 (1.382)
Epoch: [0][420/10010]   Time 12.972 (3.407)     Data 12.883 (3.256)     Loss 6.6504 (6.8275)    Acc@1 0.781 (0.349)     Acc@5 4.688 (1.403)
Epoch: [0][430/10010]   Time 0.212 (3.393)      Data 0.000 (3.242)      Loss 6.6490 (6.8246)    Acc@1 0.000 (0.352)     Acc@5 3.906 (1.434)
Epoch: [0][440/10010]   Time 11.984 (3.406)     Data 11.896 (3.255)     Loss 6.7207 (6.8209)    Acc@1 0.781 (0.358)     Acc@5 1.562 (1.465)
Epoch: [0][450/10010]   Time 0.212 (3.387)      Data 0.000 (3.235)      Loss 6.5495 (6.8161)    Acc@1 0.000 (0.357)     Acc@5 0.000 (1.483)
Epoch: [0][460/10010]   Time 11.841 (3.396)     Data 11.748 (3.244)     Loss 6.6327 (6.8123)    Acc@1 0.781 (0.364)     Acc@5 4.688 (1.527)
Epoch: [0][470/10010]   Time 0.212 (3.383)      Data 0.000 (3.231)      Loss 6.5489 (6.8081)    Acc@1 0.781 (0.370)     Acc@5 7.031 (1.558)
Epoch: [0][480/10010]   Time 8.418 (3.389)      Data 8.331 (3.237)      Loss 6.6245 (6.8034)    Acc@1 0.781 (0.377)     Acc@5 1.562 (1.569)
Epoch: [0][490/10010]   Time 0.211 (3.388)      Data 0.000 (3.237)      Loss 6.6849 (6.7994)    Acc@1 1.562 (0.380)     Acc@5 2.344 (1.593)
Epoch: [0][500/10010]   Time 6.984 (3.388)      Data 6.890 (3.237)      Loss 6.4890 (6.7949)    Acc@1 0.781 (0.379)     Acc@5 3.906 (1.616)
Epoch: [0][510/10010]   Time 0.212 (3.391)      Data 0.000 (3.239)      Loss 6.6416 (6.7910)    Acc@1 0.781 (0.382)     Acc@5 2.344 (1.642)
Epoch: [0][520/10010]   Time 2.660 (3.382)      Data 2.572 (3.231)      Loss 6.5715 (6.7870)    Acc@1 0.781 (0.385)     Acc@5 1.562 (1.660)
Epoch: [0][530/10010]   Time 0.212 (3.388)      Data 0.000 (3.236)      Loss 6.5645 (6.7825)    Acc@1 0.781 (0.393)     Acc@5 2.344 (1.680)
Epoch: [0][540/10010]   Time 1.908 (3.379)      Data 1.820 (3.228)      Loss 6.4077 (6.7779)    Acc@1 2.344 (0.394)     Acc@5 3.906 (1.692)
Epoch: [0][550/10010]   Time 0.213 (3.381)      Data 0.000 (3.230)      Loss 6.5599 (6.7736)    Acc@1 0.000 (0.397)     Acc@5 0.781 (1.704)
Epoch: [0][560/10010]   Time 0.856 (3.369)      Data 0.768 (3.218)      Loss 6.6386 (6.7695)    Acc@1 0.781 (0.401)     Acc@5 1.562 (1.732)
Epoch: [0][570/10010]   Time 0.229 (3.377)      Data 0.000 (3.226)      Loss 6.5827 (6.7652)    Acc@1 0.781 (0.409)     Acc@5 3.125 (1.760)
Epoch: [0][580/10010]   Time 0.975 (3.364)      Data 0.887 (3.213)      Loss 6.4518 (6.7610)    Acc@1 0.781 (0.413)     Acc@5 5.469 (1.779)
Epoch: [0][590/10010]   Time 0.212 (3.370)      Data 0.000 (3.219)      Loss 6.5656 (6.7565)    Acc@1 0.000 (0.428)     Acc@5 2.344 (1.823)
Epoch: [0][600/10010]   Time 0.212 (3.355)      Data 0.046 (3.203)      Loss 6.4239 (6.7520)    Acc@1 0.781 (0.437)     Acc@5 3.125 (1.851)
Epoch: [0][610/10010]   Time 0.211 (3.363)      Data 0.000 (3.212)      Loss 6.3226 (6.7474)    Acc@1 1.562 (0.445)     Acc@5 6.250 (1.880)
Epoch: [0][620/10010]   Time 0.214 (3.350)      Data 0.000 (3.198)      Loss 6.5112 (6.7432)    Acc@1 1.562 (0.452)     Acc@5 5.469 (1.906)
Epoch: [0][630/10010]   Time 0.226 (3.354)      Data 0.000 (3.201)      Loss 6.4474 (6.7382)    Acc@1 0.781 (0.458)     Acc@5 3.125 (1.946)
Epoch: [0][640/10010]   Time 0.211 (3.341)      Data 0.000 (3.188)      Loss 6.5718 (6.7347)    Acc@1 0.781 (0.463)     Acc@5 3.906 (1.967)
Epoch: [0][650/10010]   Time 0.214 (3.347)      Data 0.000 (3.194)      Loss 6.5053 (6.7297)    Acc@1 0.781 (0.472)     Acc@5 1.562 (2.008)
Epoch: [0][660/10010]   Time 0.212 (3.343)      Data 0.000 (3.189)      Loss 6.3718 (6.7246)    Acc@1 0.781 (0.482)     Acc@5 3.906 (2.044)
Epoch: [0][670/10010]   Time 0.223 (3.366)      Data 0.000 (3.212)      Loss 6.3855 (6.7196)    Acc@1 0.781 (0.496)     Acc@5 3.906 (2.095)
Epoch: [0][680/10010]   Time 0.212 (3.358)      Data 0.000 (3.204)      Loss 6.5520 (6.7149)    Acc@1 0.781 (0.507)     Acc@5 3.906 (2.129)
Epoch: [0][690/10010]   Time 0.212 (3.370)      Data 0.000 (3.216)      Loss 6.3960 (6.7098)    Acc@1 2.344 (0.510)     Acc@5 7.031 (2.156)
Epoch: [0][700/10010]   Time 0.214 (3.360)      Data 0.000 (3.205)      Loss 6.4797 (6.7055)    Acc@1 0.781 (0.519)     Acc@5 2.344 (2.190)
Epoch: [0][710/10010]   Time 0.227 (3.368)      Data 0.000 (3.212)      Loss 6.3497 (6.7008)    Acc@1 3.125 (0.531)     Acc@5 4.688 (2.217)
Epoch: [0][720/10010]   Time 0.213 (3.358)      Data 0.000 (3.203)      Loss 6.3555 (6.6961)    Acc@1 2.344 (0.543)     Acc@5 6.250 (2.256)
Epoch: [0][730/10010]   Time 0.207 (3.376)      Data 0.000 (3.220)      Loss 6.5028 (6.6923)    Acc@1 0.000 (0.544)     Acc@5 2.344 (2.267)
Epoch: [0][740/10010]   Time 0.210 (3.365)      Data 0.000 (3.209)      Loss 6.2173 (6.6880)    Acc@1 2.344 (0.557)     Acc@5 5.469 (2.313)
Epoch: [0][750/10010]   Time 0.209 (3.372)      Data 0.000 (3.215)      Loss 6.5205 (6.6841)    Acc@1 0.000 (0.564)     Acc@5 2.344 (2.335)
Epoch: [0][760/10010]   Time 0.209 (3.359)      Data 0.000 (3.202)      Loss 6.2149 (6.6788)    Acc@1 1.562 (0.571)     Acc@5 6.250 (2.367)
Epoch: [0][770/10010]   Time 0.209 (3.364)      Data 0.000 (3.207)      Loss 6.4612 (6.6749)    Acc@1 1.562 (0.586)     Acc@5 3.906 (2.403)
Epoch: [0][780/10010]   Time 0.208 (3.353)      Data 0.000 (3.196)      Loss 6.3526 (6.6705)    Acc@1 0.000 (0.598)     Acc@5 3.906 (2.439)
Epoch: [0][790/10010]   Time 0.210 (3.359)      Data 0.000 (3.202)      Loss 6.2106 (6.6650)    Acc@1 0.781 (0.607)     Acc@5 3.906 (2.469)
Epoch: [0][800/10010]   Time 0.209 (3.353)      Data 0.000 (3.195)      Loss 6.1517 (6.6601)    Acc@1 3.906 (0.615)     Acc@5 8.594 (2.503)

After seeing this issue, I tried installing pytorch-0.4.1 (python 3.6, cuda 9.0) to rerun the code, but I hit the following error message:

main.py:87: UserWarning: You have chosen a specific GPU. This will completely disable data parallelism.
  warnings.warn('You have chosen a specific GPU. This will completely '
=> creating model 'resnet18'
Traceback (most recent call last):
  File "main.py", line 344, in <module>
    main()
  File "main.py", line 152, in main
    normalize,
  File "/home/sirius/document/siriusShare/Clustering-Face/arcface-pytorch-master/code/Efficient-PyTorch-master/tools/folder2lmdb.py", line 31, in __init__
    self.length =pa.deserialize(txn.get(b'__len__'))
  File "pyarrow/serialization.pxi", line 458, in pyarrow.lib.deserialize
  File "pyarrow/serialization.pxi", line 420, in pyarrow.lib.deserialize_from
  File "pyarrow/serialization.pxi", line 397, in pyarrow.lib.read_serialized
  File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Cannot read a negative number of bytes from BufferReader.

Hi, can you tell us which python, cuda, pytorch, and pyarrow versions you were using? Thanks very much for your help. I've spent weeks on this problem; I tried hdf5 and DALI before, but neither solved it, and even the official ImageNet classification training shows GPU utilization oscillating between 100% and 0%.
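For what it's worth, the log above already localizes the bottleneck: at iteration 800 the running averages are Time 3.353 s and Data 3.195 s per step, so nearly the whole step is spent waiting on data, which matches the oscillating GPU utilization. A sketch of the arithmetic:

```python
# Running averages at iteration 800 of the log above (seconds per step)
total_avg = 3.353   # "Time" column, averaged
data_avg = 3.195    # "Data" column, averaged

fraction = data_avg / total_avg
print("data loading accounts for %.0f%% of each step" % (fraction * 100))  # ~95%
```

With ~95% of each step spent in data loading, the GPU is compute-idle most of the time, so the fix has to come from the input pipeline rather than the model.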