How to write a Dataset that reduces memory usage and fits in DistributedSampler?

PengNi / ccsmeth

Detecting DNA methylation from PacBio CCS reads

BSD 3-Clause Clear License

How to write a Dataset that reduces memory usage and fits in DistributedSampler? #11

Status: Closed (PengNi closed this issue 2 years ago)

PengNi commented 2 years ago

Would an iterable-style dataset work here? Or a chunkable dataset? See https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader and https://github.com/pytorch/pytorch/pull/26547
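For reference, a minimal sketch of how an iterable-style dataset could split a text file across processes without loading it into memory. This is not ccsmeth's code. The function name `iter_shard` and its parameters are hypothetical, and the one-sample-per-line layout is an assumption. In a real `torch.utils.data.IterableDataset.__iter__`, the rank and world size would presumably come from `torch.distributed.get_rank()` and `torch.distributed.get_world_size()`, and the worker id from `torch.utils.data.get_worker_info()`. The sharding logic itself is plain Python:

```python
import itertools
import os
import tempfile


def iter_shard(path, rank, world_size, worker_id=0, num_workers=1):
    """Yield only the lines of `path` that belong to one (rank, worker) pair.

    Lines are dealt out round-robin, so every line is read by exactly one
    consumer and only the current line is held in memory.
    """
    shard = rank * num_workers + worker_id   # index of this consumer
    num_shards = world_size * num_workers    # total number of consumers
    with open(path, "r") as handle:
        # Keep lines shard, shard + num_shards, shard + 2 * num_shards, ...
        for line in itertools.islice(handle, shard, None, num_shards):
            yield line.rstrip("\n")


# Tiny demonstration on a temporary 10-line file (hypothetical data).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("".join(f"sample{i}\n" for i in range(10)))
    demo_path = tmp.name

# Simulate two ranks with a single worker each.
shards = [list(iter_shard(demo_path, rank, world_size=2)) for rank in range(2)]
os.remove(demo_path)
```

Two caveats with this approach: a sampler such as DistributedSampler cannot be combined with an iterable-style dataset, so shuffling has to be handled separately, and shards of unequal length may leave some ranks waiting at the end of an epoch.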

PengNi commented 2 years ago

Could we reduce the feature file size by converting the text features to HDF5 (txt2hdf5) or to a binary format (txt2binary)? See https://towardsdatascience.com/reading-h5-files-faster-with-pytorch-datasets-3ff86938cc
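A rough sketch of the txt2binary idea, using only the standard library. The real feature layout is not described in this thread, so the example assumes a hypothetical format: each line holds a fixed number of tab-separated floats (`NUM_FEATURES`). With fixed-width records, sample `idx` starts at byte `idx * RECORD.size`, so a single sample can be read without scanning the file:

```python
import os
import struct
import tempfile

NUM_FEATURES = 4                             # hypothetical: values per sample
RECORD = struct.Struct(f"<{NUM_FEATURES}f")  # one little-endian float32 record


def txt_to_binary(txt_path, bin_path):
    """Convert tab-separated text features to fixed-width binary records."""
    count = 0
    with open(txt_path, "r") as src, open(bin_path, "wb") as dst:
        for line in src:
            values = [float(v) for v in line.rstrip("\n").split("\t")]
            dst.write(RECORD.pack(*values))
            count += 1
    return count


def read_record(bin_path, idx):
    """Read sample `idx` directly by seeking to idx * RECORD.size bytes."""
    with open(bin_path, "rb") as handle:
        handle.seek(idx * RECORD.size)
        return RECORD.unpack(handle.read(RECORD.size))


# Demonstration on a small temporary file (made-up values).
workdir = tempfile.mkdtemp()
txt_path = os.path.join(workdir, "features.txt")
bin_path = os.path.join(workdir, "features.bin")
with open(txt_path, "w") as handle:
    for i in range(3):
        handle.write(f"{i}\t{i + 0.5}\t{i * 2}\t0.25\n")

num_samples = txt_to_binary(txt_path, bin_path)
bin_size = os.path.getsize(bin_path)
third = read_record(bin_path, 2)
```

Storing float32 values loses precision compared with the text form, so whether this is acceptable depends on the features. HDF5 would give similar random access plus optional compression, at the cost of an extra dependency.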

PengNi commented 2 years ago

Is indexing the file by line offsets a good solution for now? See https://github.com/pytorch/text/blob/0b4718d7827b7f278cd3169af7f2587c1f663a27/torchtext/datasets/unsupervised_learning.py and https://github.com/pytorch/text/issues/130
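A minimal sketch of the offsets idea, assuming one sample per line. The class name `LineOffsetDataset` is hypothetical and this is not the code that ended up in ccsmeth. The class is plain Python so that it runs without PyTorch; in practice it would subclass `torch.utils.data.Dataset`. Because it is map-style (it defines `__len__` and `__getitem__`), it should be usable with DistributedSampler, which works on indices:

```python
import os
import tempfile


class LineOffsetDataset:
    """Map-style dataset that keeps one byte offset per line in memory."""

    def __init__(self, path):
        self.path = path
        self.offsets = []
        self._handle = None  # opened lazily so each worker gets its own handle
        with open(path, "rb") as handle:
            while True:
                pos = handle.tell()        # byte position where the line starts
                if not handle.readline():  # empty bytes means end of file
                    break
                self.offsets.append(pos)

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, idx):
        if self._handle is None:
            self._handle = open(self.path, "rb")
        self._handle.seek(self.offsets[idx])
        return self._handle.readline().decode("utf-8").rstrip("\n")


# Demonstration on a temporary 5-line file (hypothetical data).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("".join(f"sample{i}\n" for i in range(5)))
    demo_path = tmp.name

dataset = LineOffsetDataset(demo_path)
num_lines = len(dataset)
fourth = dataset[3]
# Strided index split, similar in spirit to what a distributed sampler does.
rank1_items = [dataset[i] for i in range(len(dataset))[1::2]]
dataset._handle.close()
os.remove(demo_path)
```

Memory use is one integer per sample instead of the full feature text. The trade-off is one seek and one read per sample, plus a full pass over the file at construction time to build the index.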

PengNi commented 2 years ago

Use offsets (an index?) for now. Closing this issue.