castorini / castor

PyTorch deep learning models for text processing
http://castor.ai/
Apache License 2.0

Fix insane memory usage when loading datasets #165

Open daemon opened 5 years ago

daemon commented 5 years ago

@achyudhk reports that CharCNN uses 63 GB of RAM on some datasets (Hydra and Dragon both have 64 GB). I think a solution would be some mechanism for moving data between disk and RAM as needed.
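
For concreteness, a rough sketch of what I have in mind (hypothetical file layout and names, nothing that exists in the repo): keep only byte offsets in RAM, then read and transform each example from disk inside `__getitem__`.

```python
from torch.utils.data import Dataset, DataLoader


class LazyTextDataset(Dataset):
    """Keeps only byte offsets in RAM; reads and transforms one example on demand."""

    def __init__(self, path, transform):
        self.path = path
        self.transform = transform  # e.g. character quantization for CharCNN
        self.offsets = []
        offset = 0
        with open(path, "rb") as f:
            for line in f:
                self.offsets.append(offset)
                offset += len(line)

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, index):
        # Seek to the example's byte offset and read just that line.
        with open(self.path, "rb") as f:
            f.seek(self.offsets[index])
            line = f.readline().decode("utf-8")
        label, text = line.rstrip("\n").split("\t", 1)
        return self.transform(text), int(label)


# loader = DataLoader(LazyTextDataset("train.tsv", quantize), batch_size=32,
#                     num_workers=4)
```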

achyudh commented 5 years ago

For now this is specific to CharCNN because of the large size of the character-quantized matrices, but in general I feel it's better to stream the dataset from disk and preprocess it on the fly rather than caching all of it in memory.
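
Something along these lines, maybe (just a sketch, assuming a TSV layout and PyTorch >= 1.2 for `IterableDataset`; the alphabet, `MAX_LEN`, and `quantize` below are illustrative, not Castor's actual settings):

```python
import string

import torch
from torch.utils.data import IterableDataset, DataLoader

# Illustrative alphabet and input length, not the repo's real configuration.
ALPHABET = string.ascii_lowercase + string.digits + string.punctuation
CHAR_INDEX = {c: i for i, c in enumerate(ALPHABET)}
MAX_LEN = 1014  # CharCNN-style fixed input length


def quantize(text):
    """One-hot encode the first MAX_LEN characters of `text`."""
    matrix = torch.zeros(len(ALPHABET), MAX_LEN)
    for pos, char in enumerate(text.lower()[:MAX_LEN]):
        idx = CHAR_INDEX.get(char)
        if idx is not None:
            matrix[idx, pos] = 1.0
    return matrix


class StreamingCharDataset(IterableDataset):
    """Yields (quantized_matrix, label) pairs straight from a TSV file on disk."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                label, text = line.rstrip("\n").split("\t", 1)
                yield quantize(text), int(label)


# loader = DataLoader(StreamingCharDataset("train.tsv"), batch_size=32)
```

This way only one batch of one-hot matrices exists in memory at a time instead of the full dataset.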

Ashutosh-Adhikari commented 5 years ago

@achyudhk, I think we can fix this issue after 10th Dec, given that we have been using the repo for quite a while now. Sounds good? @daemon, can you assign us to the issue? Even HAN was facing a similar issue, though perhaps not as alarming as CharCNN.