Reducing space usage of dataset.pt generated by preprocess.py

dbiir / UER-py

Open Source Pre-training Model Framework in PyTorch & Pre-trained Model Zoo

https://github.com/dbiir/UER-py/wiki

Apache License 2.0

Reducing space usage of dataset.pt generated by preprocess.py #303

Open Eric8932 opened 2 years ago

Eric8932 commented 2 years ago

The dataset.pt file generated by preprocess.py takes up too much disk space, and there is still a lot of room for optimization. For example, the PAD tokens in src do not need to be stored in dataset.pt.

I'm looking forward to this optimization. I think it is important for letting this project process very large corpora.
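
To make the PAD suggestion concrete, here is a rough sketch of the idea, not the project's actual code. It assumes each instance's src is a list of token ids padded at the end to a fixed length; `PAD_ID`, `SEQ_LENGTH`, `strip_pad` and `restore_pad` are hypothetical names chosen for illustration. Trailing PAD ids are dropped before pickling and added back when the instance is loaded.

```python
import pickle

PAD_ID = 0       # assumption: id of the PAD token in the vocabulary
SEQ_LENGTH = 8   # assumption: fixed sequence length used for padding


def strip_pad(src, pad_id=PAD_ID):
    """Return src without its trailing PAD ids."""
    end = len(src)
    while end > 0 and src[end - 1] == pad_id:
        end -= 1
    return src[:end]


def restore_pad(src, seq_length=SEQ_LENGTH, pad_id=PAD_ID):
    """Pad src on the right back to seq_length."""
    return src + [pad_id] * (seq_length - len(src))


# Illustrative instance: five real token ids followed by three PAD ids.
padded = [101, 7, 8, 9, 102, 0, 0, 0]
compact = strip_pad(padded)

# Compare the serialized size of the padded and the compact form.
full_bytes = len(pickle.dumps(padded))
compact_bytes = len(pickle.dumps(compact))

# The original padded sequence is recovered at load time.
restored = restore_pad(compact)
```

This only works if no real token at the end of a sequence shares the PAD id. The saving grows with the share of padding in the corpus, so it should matter most for datasets with many short sequences.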