The dataset.pt file generated by preprocess.py takes up too much disk space, and there is still a lot of room for optimization. For example, the PAD tokens in src do not need to be stored in dataset.pt.
I'm looking forward to this optimization, and I think it is important if this project is to process very large corpora.
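
To illustrate the idea, here is a minimal sketch of storing only the real tokens and re-padding each batch at load time. It is not taken from preprocess.py: the padding index `PAD = 0` and the helper names `strip_padding` and `pad_batch` are assumptions made up for this example, and it uses plain Python lists rather than the project's actual data structures.

```python
PAD = 0  # assumed padding index; the project's actual constant may differ


def strip_padding(padded_rows, pad=PAD):
    """Drop trailing PAD tokens so only real tokens need to be stored."""
    stripped = []
    for row in padded_rows:
        end = len(row)
        # Walk back from the end of the row past any trailing PAD entries.
        while end > 0 and row[end - 1] == pad:
            end -= 1
        stripped.append(row[:end])
    return stripped


def pad_batch(rows, pad=PAD):
    """Re-pad variable-length rows to the longest row in this batch only."""
    width = max((len(row) for row in rows), default=0)
    return [row + [pad] * (width - len(row)) for row in rows]


# Hypothetical src sequences, all padded to a fixed length of 6.
padded = [[5, 7, 9, 0, 0, 0], [3, 4, 0, 0, 0, 0], [8, 2, 6, 1, 0, 0]]

stored = strip_padding(padded)   # what would be written to disk
batch = pad_batch(stored[:2])    # padding is recreated per batch when loading
```

With this layout the stored size grows with the number of real tokens instead of with the number of sentences times the maximum sentence length, and each batch is padded only to its own longest sentence.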