Beam is well suited to building streaming data processing pipelines and scales well to large workloads. We could therefore use Beam for two things:
Reimplement the scripts inside nlp_process on top of Beam.
Use Beam to accelerate data filtering and mining, which would help us crawl larger high-quality datasets and speed up the data preparation stage of NLP pretraining.
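
To make the second item more concrete, here is a minimal sketch of what a filtering step could look like. It assumes "Beam" means Apache Beam and its Python SDK (`apache_beam`). The helper names (`normalize`, `is_high_quality`, `run`), the `MIN_WORDS` threshold, and the quality heuristic are all hypothetical and are not taken from the existing nlp_process scripts. The sketch reads text files in batch mode for simplicity; a streaming source would replace the read step.

```python
import re

MIN_WORDS = 5  # hypothetical threshold, chosen only for illustration


def normalize(line: str) -> str:
    """Collapse runs of whitespace and strip both ends."""
    return re.sub(r"\s+", " ", line).strip()


def is_high_quality(line: str) -> bool:
    """Toy heuristic: enough words, and at least 60% alphabetic characters."""
    words = line.split()
    if len(words) < MIN_WORDS:
        return False
    alpha = sum(ch.isalpha() for ch in line)
    return alpha / len(line) >= 0.6


def run(input_pattern: str, output_prefix: str) -> None:
    """Read text lines, clean and filter them, drop duplicates, write shards."""
    # Imported here so the helpers above stay usable without Beam installed.
    import apache_beam as beam

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "Read" >> beam.io.ReadFromText(input_pattern)
            | "Normalize" >> beam.Map(normalize)
            | "Filter" >> beam.Filter(is_high_quality)
            | "Dedup" >> beam.Distinct()
            | "Write" >> beam.io.WriteToText(output_prefix)
        )
```

Because the cleaning and filtering logic lives in plain functions, it can be unit tested without a runner, and the same functions could be reused if the nlp_process scripts are ported.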