zhongpeixiang closed this issue 5 years ago
I figured out how your scripts work on local machines.
For OpenSubtitles, you assume that the context of a message is the preceding N sentences. Do you have any plans to use a segmentation tool to split each conversation into turns? E.g., Automatic Turn Segmentation for Movie & TV Subtitles.
Thanks, Peixiang
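To make the assumption in the comment above concrete, here is a minimal sketch of pairing each sentence with its preceding N sentences as context. The function name `make_examples`, the `max_context` parameter, and the dictionary keys are hypothetical and chosen for illustration; this is not the repository's actual code.

```python
from typing import Dict, List


def make_examples(sentences: List[str], max_context: int) -> List[Dict[str, object]]:
    """Pair each sentence with up to `max_context` preceding sentences.

    Hypothetical helper: every sentence after the first becomes a response,
    and the sentences immediately before it become its context.
    """
    examples = []
    for i in range(1, len(sentences)):
        start = max(0, i - max_context)
        examples.append({
            "context": sentences[start:i],  # oldest sentence first
            "response": sentences[i],
        })
    return examples


# Made-up subtitle lines, purely for illustration.
subtitles = ["Hi.", "How are you?", "Fine, thanks.", "Good to hear."]
examples = make_examples(subtitles, max_context=2)
for example in examples:
    print(example)
```

A sliding window like this ignores who is speaking and where a scene changes, so consecutive lines from the same speaker, or from unrelated conversations, can end up as context and response. That is the gap a turn segmentation step would address.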
(For future reference, you can run locally using --runner DirectRunner.)
We don't currently have any plans to do better turn segmentation, but I'm sure it would greatly improve the quality of that dataset. We would definitely welcome contributions in that direction!
This repo processes datasets using GCP services. Is there a tutorial for using your scripts to process raw data on local machines?
Thanks, Peixiang