PolyAI-LDN / conversational-datasets

Large datasets for conversational AI
Apache License 2.0

Any ways to process dataset on local machines? #57

Closed. zhongpeixiang closed this issue 5 years ago.

zhongpeixiang commented 5 years ago

This repo processes datasets using GCP services. Is there any tutorial on using your scripts to process the raw data on local machines?

Thanks, Peixiang

zhongpeixiang commented 5 years ago

I figured out how your script works on local machines.

For OpenSubtitles, you assume that the context of a message is the preceding N sentences. Do you have any plans to use segmentation tools to split each conversation into turns, e.g. as in Automatic Turn Segmentation for Movie & TV Subtitles?
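For reference, here is a rough sketch of the preceding-N-sentences context scheme described above; the function and names are illustrative, not taken from the repo's scripts, and the real pipeline applies additional filtering.

```python
# Illustrative sketch only: pair each subtitle line with its preceding
# N lines as context, the scheme this issue is discussing.
def build_examples(subtitle_lines, max_context=10):
    """Return (context, response) examples from an ordered list of lines."""
    examples = []
    for i in range(1, len(subtitle_lines)):
        context = subtitle_lines[max(0, i - max_context):i]
        examples.append({
            "context": " ".join(context),
            "response": subtitle_lines[i],
        })
    return examples


print(build_examples(
    ["How are you?", "Fine, thanks.", "Good to hear."], max_context=2))
```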

Thanks, Peixiang

matthen commented 5 years ago

(For future reference, you can run locally using --runner DirectRunner)
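For anyone else landing here, this is a minimal, self-contained Apache Beam example (not one of the repo's own scripts) showing how the DirectRunner option makes a pipeline execute locally instead of on Dataflow; the pipeline body is purely illustrative.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# DirectRunner executes the pipeline on the local machine,
# so no GCP project or Dataflow job is needed.
options = PipelineOptions(["--runner=DirectRunner"])

with beam.Pipeline(options=options) as p:
    (p
     | "Create" >> beam.Create(["hello", "world"])
     | "Print" >> beam.Map(print))
```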

We don't currently have plans to do better turn segmentation, but I'm sure it would greatly improve the quality of that dataset. We would definitely welcome contributions in that direction!