Create sentence splitting dataset

ctschroeder commented 10 months ago

[x] export sentences from the gold treebank corpus (judging by the syntax tree, an exported unit may actually be two sentences, or not a complete sentence)
[x] spreadsheet with columns: prev-sentence, this-sentence, english, merge back?, type (probably merge back, probably multiple sentences, leave alone)
- if a unit contains multiple sentences, put a pipe where the sentences should break
[x] LCBM: classify sentences and add pipes where sentences should break: https://docs.google.com/spreadsheets/d/1giiCUFwEh2PcOZT2FBSVVHF-WhpCPqKGfw-fIJfTL74/edit#gid=0
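
For illustration, here is a minimal sketch of how the annotated spreadsheet rows might be turned back into sentence units. It is not the project's actual script: the dictionary keys `this_sentence` and `merge_back` are hypothetical stand-ins for the spreadsheet columns, and the English rows are placeholder data rather than corpus text.

```python
def apply_annotations(rows):
    """Rebuild sentence units from annotated rows.

    Each row is a dict with two hypothetical keys (assumed, not taken
    from the real spreadsheet export):
      'this_sentence': the exported unit, with '|' marking sentence breaks
      'merge_back':    True if the unit belongs to the previous sentence
    """
    sentences = []
    for row in rows:
        # A pipe marks a sentence break inside an exported unit.
        parts = [p.strip() for p in row["this_sentence"].split("|") if p.strip()]
        if not parts:
            continue
        if row.get("merge_back") and sentences:
            # Attach the first part to the end of the previous sentence.
            sentences[-1] = sentences[-1] + " " + parts[0]
            parts = parts[1:]
        sentences.extend(parts)
    return sentences


# Placeholder rows for demonstration only.
rows = [
    {"this_sentence": "He went to the city", "merge_back": False},
    {"this_sentence": "and he spoke with them .", "merge_back": True},
    {"this_sentence": "They answered him . | He departed .", "merge_back": False},
]
result = apply_annotations(rows)
```

With these rows, the second unit is merged into the first and the third is split at the pipe, giving three sentences in total.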

NB for later: train on the original layer, or create two models?

ctschroeder commented 6 months ago

Adding a note that @LCBM0828 has handed this off to @amir-zeldes.

amir-zeldes commented 6 months ago

Wonderful, will take a closer look after the release. At some point we could also consider maintaining a repo for that data, or maybe some scripts to harvest additional reliable sentences from new datasets we release to grow this data.

CopticScriptorium / coptic-nlp

Create sentence splitting dataset #35