Open xuchaoUCAS opened 4 years ago
Hi @xuchaoUCAS,
we use this kind of dataset because there are no 100% gold-labeled datasets available for this task. That's why we refer to them as "quasi-segmented" datasets.
However, in preliminary experiments we used Universal Dependencies (normally used for e.g. PoS tagging). These datasets have a cleaner sentence segmentation. But they contain fewer sentences than, e.g., the Europarl corpora!
I don't understand the point of this experiment. There are too many errors in the gold set. For example, in europarl-v7.de-en.en.sentences.test.gold, line 73: "I am happy to try and answer, Mr Wijsenbeek. As you will certainly know,……". Here "I am happy to try and answer, Mr Wijsenbeek." is obviously a single sentence, but the gold set doesn't mark it as one. Similar cases: lines 130, 175, ... there are too many. So I don't understand what "sentence boundary detection" means for this dataset.
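A rough way to estimate how many gold lines are affected is sketched below. It assumes the gold file holds one sentence per line, which may not match the actual file format, and it uses a crude heuristic (sentence-final punctuation, whitespace, then an uppercase letter). The function name `find_suspect_lines` and the sample data are hypothetical, and abbreviations such as "Dr." will produce false positives.

```python
import re

# Heuristic (assumption): ".", "!" or "?" followed by whitespace and an
# uppercase ASCII letter suggests a sentence boundary inside the line.
_INNER_BOUNDARY = re.compile(r"[.!?]\s+(?=[A-Z])")


def find_suspect_lines(lines):
    """Return (line_number, text) pairs for lines that seem to hold more than one sentence."""
    suspects = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if _INNER_BOUNDARY.search(text):
            suspects.append((number, text))
    return suspects


# Hypothetical sample; the first entry is the truncated example quoted above.
sample = [
    "I am happy to try and answer, Mr Wijsenbeek. As you will certainly know,",
    "This line holds exactly one sentence.",
]

for number, text in find_suspect_lines(sample):
    print(number, text)
```

To run it on the real file, an open file handle can be passed directly, e.g. `find_suspect_lines(open(path, encoding="utf-8"))`. The reported lines would still need manual review because of the heuristic's false positives.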