dbmdz / deep-eos

General-Purpose Neural Networks for Sentence Boundary Detection
GNU Affero General Public License v3.0

Question about experiment #5

Open xuchaoUCAS opened 4 years ago

xuchaoUCAS commented 4 years ago

I don't understand the purpose of this experiment. There are too many errors in the gold set. For example, in europarl-v7.de-en.en.sentences.test.gold, line 73 reads: "I am happy to try and answer, Mr Wijsenbeek. As you will certainly know,……". Here "I am happy to try and answer, Mr Wijsenbeek." is obviously a complete sentence on its own, but the gold set doesn't mark it as one. There are similar cases on lines 130, 175, and many more. So I don't understand what "sentence boundary detection" means for this dataset.
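A rough way to surface lines like these is to flag every line of a one-sentence-per-line gold file that contains sentence-final punctuation followed by whitespace and an uppercase letter. The following is only a minimal sketch and not part of deep-eos: the function name `suspicious_lines` and the regular expression are assumptions made for illustration, the second sample line is made up, and the heuristic will also flag harmless cases such as abbreviations ("Mr. Smith").

```python
import re

# Heuristic (an assumption, not the deep-eos method): '.', '!' or '?'
# followed by whitespace and an uppercase ASCII letter inside a line
# hints at a sentence boundary that the gold file did not split.
INNER_BOUNDARY = re.compile(r"[.!?]\s+[A-Z]")


def suspicious_lines(lines):
    """Return (line_number, text) pairs for lines that may hold several sentences."""
    hits = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if INNER_BOUNDARY.search(text):
            hits.append((number, text))
    return hits


if __name__ == "__main__":
    # First line is the example quoted above; the second is a made-up control.
    sample = [
        "I am happy to try and answer, Mr Wijsenbeek. As you will certainly know, ...",
        "This line holds exactly one sentence.",
    ]
    for number, text in suspicious_lines(sample):
        print(number, text)
```

To run it on the real file, pass an open file handle instead of `sample`, e.g. `suspicious_lines(open(path, encoding="utf-8"))`, where `path` points to the gold file.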

stefan-it commented 4 years ago

Hi @xuchaoUCAS,

we use this kind of dataset because no 100% gold-labeled datasets are available for this task. That's why we refer to them as "quasi-segmented" datasets.

However, in preliminary experiments we used Universal Dependencies (normally used for e.g. PoS tagging). These datasets have a cleaner sentence segmentation. But: they contain fewer sentences than e.g. the Europarl corpora!
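For readers who want to try the Universal Dependencies route: in the CoNLL-U format used by UD, each sentence carries a `# text = ...` comment line with its raw text, so a one-sentence-per-line file can be derived from a treebank. This is a minimal sketch under that assumption, not necessarily how the preliminary experiments were run; the function name `conllu_sentences` and the sample data are hypothetical.

```python
TEXT_PREFIX = "# text = "


def conllu_sentences(lines):
    """Collect the raw sentence text from the '# text = ' comments of a CoNLL-U file."""
    sentences = []
    for line in lines:
        if line.startswith(TEXT_PREFIX):
            sentences.append(line[len(TEXT_PREFIX):].strip())
    return sentences


if __name__ == "__main__":
    # Made-up CoNLL-U fragment; token lines are abbreviated to ID and FORM
    # plus placeholder columns, since only the comment lines matter here.
    sample = [
        "# sent_id = 1\n",
        "# text = Hello world.\n",
        "\t".join(["1", "Hello"] + ["_"] * 8) + "\n",
        "\t".join(["2", "world"] + ["_"] * 8) + "\n",
        "\t".join(["3", "."] + ["_"] * 8) + "\n",
        "\n",
        "# sent_id = 2\n",
        "# text = This is a second sentence.\n",
    ]
    for sentence in conllu_sentences(sample):
        print(sentence)
```

Counting the returned list for a treebank and comparing it with the line count of a Europarl file gives a quick view of the size difference mentioned above.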