gc2gh / splitta

Automatically exported from code.google.com/p/splitta

Training file format #8

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
Hi!

I'm trying to use your system for Portuguese and other Latin languages.

I've read other issues asking about the training file format but I could not 
find an answer :(

Can you post/send me a snippet of the training file?

Thank you,
Mario

Original issue reported on code.google.com by mario.al...@gmail.com on 22 Jun 2011 at 5:21

GoogleCodeExporter commented 8 years ago
If sentence boundary detection is your only goal and you simply want to train a 
classifier for it, your training data should be one or more files in which each 
sentence-ending delimiter is preceded by <S>. For example, your training file 
could contain the following text:

The quick brown fox jumps over the lazy dog <S>. Mr. XYZ went to New York <S>. 

Note that the period is preceded by <S>. You do not need to put each sentence 
on its own line.
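
If you already have text that is split into sentences, a small script along these lines could produce a file in the format described above. This is only a rough sketch: `mark_boundaries` and `SENTENCE_ENDERS` are made-up names and not part of splitta, and treating "?" and "!" the same way as "." is an assumption, since the example above only shows periods.

```python
# Sketch only: mark_boundaries is a hypothetical helper, not part of splitta.
# Assumption: "?" and "!" are marked the same way as "."; the thread only shows ".".
SENTENCE_ENDERS = ".?!"


def mark_boundaries(sentences):
    """Join pre-split sentences into one string, putting <S> before each final delimiter."""
    marked = []
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue  # skip empty entries
        if sentence[-1] not in SENTENCE_ENDERS:
            raise ValueError("sentence has no final delimiter: %r" % sentence)
        # Split off the final delimiter and drop any space left before it.
        body, delimiter = sentence[:-1].rstrip(), sentence[-1]
        marked.append(body + " <S>" + delimiter)
    # Sentences do not need to be on separate lines, so join them with spaces.
    return " ".join(marked)


sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "Mr. XYZ went to New York.",
]
print(mark_boundaries(sentences))
```

Only the last period of each sentence gets the marker, so abbreviations such as "Mr." stay unmarked and serve as negative examples.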

Original comment by rohitkel...@gmail.com on 2 Jun 2013 at 4:23