lwahomura closed this issue 4 years ago
Hello! Here is the link to the data from the original paper that this repo is replicating: The source code from the paper is there also, but to be honest I found it a little challenging to use as is, which is why I started this repo!
The format is one word per line, with space-separated characters. In the target data, "!" marks a morpheme boundary.
For source data:
o i n k i p a n t i
t l a w a l
t i w e
s k a k o k w i
For target data:
o ! i n ! k ! i p a n t i
t l a w a l
t ! i w e
s ! k ! a k o k w i
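If you are preparing your own data, the format above could be produced with something like the sketch below. The function names are hypothetical and not part of this repo, and it assumes each character is a single Unicode code point and that "!" never occurs inside a word.

```python
from __future__ import annotations

BOUNDARY = "!"  # morpheme boundary marker used in the target data


def to_source_line(word: str) -> str:
    """Format an unsegmented word as space-separated characters."""
    return " ".join(word)


def to_target_line(morphemes: list[str]) -> str:
    """Format a segmented word: characters separated by spaces,
    morphemes separated by the boundary marker."""
    return f" {BOUNDARY} ".join(" ".join(m) for m in morphemes)


def parse_target_line(line: str) -> list[str]:
    """Recover the list of morphemes from one target-format line."""
    morphemes: list[list[str]] = [[]]
    for token in line.split():
        if token == BOUNDARY:
            morphemes.append([])  # start a new morpheme
        else:
            morphemes[-1].append(token)
    return ["".join(m) for m in morphemes]


if __name__ == "__main__":
    print(to_source_line("tiwe"))
    print(to_target_line(["t", "iwe"]))
    print(parse_target_line("s ! k ! a k o k w i"))
```

A word with no boundaries, such as "tlawal", comes out identical in the source and target files.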
Thanks a lot! This turned out to be very helpful, and the metrics got much better!
Great, I'm glad to hear that! Sorry that the code is not thoroughly documented; let me know if there's anything I can help clarify.
Could you please provide some test data? I wanted to try your segmenter, though it'd be easier if there was some data to see the required format.