JnRMnT / ZemberekDotNet

ZemberekDotNet is the .NET Port of Zemberek-NLP (Natural Language Processing tools for Turkish).
Apache License 2.0

Update SpaceTabTokenizer.cs #1

Status: Closed (ilysorc closed this 2 years ago)

ilysorc commented 3 years ago

Fixed the Substring length problem.

ilysorc commented 3 years ago

Hi again @JnRMnT,

The cause of the problem is that Substring works differently in .NET. Java's substring expects a startIndex and an endIndex, but .NET's Substring expects a length as its second parameter. Because of this, SpaceTabTokenizer throws an error when splitting words in .NET ("The substring does not work as expected if the length is greater than the String length"): an end index passed where a length is expected can reach past the end of the string. You can reproduce this with the first row of the news-title-category dataset: "labeldünya Yabancılar dokunmasın diye kızının boğulmasına göz yumdu".
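
To illustrate the difference, here is a minimal sketch. It is not the actual SpaceTabTokenizer code; the `SubstringCompat` class and its `JavaSubstring` helper are hypothetical names introduced only for this example. It shows why passing a Java-style end index to .NET's `Substring(startIndex, length)` can throw, and how converting the end index to a length (`endIndex - beginIndex`) avoids that.

```csharp
using System;

// Hypothetical helper, not part of ZemberekDotNet: emulates Java's
// String.substring(beginIndex, endIndex) on top of .NET's
// String.Substring(startIndex, length).
public static class SubstringCompat
{
    public static string JavaSubstring(string s, int beginIndex, int endIndex)
    {
        // .NET wants a length, so convert the exclusive end index.
        return s.Substring(beginIndex, endIndex - beginIndex);
    }
}

public static class Program
{
    public static void Main()
    {
        // A fragment of the row quoted above; indices are computed, not hard-coded.
        string line = "Yabancılar dokunmasın diye";
        int start = line.IndexOf(' ') + 1;   // first character of the second word
        int end = line.IndexOf(' ', start);  // exclusive end of the second word

        try
        {
            // Java-style call ported verbatim: 'end' is treated as a length,
            // so start + end runs past the end of the string.
            Console.WriteLine(line.Substring(start, end));
        }
        catch (ArgumentOutOfRangeException)
        {
            Console.WriteLine("Substring(start, end) threw ArgumentOutOfRangeException");
        }

        // Corrected call: pass the length instead of the end index.
        Console.WriteLine(SubstringCompat.JavaSubstring(line, start, end));
    }
}
```

A direct port would more likely inline `end - start` at each call site; the helper only makes the Java-to-.NET index conversion explicit.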