anoopkunchukuttan / indic_nlp_library

Resources and tools for Indian language Natural Language Processing
http://anoopkunchukuttan.github.io/indic_nlp_library/
MIT License
546 stars, 158 forks

Poor sentence-splitting performance on the FLORES-200 Hindi data #66

Open asusdisciple opened 1 year ago

asusdisciple commented 1 year ago

I tested the Indic NLP Library for sentence splitting on the Hindi file in the FLORES-200 dataset. However, the performance is really bad, with an F1 score of 0.26. I used the package via Facebook's stopes implementation. My split function looks like this and is applied to a paragraph of 10 sentences. It seems that the package does not recognise "." as a sentence boundary for some reason. Do you have any ideas or suggestions?

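import typing as tp
from indicnlp.normalize.indic_normalize import IndicNormalizerFactory
from indicnlp.tokenize import sentence_tokenize as indic_sent_tok

# The enclosing factory is not shown in the post; its name here is assumed.
def make_split_indic(lang: str) -> tp.Callable[[str], tp.Iterable[str]]:
    indic_normalizer = IndicNormalizerFactory().get_normalizer(lang)

    def split_indic(line: str) -> tp.Iterable[str]:
        """Split Indian text into sentences using Indic NLP tool."""
        line = indic_normalizer.normalize(line)
        yield from indic_sent_tok.sentence_split(line, lang=lang)

    return split_indic
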
oligoglot commented 1 year ago

I haven't looked at the text, but is that in Devanagari? If so, wouldn't the sentence end be "।" and not "."?
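
To illustrate the point about terminators, here is a minimal standard-library sketch. It is not the library's algorithm: `naive_split` is a hypothetical helper and the two sample strings are made up for illustration. It only shows that a splitter keyed on the danda "।" leaves period-terminated Hindi text unsplit, which would depress an F1 score if the input happens to end sentences with ".".

```python
import re
from typing import List


def naive_split(text: str, terminators: str) -> List[str]:
    """Hypothetical helper: split after any character in `terminators`."""
    # Split on whitespace that directly follows one of the terminators.
    pattern = "(?<=[" + re.escape(terminators) + r"])\s+"
    return [s.strip() for s in re.split(pattern, text) if s.strip()]


# Made-up sample text: the same two sentences with different terminators.
danda_text = "यह पहला वाक्य है। यह दूसरा वाक्य है।"
period_text = "यह पहला वाक्य है. यह दूसरा वाक्य है."

print(len(naive_split(danda_text, "।")))    # danda-only splitter, danda text
print(len(naive_split(period_text, "।")))   # danda-only splitter, period text
print(len(naive_split(period_text, "।.")))  # splitter that also accepts "."
```

Treating "." as a terminator this naively would also break on abbreviations and decimal numbers, so it is worth checking which terminator the file actually uses before changing the splitter.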