JuliaText / WordTokenizers.jl

High performance tokenizers for natural language processing and other related tasks

Sentence Splitters: no sentence break in between two words with no punctuation #62

Open dhruvil410 opened 3 years ago

dhruvil410 commented 3 years ago

Fix #60. We could also fix the issue by replacing \n with a space up front, as soon as we receive the input: that is, by adding the line sentences=replace(sentences, r"\n" => Base.SubstitutionString(" ")) at the start of the function rulebased_split_sentences(sentences). Alternatively, the committed code could be extended to handle characters other than alphanumerics. Which is the better way to fix this issue? Any other suggestions are welcome.
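To make the first option concrete, here is a minimal sketch of the newline replacement on its own. The helper name normalize_newlines is hypothetical and not part of WordTokenizers.jl. A plain " " is used as the replacement, which should behave the same as Base.SubstitutionString(" ") here because no capture groups are referenced.

```julia
# Hypothetical helper, not part of WordTokenizers.jl.
# Replaces every newline with a single space before sentence splitting.
normalize_newlines(text::AbstractString) = replace(text, r"\n" => " ")

text = "This is one line\nthat wraps. Here is another."
normalized = normalize_newlines(text)
println(normalized)  # prints: This is one line that wraps. Here is another.
```

The normalized string could then be passed to rulebased_split_sentences in place of the raw input.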

triztian commented 3 years ago

I think adding tests would help make this fix more robust. Also, since it would change the output of the function, maybe make it an optional keyword argument, so that those who need this behavior enable it explicitly rather than having it change all of a sudden.

For example updating rulebased_split_sentences:

https://github.com/JuliaText/WordTokenizers.jl/blob/d181905784f1130ef601f1a80a7f5b8065a4404a/src/sentences/sentence_splitting.jl#L1

So that it can be called like this:

rulebased_split_sentences(sentence, collapse_newlines=true)

With that option enabled, runs of multiple newlines would be reduced to one newline and single newlines would be removed.
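A rough sketch of how such an opt-in keyword could look, together with the kind of test suggested above. The names collapse and split_with_option are hypothetical, and the splitter argument stands in for rulebased_split_sentences so the sketch runs without the package. It also assumes that a lone newline is replaced by a space rather than deleted outright, so that the words on either side do not fuse.

```julia
using Test

# Hypothetical names; none of this is part of WordTokenizers.jl.
# Runs of 2+ newlines become one newline; a lone newline becomes a space (assumption).
collapse(text::AbstractString) = replace(text, r"\n+" => m -> length(m) == 1 ? " " : "\n")

# `splitter` stands in for rulebased_split_sentences.
function split_with_option(splitter, text::AbstractString; collapse_newlines::Bool=false)
    prepared = collapse_newlines ? collapse(text) : text
    return splitter(prepared)
end

@testset "collapse_newlines keyword" begin
    text = "First line\ncontinues here.\n\n\nSecond paragraph."
    expected = "First line continues here.\nSecond paragraph."
    # The default leaves the input untouched, so existing callers see no change.
    @test split_with_option(identity, text) == text
    @test split_with_option(identity, text; collapse_newlines=true) == expected
end
```

Defaulting the keyword to false keeps the current output for everyone who does not opt in, which is the point of making the behavior explicit.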

dhruvil410 commented 3 years ago

I'm not familiar with the checks. Why didn't the code pass them?