arademaker closed this issue 4 years ago
Hi @lluisp! I just tested and I was able to add ... in the splitter.dat file. I understand that this is possible because the splitter runs after the tokenizer, and I found in the tokenizer.dat file the rule for keeping ... as a single token.
Summing up, I believe the following sentence in the documentation should be revised, since the word "characters" is misleading. Sentence end markers do not have to be single characters, right?
The SentenceEnd section lists which characters are considered as possible sentence endings.
Yes, that's it... Sorry for missing your previous post :}
Hi @lluisp, what about sentences ending with abbreviations? In the corpus I am working on, we have many occurrences of S.A. at the end of sentences. If I add this abbreviation, the tokenizer takes it as a single token. But then the sentence splitter can't find the period to split the sentence. Any general advice?
I believe that in Portuguese people follow the general rule for English https://english.stackexchange.com/questions/8382/when-etc-is-at-the-end-of-a-phrase-do-you-place-a-period-after-it
Example:
Tornou-se nesse ano diretor da empresa Buriti S.A. De 1948 a 1950, presidiu a Associação Comercial da Bahia.
FreeLing does sentence splitting after tokenization, which complicates things a bit. We plan to build a sentence splitter that works on text some day. So, to solve this you need to trick FreeLing somehow. One option could be adding "S.A." to the list of sentence-ending markers. It is a hack, but I am afraid there is not much else you can do.
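A toy illustration of why a token-level splitter misses this boundary, and what the hack changes. This is not FreeLing code: `tokenize`, `split_sentences`, the abbreviation set, and the rule "a marker ends a sentence when the next token is capitalized" are all assumptions made for the sketch.

```python
import re

# Hypothetical abbreviation list; in FreeLing this would live in tokenizer.dat.
ABBREVIATIONS = {"S.A."}


def tokenize(text, abbreviations=ABBREVIATIONS):
    """Toy tokenizer: known abbreviations stay whole, other punctuation is split off."""
    tokens = []
    for chunk in text.split():
        if chunk in abbreviations:
            tokens.append(chunk)
        else:
            # Words (optionally hyphenated) or single punctuation characters.
            tokens.extend(re.findall(r"\w+(?:-\w+)*|[^\w\s]", chunk))
    return tokens


def split_sentences(tokens, end_markers):
    """Toy splitter: close a sentence after a marker token when the next
    token is capitalized or the input ends (an assumed rule)."""
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if tok in end_markers and (nxt is None or nxt[0].isupper()):
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


text = ("Tornou-se nesse ano diretor da empresa Buriti S.A. "
        "De 1948 a 1950, presidiu a Associação Comercial da Bahia.")
tokens = tokenize(text)

# The period is hidden inside the "S.A." token, so no split happens there.
print(len(split_sentences(tokens, {".", "?", "!"})))

# The hack: treat the whole abbreviation token as a sentence-end marker.
print(len(split_sentences(tokens, {".", "?", "!", "S.A."})))
```

With the default markers the splitter never sees a standalone "." after "Buriti", so the text stays as one sentence. Adding the abbreviation itself as a marker restores the split.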
CoreNLP also splits after tokenization. What would be the benefits of splitting before tokenization? OpenNLP does that, but with a training-based approach. What would be your approach? CoreNLP correctly splits the fragment below into two sentences:
Apple Computer, Inc. became Apple, Inc. Alex said today.
Regarding the hack, the list of abbreviations is big, and almost all of them are already listed in the abbreviations tag in the tokenization configuration file. The suggested idea could be applied to all abbreviations by the splitter, don't you think?
On 11/2/20 0:26, Alexandre Rademaker wrote:
> CoreNLP also splits after tokenization. What would be the benefits of splitting before tokenization?
If you do it after, you deal with tokens, not with characters, so you have issues such as the one with S.A.
> OpenNLP does that, but with a training-based approach. What would be your approach?
Yes, it should be ML-based. In that case, you can have features that refer to characters, so it doesn't matter whether it runs before or after tokenization (because you are working at the character level anyway).
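To make "features that refer to characters" concrete, here is a minimal sketch of what a character-level feature extractor for candidate boundaries might look like. The feature names and the choice of "." as the only candidate are hypothetical; nothing here is taken from FreeLing or OpenNLP, and no classifier is trained.

```python
def candidates(text):
    """Positions of every '.' character, each a possible sentence boundary."""
    return [i for i, ch in enumerate(text) if ch == "."]


def char_features(text, i):
    """Character-level features for the candidate boundary at text[i].

    All feature names are illustrative assumptions; a real model would
    likely use many more.
    """
    before = text[:i].split()
    prev_word = before[-1] if before else ""
    rest = text[i + 1:]
    following = rest.lstrip()
    return {
        "prev_char_is_upper": i > 0 and text[i - 1].isupper(),
        "prev_word_has_period": "." in prev_word,  # e.g. "S.A" before the last dot
        "followed_by_space": rest[:1].isspace(),
        "next_char_is_upper": following[:1].isupper(),
        "at_end_of_text": following == "",
    }


sample = "Buriti S.A. De 1948"
for pos in candidates(sample):
    print(pos, char_features(sample, pos))
```

Because the features look at the characters around each period, the period inside S.A. remains visible to the model whether or not a tokenizer has already grouped the abbreviation into one token.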
> Regarding the hack, the list of abbreviations is big, and almost all of them are already listed in the abbreviations tag in the tokenization configuration file. The suggested idea could be applied to all abbreviations by the splitter, don't you think?
Yes, but that would split names such as "my boss S.A. Smith is a very nice man".
I think it is better to miss one sentence split than to add false ones...
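The false split can be reproduced with a small token-level sketch. The splitting rule used here (a marker token followed by a capitalized token ends the sentence) is an assumption for illustration, not FreeLing's actual logic, and `split_on_markers` is a hypothetical helper.

```python
def split_on_markers(tokens, end_markers):
    """Yield sentences, closing one after any marker token that is followed
    by a capitalized token or by the end of input (assumed rule)."""
    current = []
    for i, tok in enumerate(tokens):
        current.append(tok)
        is_last = i == len(tokens) - 1
        if tok in end_markers and (is_last or tokens[i + 1][0].isupper()):
            yield current
            current = []
    if current:
        yield current


# Already tokenized, with the abbreviation kept as one token.
tokens = ["my", "boss", "S.A.", "Smith", "is", "a", "very", "nice", "man"]

# Abbreviation treated as a sentence-end marker: a spurious split before "Smith".
print(list(split_on_markers(tokens, {".", "S.A."})))

# Abbreviation not treated as a marker: the name stays in one sentence.
print(list(split_on_markers(tokens, {"."})))
```

Capitalization of the next token does not help here, since a surname is capitalized just like a sentence start, which is why applying the hack to every abbreviation trades missed splits for false ones.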
How can I make it understand that ... can mark the end of a sentence? Only single characters can be added in the SentenceEnd list, right? In that case, a possible solution would be preprocessing: rewriting ... to …. Is that the only alternative?
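For reference, the preprocessing alternative asked about here could be sketched as below. `normalize_ellipsis` is a hypothetical helper, and collapsing runs longer than three periods into one ellipsis character is an assumption. As noted elsewhere in this thread, the rewrite turned out not to be required, because the three-period sequence can be added as a marker in splitter.dat directly.

```python
import re


def normalize_ellipsis(text):
    """Replace each run of three or more periods with a single '…' (U+2026)."""
    return re.sub(r"\.{3,}", "…", text)


print(normalize_ellipsis("Wait... what happened?"))
```

Single periods and abbreviations such as S.A. are left untouched, since the pattern only matches three or more consecutive periods.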