HTML Ampersand Character Codes

LanguageMachines / ucto

Unicode tokeniser. Ucto tokenizes text files: it separates words from punctuation and splits sentences. It also offers several other basic preprocessing steps, such as changing case, all of which you can use to make your text suitable for further processing such as indexing, part-of-speech tagging, or machine translation. Ucto comes with tokenisation rules for several languages and can be easily extended to suit other languages. It has been incorporated for tokenizing Dutch text in Frog, our Dutch morpho-syntactic processor. http://ilk.uvt.nl/ucto

https://languagemachines.github.io/ucto

GNU General Public License v3.0


HTML Ampersand Character Codes #57

Open Irishx opened 6 years ago

Irishx commented 6 years ago

I notice that raw texts are very often taken from the web and contain strings such as &amp; or &quot;.

These strings are not recognized by ucto and are split into multiple tokens: & amp ;

Obviously the user is responsible for clean text input but these HTML codes are easily overlooked in large quantities of data. We could:

add a rule to recognize HTML codes and keep them as 1 token
add a rule to recognize HTML codes and replace them with the actual character they represent
perhaps only give a warning --this text contains HTML codes-- ?
just close this issue without any changes ;-)

kosloot commented 6 years ago

Hmm, interesting, and probably useful. I think the first step would be to create a rule to detect these 'amp-encoded' entities. A next step might be to translate them; see for instance here

Note: we should handle both &eacute; (é) and &szlig; (ß)

kosloot commented 6 years ago

Hmm, maybe such a rule isn't that easy. &Eacute;&eacute;n (de facto: Één) may very well be captured, but still: the assigned class would NOT be WORD, I suppose. That might be confusing, as Één IS tokenized as a word. But I think it is worth experimenting...

proycon commented 5 years ago

These are called XML/HTML (character) entities technically. It might indeed be a nice feature to have ucto detect and substitute these, though not of the highest priority I'd say. Best make it opt-in with perhaps an automatic detection and suggestion to enable it?

Note: we should handle both &eacute; (é) and &szlig; (ß)

Yes, and both hexadecimal and decimal representations.
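To make the three forms concrete, here is a minimal Python sketch of the detection step (ucto itself is C++, so this is only an illustration, not ucto code). The pattern and the function names `find_entities` and `entity_warning` are hypothetical. The pattern matches anything shaped like a named, decimal, or hexadecimal character reference, without checking that a named entity actually exists, so a string like `AT&T;` would be a false positive.

```python
import re

# Hypothetical pattern (not part of ucto). It matches three shapes:
#   named:        &eacute;
#   decimal:      &#233;
#   hexadecimal:  &#xE9;
ENTITY_RE = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);")


def find_entities(text):
    """Return every entity-shaped substring in text, in order of appearance."""
    return ENTITY_RE.findall(text)


def entity_warning(text):
    """Return a warning string if text looks like it contains entities, else None."""
    found = find_entities(text)
    if not found:
        return None
    return "input contains %d HTML entities, e.g. %s" % (len(found), found[0])
```

A check like this could back the "detect and suggest enabling it" idea: run it over the input, and only print a hint when `entity_warning` returns something.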

kosloot commented 5 years ago

Well, solving this with a ucto configuration rule is almost (?) impossible: HTML entities may occur anywhere, not just in WORDS.

The only feasible solution seems to be to implement this as a filter in ucto that replaces all kinds of entities with their UTF-8 equivalents. I will label this as an enhancement.
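Until such a filter exists, the same effect can be approximated by preprocessing the text before it reaches ucto. The following is a minimal Python sketch under that assumption; `unescape_lines` is a hypothetical helper name, and it relies on the standard library's `html.unescape`, which handles named, decimal, and hexadecimal references. Note that `html.unescape` follows HTML5 rules and will also convert some references that lack the trailing semicolon, which may or may not be what you want.

```python
import html


def unescape_lines(lines):
    """Yield each input line with HTML character references replaced
    by the characters they stand for (hypothetical pre-ucto filter)."""
    for line in lines:
        yield html.unescape(line)


if __name__ == "__main__":
    sample = ["Tom &amp; Jerry", "&Eacute;&eacute;n", "Stra&#223;e / Stra&#xDF;e"]
    for cleaned in unescape_lines(sample):
        print(cleaned)
```

Because the replacement happens before tokenisation, a word like Één is seen as an ordinary word and `&` as a single symbol, rather than being split into `&`, `amp`, and `;`.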