charlie-map / wiki-suggestor-service

A C backend that makes suggestions for the Wikiread extension

Fix Small Tokenization Issue Where the Diamond Question Mark Appears #7

Closed charlie-map closed 2 years ago

charlie-map commented 2 years ago

Currently, some results that contain certain characters, such as the en dash (&#8211; in HTML numeric form), come back with a diamond question mark in place of that character, for example:

Thomas Edward Watson (September 5, 1856 � September 26, 1922) was an American politician

More research is needed to find out where this happens, and whether it requires a separate lookup table that records what each character should be.

charlie-map commented 2 years ago

See #8 for a partial solution. The full solution would be to go through the page using a text file that maps each HTML entity to its character, something like:

   
&amp; &

and then goes through the document and replaces each occurrence. Some testing will be needed to see how practical this is, but in practice the replacement only has to happen when the result is being sent to the user, and it must be ensured there are no .
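The table-driven replacement described above could look roughly like the following. This is only a minimal sketch, not code from the repository: the names `entity_map`, `ENTITY_TABLE`, and `replace_entities` are made up for illustration, and the table is hard-coded here instead of being loaded from a text file.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical lookup table: HTML entity -> UTF-8 replacement.
 * In the approach described above, these pairs would come from a text file. */
struct entity_map {
    const char *entity;
    const char *replacement;
};

static const struct entity_map ENTITY_TABLE[] = {
    { "&amp;",   "&" },
    { "&#8211;", "\xE2\x80\x93" }, /* en dash, U+2013 */
    { "&#8212;", "\xE2\x80\x94" }, /* em dash, U+2014 */
};

/* Copy `in` to `out`, replacing every entity found in ENTITY_TABLE.
 * Returns the number of bytes written (excluding the terminating NUL),
 * or (size_t)-1 if `out` is too small to hold the result. */
size_t replace_entities(const char *in, char *out, size_t out_size)
{
    assert(in != NULL && out != NULL);

    size_t n_entries = sizeof ENTITY_TABLE / sizeof ENTITY_TABLE[0];
    size_t written = 0;

    while (*in != '\0') {
        const char *piece = in; /* default: copy one byte unchanged */
        size_t piece_len = 1;
        size_t consumed = 1;

        if (*in == '&') {
            for (size_t i = 0; i < n_entries; i++) {
                size_t elen = strlen(ENTITY_TABLE[i].entity);
                if (strncmp(in, ENTITY_TABLE[i].entity, elen) == 0) {
                    piece = ENTITY_TABLE[i].replacement;
                    piece_len = strlen(piece);
                    consumed = elen;
                    break;
                }
            }
        }

        /* Always leave room for the terminating NUL. */
        if (written + piece_len >= out_size)
            return (size_t)-1;

        memcpy(out + written, piece, piece_len);
        written += piece_len;
        in += consumed;
    }

    if (written >= out_size)
        return (size_t)-1;
    out[written] = '\0';
    return written;
}
```

Calling `replace_entities("1856 &#8211; 1922", buf, sizeof buf)` would leave the UTF-8 bytes for an en dash in `buf`. Since the replacement only matters for text sent to the user, a function like this could run once on the outgoing result and leave the tokenizer untouched.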

charlie-map commented 2 years ago

Since this is unrelated to the tokenization process and is instead related to the ASCII encoding when printing the values to the Linux terminal, this issue is not actually needed.
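One way to confirm that the stored text is intact regardless of how the terminal renders it is to look at the raw bytes. The helper below is a hypothetical sketch (the name `hex_dump` is not from the repository) that formats a string as hexadecimal. An en dash encoded as UTF-8 is the three bytes E2 80 93, while in Windows-1252 it is the single byte 96, which a UTF-8 terminal cannot decode. Which of these cases applied here is not stated in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Write each byte of `s` into `out` as two uppercase hex digits,
 * separated by single spaces. Returns the number of characters written
 * (excluding the terminating NUL), or -1 if `out` is too small. */
int hex_dump(const char *s, char *out, size_t out_size)
{
    assert(s != NULL && out != NULL);

    if (out_size == 0)
        return -1;
    out[0] = '\0';

    size_t written = 0;
    for (size_t i = 0; s[i] != '\0'; i++) {
        int n = snprintf(out + written, out_size - written, "%s%02X",
                         i > 0 ? " " : "",
                         (unsigned int)(unsigned char)s[i]);
        /* snprintf reports the length it needed; treat truncation as failure. */
        if (n < 0 || (size_t)n >= out_size - written)
            return -1;
        written += (size_t)n;
    }
    return (int)written;
}
```

Dumping the character between the two dates in the example would show whether the bytes are a valid UTF-8 sequence that the terminal simply failed to display, or a byte from a different encoding.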