Fix for handling non-Latin characters

qurbat commented 2 years ago

This change introduces support for search results containing non-Latin characters as part of the URL or description.

This is done by passing the final_string variable through html.unescape() at the last print call, instead of printing it directly.
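A minimal sketch of the idea, not the actual diff: `final_string` is the variable named above, but the sample value is made up for illustration.

```python
import html

# Hypothetical sample value; in degoogle the string is built from the search results.
final_string = "Talk:&#10239; - Wiktionary"

# Decode HTML character references before printing instead of printing the raw string.
decoded = html.unescape(final_string)
print(decoded)
```

`&#10239;` is the decimal character reference for U+27FF (⟿), so the decoded string contains the arrow itself rather than the escape sequence.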

qurbat commented 2 years ago

@deepseagirl could you merge this after review?

qurbat commented 2 years ago

@deepseagirl hi, just sending a ping on this. thanks!

qurbat commented 2 years ago

@deepseagirl Can we close this?

deepseagirl commented 2 years ago

hi, thanks. this is a good improvement :) i moved the unescape so it only applies to the result descriptions, and added a flag to toggle the behavior on/off

new default will be to decode character references:

$ python3 degoogle.py "intitle:⟿ inurl:⟿"
-- 9 results --

TranslingualEdit - Wiktionary
https://en.wiktionary.org/wiki/%E2%9F%BF

Talk:⟿ - Wiktionary
https://en.wiktionary.org/wiki/Talk:%E2%9F%BF

flag to turn decoding off:

$ python3 degoogle.py -d "intitle:⟿ inurl:⟿"
-- 9 results --

TranslingualEdit - Wiktionary
https://en.wiktionary.org/wiki/%E2%9F%BF

Talk:&#10239; - Wiktionary
https://en.wiktionary.org/wiki/Talk:%E2%9F%BF
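
A rough sketch of how a toggle like this could be wired up with argparse. Only the `-d` flag and the use of html.unescape come from this thread; `format_description`, `build_parser`, and the `no_decode` name are assumptions for illustration and may not match the merged code.

```python
import argparse
import html


def format_description(description: str, decode: bool = True) -> str:
    """Decode HTML character references in a description unless decoding is off."""
    return html.unescape(description) if decode else description


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; the real degoogle CLI has its own options.
    parser = argparse.ArgumentParser(description="sketch of a degoogle-style CLI")
    parser.add_argument("query")
    # -d is the flag shown above; the dest name and help text are assumed.
    parser.add_argument("-d", dest="no_decode", action="store_true",
                        help="do not decode HTML character references")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["-d", "intitle:⟿ inurl:⟿"])
print(format_description("Talk:&#10239; - Wiktionary", decode=not args.no_decode))
```

With `-d` present the description is printed untouched; without it, decoding is the default.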

the html.unescape python doc links to this list of named character references which seemed handy. i didn't realize char references were such an in depth thing until now. if you're interested here is that link https://html.spec.whatwg.org/multipage/named-characters.html#named-character-references
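
For anyone curious, a small illustration of named versus numeric references using only the Python standard library. The sample strings are made up; `html.entities.html5` is the stdlib mapping of the named references defined by that spec.

```python
import html
from html.entities import html5

# Named references are looked up by name; numeric ones by code point.
named = html.unescape("&rarr; &hearts; &amp;")
numeric = html.unescape("&#8594; &#x2665; &#38;")
print(named)
print(numeric)

# The stdlib mapping keys include the trailing semicolon.
print(html5["rarr;"] == "\u2192")
```

Both forms decode to the same characters, which is why a single html.unescape call covers whatever mix the search results happen to contain.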

deepseagirl commented 2 years ago

i'll finalize this when i have a few more mins. should be soon now that it's this far along. thanks again

qurbat commented 2 years ago

@deepseagirl no worries, and I realize you were not able to access a computer earlier, so it is no problem. the new changes look great! thank you & tc =)

qurbat commented 2 years ago

@deepseagirl can we close?

deepseagirl / degoogle

Fix for handling non-Latin characters #7