Howdju / howdju

Monorepo for the Howdju crowdsourced fact checking and summarization platform
https://www.howdju.com
GNU Affero General Public License v3.0

Ensure that anchors work with multibyte Unicode #441

Open carlgieringer opened 1 year ago

carlgieringer commented 1 year ago

Quote: Can the legacy press just wish RFK Jr. away? I don’t think so. And I think it’s absolutely foolish to try.

From encodeURIComponent: https://www.thefp.com/p/rfk-jr-is-striking-a-nerve-with-democrats#:~:text=Can%20the%20legacy%20press%20just%20wish%20RFK%20Jr.%20away%3F%20I%20don%5Cu2019t%20think%20so.%20And%20I%20think%20it%5Cu2019s%20absolutely%20foolish%20to%20try.

From Chrome: https://www.thefp.com/p/rfk-jr-is-striking-a-nerve-with-democrats#:~:text=Can%20the%20legacy%20press%20just%20wish%20RFK%20Jr.%20away%3F%20I%20don%E2%80%99t%20think%20so.%20And%20I%20think%20it%E2%80%99s%20absolutely%20foolish%20to%20try.
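The two URLs differ only in how the curly quote is encoded: `%5Cu2019` decodes to the six literal characters `\u2019`, while `%E2%80%99` is the UTF-8 encoding of the character U+2019 itself. A minimal sketch of the difference follows. It assumes (this is a guess about the cause, not something confirmed in this issue) that the quote text had been escaped to a literal `\u2019` sequence before it reached `encodeURIComponent`. The variable names are illustrative only.

```typescript
// The right single quotation mark (U+2019) as an actual character.
const curlyQuote = "\u2019";

// The six literal characters: backslash, u, 2, 0, 1, 9.
// Assumption: the input behind the first URL contained this escaped text.
const escapedText = "\\u2019";

const encodedCharacter = encodeURIComponent(curlyQuote);
const encodedEscape = encodeURIComponent(escapedText);

console.log(encodedCharacter); // %E2%80%99 (matches the Chrome URL)
console.log(encodedEscape); // %5Cu2019 (matches the encodeURIComponent URL)
```

If that assumption holds, the mismatch would come from the string handed to `encodeURIComponent` rather than from `encodeURIComponent` itself.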

carlgieringer commented 1 year ago

I think the initial issue was an occurrence of #437 because this link (containing a curly quote and also at the beginning of the page) works: https://www.thefp.com/p/douglas-murray-elizabeth-barrett-browning#:~:text=Elizabeth%20Barrett%20Browning%E2%80%99s%20love%20poem%20is%20famous.%20But%20her%20own%20love%20story%20is%20just%20as%20legendary.

This behavior might also arise from incomplete Unicode support in dom-anchor-text-quote (inherited transitively from diff-match-patch; see https://github.com/google/diff-match-patch/pull/80). This issue will now focus on the Unicode problem.

carlgieringer commented 1 year ago

A casual check with emoji (https://en.wikipedia.org/wiki/Emoji#:~:text=In%202015%2C%20Oxford%20Dictionaries%20named%20the%20Face%20with%20Tears%20of%20Joy%20emoji%20(%F0%9F%98%82)%20the%20word%20of%20the%20year.%5B10%5D%5B11%5D) indicates that it works. It might be that the laughing emoji (😂) has no surrogate pairs, in which case this is not a good test. Since our initial primary focus is on English-language content, I am going to defer this issue until it arises naturally.
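Whether a test string involves surrogate pairs can be checked by inspecting its UTF-16 code units directly. The sketch below uses a hypothetical helper, `hasSurrogatePair`, which is not part of Howdju or dom-anchor-text-quote. The sample strings are decoded from the percent-encoded sequences in the links in this thread.

```typescript
// Hypothetical helper: reports whether the string contains a UTF-16
// high surrogate immediately followed by a low surrogate.
function hasSurrogatePair(text: string): boolean {
  for (let i = 0; i < text.length - 1; i++) {
    const high = text.charCodeAt(i);
    const low = text.charCodeAt(i + 1);
    const isHigh = high >= 0xd800 && high <= 0xdbff;
    const isLow = low >= 0xdc00 && low <= 0xdfff;
    if (isHigh && isLow) {
      return true;
    }
  }
  return false;
}

// Percent-encoded sequences copied from the links in this issue.
const emoji = decodeURIComponent("%F0%9F%98%82"); // U+1F602
const curlyQuoteChar = decodeURIComponent("%E2%80%99"); // U+2019
const chinese = decodeURIComponent("%E4%B8%AD%E6%96%87"); // U+4E2D U+6587

console.log(emoji.length, hasSurrogatePair(emoji)); // 2 true
console.log(curlyQuoteChar.length, hasSurrogatePair(curlyQuoteChar)); // 1 false
console.log(chinese.length, hasSurrogatePair(chinese)); // 2 false
```

Under JavaScript string semantics, U+1F602 lies outside the Basic Multilingual Plane and so occupies two code units, while the curly quote and the Chinese characters each occupy one. By this measure the emoji link may be the only one of the three that exercises surrogate-pair handling.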

I also tried to create an anchor containing Chinese and it worked as well: https://en.wikipedia.org/wiki/Chinese_Wikipedia#:~:text=The%20Chinese%20Wikipedia%20(traditional%20Chinese%3A%20%E4%B8%AD%E6%96%87%E7%B6%AD%E5%9F%BA%E7%99%BE%E7%A7%91%3B%20simplified%20Chinese%3A%20%E4%B8%AD%E6%96%87%E7%BB%B4%E5%9F%BA%E7%99%BE%E7%A7%91%3B%20pinyin%3A%20Zh%C5%8Dngw%C3%A9n%20W%C3%A9ij%C4%AB%20B%C7%8Eik%C4%93)%20is%20the%20written%20vernacular%20Chinese%20(a%20form%20of%20Mandarin%20Chinese)%20edition%20of%20Wikipedia.
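For reproducing links like these in a test, a text-fragment URL can be built by percent-encoding the quote as UTF-8 and appending it after `#:~:text=`. The following is only a sketch under stated assumptions: `toTextFragmentUrl` is a hypothetical name, not Howdju's actual anchor code, and it handles only the simple `text=` form without prefix, suffix, or range terms. It also escapes `-`, which `encodeURIComponent` leaves alone but which the text-fragment syntax reserves.

```typescript
// Hypothetical helper: builds a simple text-fragment URL for a quote.
function toTextFragmentUrl(pageUrl: string, quote: string): string {
  // encodeURIComponent emits UTF-8 percent-encoding and already escapes
  // "&" and ","; "-" needs escaping separately for text directives.
  const encodedQuote = encodeURIComponent(quote).replace(/-/g, "%2D");
  return pageUrl + "#:~:text=" + encodedQuote;
}

const fragmentUrl = toTextFragmentUrl(
  "https://en.wikipedia.org/wiki/Chinese_Wikipedia",
  "\u4e2d\u6587" // the characters U+4E2D U+6587
);

console.log(fragmentUrl);
// https://en.wikipedia.org/wiki/Chinese_Wikipedia#:~:text=%E4%B8%AD%E6%96%87
```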