InvisiblePlatform / rosetta

Invisible Voice Version 3.0 The Reckoning

Fix the websites that are "something.website.com" #52

Closed MorkForid closed 2 years ago

ixt commented 2 years ago

Some local domains still aren't being picked up either. I think we should have a "best guess" mode that takes liberties with the site matching, i.e.:

  1. remove tld .com, .ninja, .sg etc.
  2. remove any subdomains (what is mentioned as "something" in issue title)
  3. regex match any entries that begin with the result of above

So consider https://smile.amazon.co.uk:

  1. smile.amazon
  2. amazon
  3. amazoncom
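
The three steps above could be sketched roughly as follows. This is only an illustration, not the project's actual matching code: `KNOWN_SUFFIXES`, `best_guess_label`, and `best_guess_matches` are hypothetical names, and the tiny suffix tuple stands in for a maintained source such as the Public Suffix List.

```python
import re
from urllib.parse import urlparse

# Hypothetical, deliberately tiny suffix list for illustration only;
# a real one would come from a maintained source.
KNOWN_SUFFIXES = ("co.uk", "com", "ninja", "sg")


def best_guess_label(url):
    """Steps 1 and 2: strip a known suffix, then drop any subdomains."""
    host = urlparse(url).hostname or ""
    # Step 1: remove the longest known suffix that matches.
    for suffix in sorted(KNOWN_SUFFIXES, key=len, reverse=True):
        if host.endswith("." + suffix):
            host = host[: -(len(suffix) + 1)]
            break
    # Step 2: keep only the right-most remaining label.
    return host.rsplit(".", 1)[-1]


def best_guess_matches(url, entries):
    """Step 3: return the entries that begin with the guessed label."""
    label = best_guess_label(url)
    if not label:
        return []
    pattern = re.compile("^" + re.escape(label))
    return [entry for entry in entries if pattern.match(entry)]
```

With these assumptions, `best_guess_matches("https://smile.amazon.co.uk", ["amazoncom", "google"])` would keep only `"amazoncom"`. Escaping the label before building the regex avoids treating characters in a hostname as pattern syntax.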

The only issue will be with gTLDs owned by the companies themselves. Consider https://about.google:

  1. about
  2. about
  3. aboutcom

We could apply this only to non-privately owned gTLDs and ignore the privately owned ones. There is a proper name for this sort of gTLD, and I believe ICANN publishes a ground-truth list.
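
A minimal sketch of that check, assuming the list of privately owned gTLDs is available as a set. `BRAND_GTLDS` here is a hand-written two-entry sample, not the ICANN list, and `uses_brand_gtld` is a hypothetical helper name.

```python
from urllib.parse import urlparse

# Hypothetical hand-written sample; a real set would be loaded from a
# published list rather than hard-coded.
BRAND_GTLDS = {"google", "amazon"}


def uses_brand_gtld(url):
    """Return True when the host's final label is a company-owned gTLD."""
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    return tld in BRAND_GTLDS
```

A "best guess" pass could call this first and skip the TLD-stripping step whenever it returns True, so that https://about.google is not reduced to "about".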

Ideally, we just need better matching between sites and data. It is still a bit crummy; it would be good to build a separate mapping instead of the simple mapping we do right now. But I can't intuit a good solution yet.

ixt commented 2 years ago

Closing in favour of #64, as it should solve this problem naturally. Some exceptions may arise because DuckDuckGo's data is somewhat small, but in theory it should cover the vast majority of cases. My ballpark is 98% coverage using the combination of EFF and DDG data.