Open rviscomi opened 4 years ago
Some tips for a good autocomplete UX: https://blog.algolia.com/search-autocomplete-on-mobile/
Two solutions come to mind:

| Column name | Index type |
|---|---|
| id | primary key |
| prefix | index |
| origin | - |

To insert an origin like https://mail.google.com, you should insert the records below:

| prefix | origin |
|---|---|
| mail.google.com | https://mail.google.com |
| google.com | https://mail.google.com |
| https://mail.google.com | https://mail.google.com |

To search xxx: `select * from table where prefix >= "xxx" and prefix < "xxx{"` (`{` is the ASCII character right after `z`, so the range covers every prefix that starts with xxx).
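The table-based approach above can be sketched with Python's built-in `sqlite3` module. This is only a minimal illustration, not a recommendation of SQLite for production: the table name `origins`, the helper names `prefix_keys`, `build_db` and `search`, and the sample origins are all made up for the example. It also splits host names naively on dots, so a real version would probably need the Public Suffix List to avoid indexing keys like `co.jp`.

```python
import sqlite3
from urllib.parse import urlsplit


def prefix_keys(origin):
    """Return the full origin plus every host suffix with at least two labels."""
    host = urlsplit(origin).hostname or ""
    labels = host.split(".")
    keys = [origin]
    # mail.google.com -> "mail.google.com", "google.com" (the bare TLD is skipped).
    for i in range(len(labels) - 1):
        keys.append(".".join(labels[i:]))
    return keys


def build_db(origins):
    """Create an in-memory table with an index on the prefix column."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE origins (id INTEGER PRIMARY KEY, prefix TEXT, origin TEXT)"
    )
    db.execute("CREATE INDEX idx_prefix ON origins (prefix)")
    for origin in origins:
        db.executemany(
            "INSERT INTO origins (prefix, origin) VALUES (?, ?)",
            [(key, origin) for key in prefix_keys(origin)],
        )
    return db


def search(db, query, limit=10):
    """Range scan on the index: every prefix in [query, query + '{')."""
    rows = db.execute(
        "SELECT DISTINCT origin FROM origins "
        "WHERE prefix >= ? AND prefix < ? ORDER BY origin LIMIT ?",
        (query, query + "{", limit),
    )
    return [row[0] for row in rows]


db = build_db([
    "https://mail.google.com",
    "https://www.google.com",
    "https://mail.yahoo.com",
    "https://example.com",
])
print(search(db, "google"))
print(search(db, "https://ex"))
```

The upper bound `query + "{"` relies on SQLite's default byte-wise text comparison and on origins containing only lowercase ASCII letters, digits, and punctuation that sorts below `{`.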
Query: to search xxx, search the above 3 keys one by one using zrangebyscore "keyx" "xxx" "xxx{" (https://redis.io/commands/zrangebyscore/).
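The same `"xxx"` to `"xxx{"` range trick works on any collection kept in sorted order. Below is a pure-Python stand-in that uses `bisect` on a sorted list instead of talking to Redis, so it only illustrates the lookup and not the Redis setup. The names `build_index` and `prefix_range` and the sample pairs are hypothetical. As far as I know, Redis serves lexicographic ranges of this kind through `ZRANGEBYLEX` on members that share one score, whereas `ZRANGEBYSCORE` compares numeric scores, so that is worth checking before building on it.

```python
from bisect import bisect_left


def build_index(pairs):
    """pairs: iterable of (key, origin) tuples. Returns them sorted by key."""
    return sorted(pairs)


def prefix_range(index, query, limit=10):
    """Return distinct origins whose key starts with query, in key order."""
    # A 1-tuple sorts before any 2-tuple that shares its first element,
    # so these two lookups bracket every key in [query, query + "{").
    lo = bisect_left(index, (query,))
    hi = bisect_left(index, (query + "{",))
    seen = []
    for _key, origin in index[lo:hi]:
        if origin not in seen:
            seen.append(origin)
        if len(seen) == limit:
            break
    return seen


index = build_index([
    ("mail.google.com", "https://mail.google.com"),
    ("google.com", "https://mail.google.com"),
    ("https://mail.google.com", "https://mail.google.com"),
    ("mail.yahoo.com", "https://mail.yahoo.com"),
    ("yahoo.com", "https://mail.yahoo.com"),
    ("https://mail.yahoo.com", "https://mail.yahoo.com"),
])
print(prefix_range(index, "mail"))
```

Each lookup costs two binary searches plus the size of the result slice, which is why a sorted structure suits autocomplete better than a full scan.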
As of now, there are 18,352,960 distinct origins in the CrUX BigQuery dataset since October 2017. That's a lot of websites, but clearly not all of the websites out in the wild. One of the common problems I see from CrUX users is that they're not sure if their websites' origins exist in the dataset.
I'd like to design a tool to help CrUX users quickly and easily discover origins in the dataset and make it clear when a particular origin is not found. Here's how I envision the UX working:
I imagine the backend will work by using a fast, in-memory storage solution for the ~18M origins. In total the size of the data is ~500+MB. However, if we build more advanced/faster search functionality (e.g. n-grams), it might require more storage space. An autosuggest endpoint will take the user's input, scan the origin list, and return matches via JSON. The list of origins can be populated monthly by mirroring the `chrome-ux-report:materialized.origin_summary` table on BigQuery.

Finding matches is the magic part. If a user types `google` it should return origins whose domain name (eTLD+1) starts with `google`, like `https://www.google.com` or `https://mail.google.com` or `https://www.google.co.jp`. It should also return matches whose host names (eTLD+2) are prefixed by the query, for example `mail` should return `https://mail.google.com` or `https://mail.yahoo.com`. Searches starting with the protocol (`http[s]://`) should only match origins prefixed with that input, like `https://example.com` matching a search for `https://ex`. I think this can be simplified to a regular expression where the user input is preceded by a boundary character `\b`, but the backend might need to tokenize origins into host names and domain names instead for performance. My goal is for the median autosuggestion to complete buttery smooth in under 100ms from the user's keyup to suggestion rendered.

For demonstration purposes, a really naive implementation would be for the backend to query the BigQuery `origin_summary` table directly:

This query processes 505 MB in 4.7 seconds, obviously not fast enough for a production-ready solution, but just showing a simple approach.
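To make the `\b` idea concrete, here is a rough sketch of the regular-expression matching as a linear scan over an in-memory list. It is not the BigQuery query mentioned above. The function name `suggest` and the sample origins are invented for illustration, and a full scan like this is unlikely to hit the 100ms target at ~18M origins without some kind of index.

```python
import re


def suggest(origins, query, limit=10):
    """Return up to `limit` origins where the query starts at a word boundary."""
    # re.escape keeps characters such as "." and "/" literal in the pattern.
    pattern = re.compile(r"\b" + re.escape(query.lower()))
    matches = []
    for origin in origins:
        if pattern.search(origin):
            matches.append(origin)
            if len(matches) == limit:
                break
    return matches


origins = [
    "https://www.google.com",
    "https://mail.google.com",
    "https://www.google.co.jp",
    "https://mail.yahoo.com",
    "https://example.com",
    "https://notgoogle.com",
]
print(suggest(origins, "google"))
print(suggest(origins, "https://ex"))
```

One caveat with `\b`: it also treats `-` as a boundary, so a query like `site` would match `https://my-site.com`, and a query like `com` would match the TLD of most origins. Tokenizing origins into host names and domain names avoids both.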
Any technology recommendations for the backend of the app?