Open GPropersi opened 1 month ago
I didn't think we were storing URLs in the way the user inputs them. In the past, I have typed "gmail.com" as a test URL. Upon successful POST, the returned string for display was "mail.google.com". I had assumed there was no distinction made on the backend; order of operations being: 1. User submits URL string, 2. Backend normalizes/validates string, 3. URL is stored in DB
You are 90% right - we are not storing URLs the way the user inputs them, UNLESS they input them in exactly the same form as the normalized URL output.
Our backend normalizes google.com to https://www.google.com/. If the user adds the first (google.com) OR the second one (https://www.google.com/), it will still only be saved as https://www.google.com/.
Validation is done by performing an HTTP request against the normalized URL, and waiting for success. This could be multiple HTTP requests, since sometimes we need to rotate HTTP headers in order to get a proper response.
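To make the header rotation concrete, here is a minimal sketch of that retry loop. It is not our actual implementation: `validate_url`, `HEADER_SETS`, and the injected `fetch` callable are hypothetical names, and the header values are placeholders.

```python
from typing import Callable, Iterable, Mapping, Optional

# Hypothetical header sets to rotate through; the real backend's headers
# are not shown in this thread.
HEADER_SETS = (
    {"User-Agent": "Mozilla/5.0"},
    {"User-Agent": "curl/8.0", "Accept": "*/*"},
)


def validate_url(
    normalized_url: str,
    fetch: Callable[[str, Mapping[str, str]], int],
    header_sets: Iterable[Mapping[str, str]] = HEADER_SETS,
) -> Optional[int]:
    """Return how many requests it took to get a 2xx response, or None.

    `fetch` is an injected callable (assumed here) that performs one HTTP
    request with the given headers and returns the status code.
    """
    for attempt, headers in enumerate(header_sets, start=1):
        status = fetch(normalized_url, headers)
        if 200 <= status < 300:
            return attempt  # validated after `attempt` network calls
    return None  # every header set was rejected
```

Each extra header set is one more network round trip, which is where the latency comes from.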
When the user types in google.com, the whole Urls table is searched for google.com, even if the Urls table already contains https://www.google.com/, which is what it would be normalized to. Only after normalizing does the backend discover that https://www.google.com/ is already stored in the database.
I'm proposing that, since we now know what the unnormalized URL (google.com) points towards (https://www.google.com/), we can effectively create a key-value sort of table, associating unnormalized URLs with normalized URLs, and avoiding re-validation of URLs such as google.com every time they are typed in. This way, if the user types in google.com, we would search this key-value table without having to perform the validation (since we already validated it).
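As a rough in-memory illustration of the key-value idea (a sketch only; `AliasCache` and the injected `normalize_and_validate` callable are hypothetical stand-ins for whatever the backend actually does):

```python
from typing import Callable, Dict


class AliasCache:
    """Maps raw user input to the normalized URL it was validated as."""

    def __init__(self, normalize_and_validate: Callable[[str], str]) -> None:
        # Assumed to be the expensive step (CPU plus network requests).
        self._normalize_and_validate = normalize_and_validate
        self._aliases: Dict[str, str] = {}
        self.validation_calls = 0

    def resolve(self, user_input: str) -> str:
        key = user_input.strip()
        if key in self._aliases:
            # Already validated once; skip normalization entirely.
            return self._aliases[key]
        self.validation_calls += 1
        normalized = self._normalize_and_validate(key)
        self._aliases[key] = normalized
        # The normalized form also resolves to itself.
        self._aliases[normalized] = normalized
        return normalized
```

With this, typing google.com a second time is a dictionary lookup rather than a round of HTTP requests.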
A minor optimization to make - the user can type in multiple variations of google.com to reach Google's homepage, such as the following:
But in our database, after normalizing it, we store the URL as:
https://www.google.com/
The process of normalizing requires both CPU and I/O over the network. If, in another UTub, the user wants to add google.com, we search the Urls table for google.com, see it's not in there, and have to perform the whole URL normalization process again. This can increase latency in the response to the user if they don't type in the URL exactly how we store it. The network request in particular can be the most time-consuming part, as we sometimes need to retry with different HTTP headers, leading to multiple network calls.
I'd like to propose an additional table in our database that stores all associated variations of a URL, and what they point towards. This could be called UrlAliases - we would search this table first for a user's input, and then the main Urls table second, before performing the URL normalization process.
In Urls we could have:
And in UrlAliases we could have:
Which all point towards the same entry in Urls.
So now, if in another UTub, a user added google.com, we'd find it in the UrlAliases table, and not have to perform the network request to verify the URL.
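A possible shape for this, sketched with sqlite3 purely for illustration. The column names (`url_string`, `alias`, `url_id`), the `find_or_create_url` helper, and the injected `normalize` callable are all assumptions, not our actual schema or code:

```python
import sqlite3
from typing import Callable, Tuple

# Hypothetical schema: each alias row points at exactly one Urls row.
SCHEMA = """
CREATE TABLE Urls (
    id INTEGER PRIMARY KEY,
    url_string TEXT NOT NULL UNIQUE
);
CREATE TABLE UrlAliases (
    alias TEXT PRIMARY KEY,
    url_id INTEGER NOT NULL REFERENCES Urls(id)
);
"""


def find_or_create_url(
    conn: sqlite3.Connection,
    user_input: str,
    normalize: Callable[[str], str],
) -> Tuple[int, str]:
    """Return (url_id, source); source is 'alias', 'urls', or 'normalized'."""
    # 1. Search UrlAliases first.
    row = conn.execute(
        "SELECT url_id FROM UrlAliases WHERE alias = ?", (user_input,)
    ).fetchone()
    if row:
        return row[0], "alias"
    # 2. Then the main Urls table.
    row = conn.execute(
        "SELECT id FROM Urls WHERE url_string = ?", (user_input,)
    ).fetchone()
    if row:
        return row[0], "urls"
    # 3. Only now pay for normalization (assumed to include validation).
    normalized = normalize(user_input)
    conn.execute(
        "INSERT OR IGNORE INTO Urls (url_string) VALUES (?)", (normalized,)
    )
    url_id = conn.execute(
        "SELECT id FROM Urls WHERE url_string = ?", (normalized,)
    ).fetchone()[0]
    if user_input != normalized:
        # Remember the raw input so the next lookup stops at step 1.
        conn.execute(
            "INSERT OR IGNORE INTO UrlAliases (alias, url_id) VALUES (?, ?)",
            (user_input, url_id),
        )
    conn.commit()
    return url_id, "normalized"
```

Keeping the alias as the primary key means the first lookup is a single indexed read, and several aliases can share one `url_id`.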