Implement Key-Value Table for URL Aliases

GPropersi commented 1 month ago

A minor optimization to make: the user can type in multiple variations of a URL to reach Google's homepage, such as google.com or www.google.com.

But in our database, after normalizing it, we store the URL as: https://www.google.com/

The process of normalizing requires both CPU time and network I/O. If, in another UTub, the user wants to add google.com, we search the Urls table for google.com, see it's not in there, and have to perform the whole URL normalization process again.

This can increase latency in the response to the user if they don't type in the URL exactly how we store it. The network request in particular can be the most time-consuming step, as we sometimes need to retry with different HTTP headers, leading to multiple network calls.

I'd like to propose an additional table in our database that stores all associated variations of a URL, and what they point towards. This could be called UrlAliases - we would search this table first for a user's input, and then the main Urls table second, before performing the URL normalization process.

For example, in `Urls` we could have:

PK	urlString
1	"https://www.google.com/"

And in UrlAliases we could have:

PK	urlAlias	FK
1	"www.google.com"	1
2	"google.com"	1

Both aliases point towards the same entry in Urls.

So now, if in another UTub, a user added google.com, we'd find it in the UrlAliases table, and not have to perform the network request to verify the URL.
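To make the lookup order concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names follow the example above, but the foreign key column name `urlId`, the helper names `build_example_db` and `find_url_id`, and the choice of SQLite are assumptions for illustration only, not how our backend is actually written.

```python
import sqlite3
from typing import Optional


def build_example_db() -> sqlite3.Connection:
    """Create an in-memory database holding the example rows shown above."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Urls (
            id INTEGER PRIMARY KEY,
            urlString TEXT NOT NULL UNIQUE
        );
        CREATE TABLE UrlAliases (
            id INTEGER PRIMARY KEY,
            urlAlias TEXT NOT NULL UNIQUE,
            urlId INTEGER NOT NULL REFERENCES Urls(id)  -- hypothetical FK name
        );
        INSERT INTO Urls (id, urlString) VALUES (1, 'https://www.google.com/');
        INSERT INTO UrlAliases (id, urlAlias, urlId) VALUES
            (1, 'www.google.com', 1),
            (2, 'google.com', 1);
        """
    )
    return conn


def find_url_id(conn: sqlite3.Connection, user_input: str) -> Optional[int]:
    """Return the Urls primary key for user_input, or None if unknown.

    Searches UrlAliases first, then Urls. A None result means the caller
    still has to run the full normalization and validation process.
    """
    row = conn.execute(
        "SELECT urlId FROM UrlAliases WHERE urlAlias = ?", (user_input,)
    ).fetchone()
    if row is not None:
        return row[0]
    row = conn.execute(
        "SELECT id FROM Urls WHERE urlString = ?", (user_input,)
    ).fetchone()
    return row[0] if row is not None else None
```

With this layout, both `google.com` and `https://www.google.com/` resolve to the same Urls row without any network call, while an input that appears in neither table falls through to normalization.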

rehankalu commented 1 month ago

I didn't think we were storing URLs in the way the user inputs them. In the past, I have typed "gmail.com" as a test URL. Upon successful POST, the returned string for display was "mail.google.com". I had assumed there was no distinction made on the backend; order of operations being: 1. User submits URL string, 2. Backend normalizes/validates string, 3. URL is stored in DB

GPropersi commented 1 month ago

You are 90% right - we are not storing URLs the way the user inputs them, UNLESS they input them in exactly the same form as the normalized URL output.

Our backend normalizes google.com to https://www.google.com/. If the user adds the first (google.com) OR the second one (https://www.google.com/), it will still only be saved as https://www.google.com/

Validation is done by performing an HTTP request against the normalized URL, and waiting for success. This could be multiple HTTP requests, since sometimes we need to rotate HTTP headers in order to get a proper response.
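As a rough illustration of why validation can cost several network calls, here is a sketch of a retry loop that rotates header sets. Everything in it is an assumption made for the example: the function names `fetch_status` and `validate_url`, the use of urllib, and treating any 2xx or 3xx status as success. Our actual validation code may differ on all of these points.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, Mapping


def fetch_status(url: str, headers: Mapping[str, str], timeout: float = 5.0) -> int:
    """Perform one HTTP request and return its status code (0 on connection failure)."""
    request = urllib.request.Request(url, headers=dict(headers))
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 403.
        return err.code
    except OSError:
        # DNS failure, refused connection, timeout, and similar.
        return 0


def validate_url(
    url: str,
    header_sets: Iterable[Mapping[str, str]],
    fetch: Callable[[str, Mapping[str, str]], int] = fetch_status,
) -> bool:
    """Try each header set in turn until one request succeeds.

    Each failed attempt costs one more network round trip, which is the
    latency an alias lookup would let us skip.
    """
    for headers in header_sets:
        status = fetch(url, headers)
        if 200 <= status < 400:  # assumed definition of "success"
            return True
    return False
```

The `fetch` parameter is injectable so the retry logic can be exercised without touching the network.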

When the user types in google.com, the whole Urls table is searched for google.com, even if the Urls table already contains https://www.google.com/, which is what it would be normalized to. Only after normalizing does the backend discover that https://www.google.com/ is already stored in the database.

I'm proposing that, since we now know which normalized URL (https://www.google.com/) an unnormalized URL (google.com) points towards, we can effectively create a key-value sort of table that associates unnormalized URLs with normalized URLs, so that URLs such as google.com are not validated again every time they are typed in. This way, if the user types in google.com, we would search this key-value table without having to perform the validation (since we already validated it).
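A minimal sketch of the write path this implies, again using sqlite3 purely for illustration. The schema, the `urlId` column name, and the `resolve_url` and `normalize_and_validate` names are hypothetical. The idea is that the first time an unnormalized input goes through the slow path, we record it as an alias so later requests skip validation.

```python
import sqlite3
from typing import Callable

SCHEMA = """
CREATE TABLE Urls (
    id INTEGER PRIMARY KEY,
    urlString TEXT NOT NULL UNIQUE
);
CREATE TABLE UrlAliases (
    id INTEGER PRIMARY KEY,
    urlAlias TEXT NOT NULL UNIQUE,
    urlId INTEGER NOT NULL REFERENCES Urls(id)  -- hypothetical FK name
);
"""


def resolve_url(
    conn: sqlite3.Connection,
    user_input: str,
    normalize_and_validate: Callable[[str], str],
) -> int:
    """Return the Urls primary key for user_input, validating only on a miss."""
    # Fast path 1: we have already seen this exact unnormalized input.
    row = conn.execute(
        "SELECT urlId FROM UrlAliases WHERE urlAlias = ?", (user_input,)
    ).fetchone()
    if row is not None:
        return row[0]

    # Fast path 2: the input is already in normalized form.
    row = conn.execute(
        "SELECT id FROM Urls WHERE urlString = ?", (user_input,)
    ).fetchone()
    if row is not None:
        return row[0]

    # Slow path: normalize and validate (CPU plus network), then remember it.
    normalized = normalize_and_validate(user_input)
    conn.execute("INSERT OR IGNORE INTO Urls (urlString) VALUES (?)", (normalized,))
    url_id = conn.execute(
        "SELECT id FROM Urls WHERE urlString = ?", (normalized,)
    ).fetchone()[0]
    if user_input != normalized:
        conn.execute(
            "INSERT OR IGNORE INTO UrlAliases (urlAlias, urlId) VALUES (?, ?)",
            (user_input, url_id),
        )
    conn.commit()
    return url_id
```

One design question this leaves open is invalidation: if a site later changes where it redirects, a stored alias would keep pointing at the old normalized URL until it is refreshed or removed.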

4IRL / urls4irl

Implement Key-Value Table for URL Aliases #202