atyrode / vite

✂️ vite! API shortens your links
http://vite.lol/
The Unlicense

TSK: Implement a database-oriented codec logic #6

Closed atyrode closed 7 months ago

atyrode commented 7 months ago

I need to figure out a proper codec that can effectively:

Thereby creating an effective string shortener.

After delving a bit into information theory and lossless compression algorithms (relevant reads: Lossless compression and Pigeonhole principle), I've decided the most straightforward option is to combine a database with the encoding of each row's unique ID into a string, hence #5.

I'll be trying an implementation of that, but I'm still on the lookout for a more algorithm-based approach.
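A minimal sketch of the database-plus-ID-encoding idea, not the project's actual code. It assumes a 62-symbol alphabet (digits, lowercase, uppercase), a fixed width of 5, and an in-memory SQLite table; the names `encode_id`, `decode_id`, `shorten`, `resolve` and the `links` schema are hypothetical.

```python
import sqlite3
import string
from typing import Optional

# Assumed alphabet: digits, then lowercase, then uppercase (62 symbols).
CHARSET = string.digits + string.ascii_lowercase + string.ascii_uppercase
BASE = len(CHARSET)


def encode_id(row_id: int, width: int = 5) -> str:
    """Encode a non-negative row ID as a left-padded base-62 string."""
    if row_id < 0:
        raise ValueError("row_id must be non-negative")
    digits = []
    while row_id:
        row_id, remainder = divmod(row_id, BASE)
        digits.append(CHARSET[remainder])
    # Pad with the zero symbol so small IDs keep a stable width.
    return "".join(reversed(digits)).rjust(width, CHARSET[0])


def decode_id(code: str) -> int:
    """Inverse of encode_id: read the string as a base-62 number."""
    value = 0
    for char in code:
        value = value * BASE + CHARSET.index(char)
    return value


# Hypothetical schema: one row per long URL, keyed by an auto-increment ID.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE links (id INTEGER PRIMARY KEY, url TEXT NOT NULL)")


def shorten(url: str) -> str:
    """Store the URL and return the encoded ID of its row."""
    cursor = conn.execute("INSERT INTO links (url) VALUES (?)", (url,))
    return encode_id(cursor.lastrowid)


def resolve(code: str) -> Optional[str]:
    """Look up the URL for a code, or return None if no such row exists."""
    row = conn.execute(
        "SELECT url FROM links WHERE id = ?", (decode_id(code),)
    ).fetchone()
    return row[0] if row else None
```

Because nothing is compressed, the pigeonhole limit does not apply: the short string is only a pointer to the stored row, and decoding is a plain primary-key lookup.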

atyrode commented 7 months ago

With the URLCharset class probably done, I'm now working on an obfuscator module for the codec.

atyrode commented 7 months ago

The Obfuscator class takes an instance of the URLCharset class and a passphrase str, and can translate a given input with its transform() and restore() methods. This effectively allows soft obfuscation of the upcoming encoding/decoding, with the aim of avoiding predictability in the URL format.

(The goal is to avoid the first URL being encoded to AAAAA and the second to AAAAB; the passphrase will help scramble the database row's unique ID.)
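A rough sketch of what such a class could look like, assuming the passphrase seeds a deterministic shuffle of the alphabet and each symbol is swapped for its shuffled counterpart. The `URLCharset` here is a hypothetical stand-in that only holds the alphabet, and the internals of `Obfuscator` are an assumption, not the repository's implementation.

```python
import random
import string


class URLCharset:
    """Hypothetical stand-in for the project's class: just holds the alphabet."""

    def __init__(self, chars: str = string.digits + string.ascii_letters):
        self.chars = chars


class Obfuscator:
    """Substitution sketch: each symbol maps to one symbol of a shuffled alphabet."""

    def __init__(self, charset: URLCharset, passphrase: str):
        shuffled = list(charset.chars)
        # Assumption: the passphrase seeds a deterministic shuffle.
        random.Random(passphrase).shuffle(shuffled)
        scrambled = "".join(shuffled)
        self._forward = str.maketrans(charset.chars, scrambled)
        self._backward = str.maketrans(scrambled, charset.chars)

    def transform(self, text: str) -> str:
        """Replace every symbol with its counterpart in the shuffled alphabet."""
        return text.translate(self._forward)

    def restore(self, text: str) -> str:
        """Undo transform() by applying the inverse mapping."""
        return text.translate(self._backward)
```

With `ob = Obfuscator(URLCharset(), "snippy")`, the call `ob.restore(ob.transform("00001"))` gives back `"00001"` without touching the database, since the mapping depends only on the alphabet and the passphrase.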

atyrode commented 7 months ago

Never mind, I just realized the logic I implemented simply translates the encoding but doesn't truly scramble it.

With a codec outputting 1 as 00001 and 2 as 00002:

  1. Obfuscated 1 as ttttu
  2. Obfuscated 2 as ttttv

It is still predictable by design, and I have to look further into how to properly scramble it. My goal is to avoid querying the database (for performance) to unobfuscate it, but this might be unavoidable.

I will implement a test that ensures unpredictability.
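One possible shape for such a test, as a sketch only. It assumes "unpredictable" means that the outputs for neighbouring IDs differ in at least a chosen number of positions; the helper names and the `min_diff` threshold are hypothetical.

```python
def differing_positions(a: str, b: str) -> int:
    """Count the positions at which two equal-length codes differ."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def looks_unpredictable(obfuscate, codes, min_diff: int = 3) -> bool:
    """Assumed criterion: neighbouring outputs differ in at least min_diff positions."""
    outputs = [obfuscate(code) for code in codes]
    return all(
        differing_positions(first, second) >= min_diff
        for first, second in zip(outputs, outputs[1:])
    )


# The plain translation from the example above: 0 -> t, 1 -> u, 2 -> v.
naive = str.maketrans("012", "tuv")
print(looks_unpredictable(lambda code: code.translate(naive), ["00001", "00002"]))  # False
```

The translated pair ttttu and ttttv differs in a single position, so the check fails for the simple translation, which is the behaviour the test is meant to catch.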

atyrode commented 7 months ago

I'm getting improved results: repeated characters (here 0) no longer produce seemingly predictable output. However, consecutive outputs still look too close to one another, so I will attempt a seemingly complete randomization of the whole result.

Charset: 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
Scrambled charset: cTLB9aGtuhAbIUCn4w6zS2mQj0KlV1ip7xrORs8ZyFD5YHNeJ3XkMqvgEdWfoP
Passphrase: snippy
--------------------------------------------------------------------------------
Source: 00001
Obfuscated: dlfZ5
Deobfuscated: 00001
====================
Source: 00002
Obfuscated: dlfZO
Deobfuscated: 00002
====================
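
The output above has the shape of a position-dependent substitution: each position is encoded on its own, so two sources that differ only in their last symbol differ only in their last output symbol. The sketch below illustrates that behaviour under an assumed scheme (one shift per position, derived from a SHA-256 digest of the passphrase). It is not the method used in the repository and does not reproduce the exact strings shown.

```python
import hashlib
import string

CHARSET = string.digits + string.ascii_letters
BASE = len(CHARSET)


def _offsets(passphrase: str, length: int) -> list:
    """Derive one shift per position from the passphrase (assumed scheme)."""
    digest = hashlib.sha256(passphrase.encode("utf-8")).digest()
    return [digest[i % len(digest)] % BASE for i in range(length)]


def transform(text: str, passphrase: str) -> str:
    """Shift each symbol forward by the offset assigned to its position."""
    shifts = _offsets(passphrase, len(text))
    return "".join(
        CHARSET[(CHARSET.index(char) + shift) % BASE]
        for char, shift in zip(text, shifts)
    )


def restore(text: str, passphrase: str) -> str:
    """Shift each symbol back by the same per-position offset."""
    shifts = _offsets(passphrase, len(text))
    return "".join(
        CHARSET[(CHARSET.index(char) - shift) % BASE]
        for char, shift in zip(text, shifts)
    )
```

Under this scheme `transform("00001", "snippy")` and `transform("00002", "snippy")` share their first four symbols, which is why neighbouring IDs stay visibly close even though a run of zeros is no longer guaranteed to map to a run of one repeated symbol.
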
atyrode commented 7 months ago

After thorough inspection of other shortening services, it seems as though the standard practice goes against two of my beliefs:

I've overlooked a fallacy in my reasoning: obfuscating the result to avoid traversability was a humble concern, but it would have stopped working once the database holds every possible link of the current length and has to increase the length of the URL by 1, since at that point every possible combination is valid.

Therefore, I will not pursue further efforts on the obfuscation process and will align the logic with similar products.
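
The capacity argument can be checked with a little arithmetic. This sketch assumes the 62-symbol alphabet and 5-symbol codes used earlier in the thread; `capacity` and `valid_fraction` are hypothetical helper names.

```python
BASE = 62  # assumed alphabet size: digits plus lower- and uppercase letters


def capacity(length: int) -> int:
    """Number of distinct codes of exactly `length` symbols."""
    return BASE ** length


def valid_fraction(rows: int, length: int) -> float:
    """Share of all `length`-symbol strings that resolve to a stored link."""
    return min(rows, capacity(length)) / capacity(length)


print(capacity(5))  # 916132832
```

Once the number of stored rows reaches `capacity(5)`, `valid_fraction` is 1.0: every 5-symbol string resolves to a link, so scrambling the order of the codes no longer hides anything from someone enumerating them.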