coolaj86 commented 2 years ago

I found this code via the article https://medium.com/analytics-vidhya/base-62-text-encoding-decoding-b43921c7a954.

However, this base62 implementation seems to give different output from other implementations.

Reference Implementation

https://github.com/keybase/saltpack/encoding/basex seems to be correct and agree with other implementations:

package main

import (
    "encoding/base64"
    "fmt"

    "github.com/keybase/saltpack/encoding/basex"
)

func main() {
    for _, src := range [][]byte{
        []byte("Hello, 世界"),
        []byte("Hello World"),
        {0, 0, 0, 0, 255, 255, 255, 255},
        {255, 255, 255, 255, 0, 0, 0, 0},
    } {
        // Uses the GMP character set "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
        b62 := basex.Base62StdEncoding.EncodeToString(src)
        b64 := base64.RawURLEncoding.EncodeToString(src)

        fmt.Printf("Base64: %s (%d chars)\n", b64, len(b64))
        fmt.Printf("Base62: %s (%d chars)\n", b62, len(b62))
        fmt.Println("")
    }
}

Results:

Raw   : "Hello, 世界"
Base64: SGVsbG8sIOS4lueVjA (18 chars)
Base62: 1wJfrzvdbuFbL65vcS (18 chars)

Raw   : "Hello World"
Base64: SGVsbG8gV29ybGQ (15 chars)
Base62: 73XpUgyMwkGr29M (15 chars)

Raw   : [ 0, 0, 0, 0, 255, 255, 255, 255 ]
Base64: AAAAAP____8 (11 chars)
Base62: 000004gfFC3 (11 chars)

Raw   : [ 255, 255, 255, 255, 0, 0, 0, 0 ]
Base64: _____wAAAAA (11 chars)
Base62: LygHZwPV2MC (11 chars)

Your results

Base64: SGVsbG8sIOS4lueVjA (18 chars)
Base62: 4ov7Dg7P22BoCIAFQD02G (21 chars)

Base64: SGVsbG8gV29ybGQ (15 chars)
Base62: 4ov7Dg7Oq5p17cS01c (18 chars)

Base64: AAAAAP____8 (11 chars)
Base62: 000000H31H31 (12 chars)

Base64: _____wAAAAA (11 chars)
Base62: H31H31000000 (12 chars)

abhishekjhaji commented 2 years ago

Sorry, I don't get the issue. I assume you are concerned that the encodings differ from those of the other library? I would like to understand why you think that different encoding values are an issue. The output will always depend on the underlying implementation.

The following two conditions are sufficient to test for correctness:

The output should contain only characters from the selected 62-character set.
The encoded string can be decoded back to the original string.

Looking at the output, it seems that the library you used is more efficient, as it can encode the same string in fewer characters.

coolaj86 commented 2 years ago

"Would like to understand why do you think that different values for encoding is an issue?"

"It will always be dependent on the underlying implementation."

Yes, but if one base64 library can't decode the output of another base64 library, that's not "an implementation detail"; it's a bug.

Likewise, if you're creating your own methodology for a different type of base62 than the de facto standard that's been around since the 90s with GMP and GnuPG, then you're literally creating your own definition of "base62".

Also, I don't think that you're simply using a different alphabet or encoding scheme. I think that your math is actually incorrect, but in a way that just happens to work in both directions.

Just as when converting between hex and decimal, we could use a special alphabet rather than 0-9A-F; however, the algorithm should never change. The conversion itself is fixed by the mathematics.
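
To make the point about the algorithm concrete, here is a minimal sketch of base62 encoding as plain positional base conversion in Go. It is not the code of saltpack or go-base62: `encodeBase62` and its `width` padding parameter are hypothetical names chosen for illustration, and the sketch assumes the whole input is treated as one big-endian unsigned integer (libraries may split longer inputs into blocks, which this does not attempt). The alphabet is the GMP character set quoted in the reference program.

```go
package main

import (
	"fmt"
	"math/big"
	"strings"
)

// GMP character set, as quoted in the reference program in this thread.
const gmpAlphabet = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

// encodeBase62 is a hypothetical helper. It interprets src as one big-endian
// unsigned integer, writes that integer in base 62, and left-pads the result
// with the zero digit up to width characters.
func encodeBase62(src []byte, width int) string {
	n := new(big.Int).SetBytes(src)
	base := big.NewInt(62)
	rem := new(big.Int)

	// Repeated division by 62 yields the digits from least to most significant.
	var digits []byte
	for n.Sign() > 0 {
		n.DivMod(n, base, rem)
		digits = append(digits, gmpAlphabet[rem.Int64()])
	}

	// Reverse so that the most significant digit comes first.
	for i, j := 0, len(digits)-1; i < j; i, j = i+1, j-1 {
		digits[i], digits[j] = digits[j], digits[i]
	}

	s := string(digits)
	if len(s) < width {
		s = strings.Repeat(string(gmpAlphabet[0]), width-len(s)) + s
	}
	return s
}

func main() {
	// Same 8-byte input as in the results above; 11 is the length reported there.
	fmt.Println(encodeBase62([]byte{0, 0, 0, 0, 255, 255, 255, 255}, 11)) // prints 000004gfFC3
}
```

Swapping `gmpAlphabet` for another 62-character string changes which symbols appear, but not the number of digits or the division steps that produce them.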

abhishekjhaji / go-base62

Does not agree with GMP / GnuPG / Saltpack Base62 #1
