MD5-based naming scheme breaks due to "/" in base64-encoded content

Doraku / DefaultDocumentation

Create a simple markdown documentation from the Visual Studio xml one.

MIT No Attribution


MD5-based naming scheme breaks due to "/" in base64-encoded content #119

Closed madelson closed 2 years ago

madelson commented 2 years ago

Thanks for creating this library! I'm trying to use it to generate API documentation for my projects, with the NameAndMd5Mix mode to avoid the long-path issues I'm seeing with the default naming scheme.

The problem is that the MD5 hashes are encoded with base 64, which can contain the / character. This causes files to end up in nested folders (e.g. see this file). This in turn breaks all relative links in the nested files (e.g. see the namespace link here).
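To illustrate the failure mode, here is a minimal sketch. The `name + "." + hash + ".md"` file-name format and all helper names are assumptions made for illustration, not DefaultDocumentation's actual code, and the all-0xFF hash is a hypothetical value picked because its base64 form is guaranteed to contain `/` (a real MD5 hash contains one only some of the time).

```csharp
using System;

public static class Base64SlashDemo
{
    // Hypothetical 16-byte hash (MD5 hashes are 16 bytes); every byte is 0xFF
    // so that the base64 text is guaranteed to contain '/'.
    public static byte[] CreateSampleHash()
    {
        var bytes = new byte[16];
        for (var i = 0; i < bytes.Length; i++) { bytes[i] = 0xFF; }
        return bytes;
    }

    // Assumed naming format for illustration: "<name>.<base64 hash>.md".
    public static string BuildFileName(string name, byte[] hash) =>
        name + "." + Convert.ToBase64String(hash) + ".md";

    // Counts how many directory levels a relative file name implies.
    public static int DirectoryDepth(string relativePath) =>
        relativePath.Split('/').Length - 1;
}
```

Any `DirectoryDepth` above zero means the "file name" is really a nested path, so a relative link written for a flat output folder no longer resolves from inside it.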

I think an easy fix would be to use hex encoding rather than base 64. This has the added advantage of being case-insensitive which tends to be better for URLs.

If you're interested, I'd be happy to submit a PR.

Doraku commented 2 years ago

Oh, that's a dumb oversight on my part: I handled it correctly for Md5 but forgot to do the same for NameAndMd5Mix >_> It might be simpler and safer to use hex encoding like you said, but wouldn't that produce longer hashes in the file names?

madelson commented 2 years ago

@Doraku yes the solution you linked there should work and fixes the nesting problem. However, doesn't ? have special meaning in URLs (starts the query string?). I could see this causing issues depending on where the docs are hosted.

There is also still the case-sensitivity problem, although I suspect that the risk of an actual collision there is pretty low, comparable to knocking a couple of bytes off the hash. Unifying forward and back slashes as ? similarly increases collision odds.

Hex will lead to hashes that are a bit longer (32 chars vs. 24), so maybe that's a concern. For my use-case it would not be.
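The length trade-off can be checked with a short sketch. The class and method names here are hypothetical; only `MD5`, `Convert.ToBase64String`, and `BitConverter.ToString` are standard .NET calls.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class HashNameEncodings
{
    // Computes the 16-byte MD5 hash of a UTF-8 string.
    public static byte[] Md5Of(string text)
    {
        using (var md5 = MD5.Create())
        {
            return md5.ComputeHash(Encoding.UTF8.GetBytes(text));
        }
    }

    // Base64: 24 chars for 16 bytes (including "==" padding); may contain '/' and '+'.
    public static string ToBase64(byte[] hash) => Convert.ToBase64String(hash);

    // Hex: 32 chars for 16 bytes, digits 0-9 and A-F only.
    // BitConverter.ToString yields "D4-1D-..."; removing the dashes leaves plain upper-case hex.
    public static string ToHex(byte[] hash) => BitConverter.ToString(hash).Replace("-", "");
}
```

Hex costs 8 extra characters per name but never produces `/`, `+`, or `=`, and two names can never differ by letter case alone.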

madelson commented 2 years ago

Another option would be to use a custom alphabet for the encoding, for example all upper-case letters and digits (36 chars):

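using System;
using System.Linq;
using System.Numerics;
using System.Text;

// md5 is the 16-byte MD5 hash of the item being named.
Encode(md5, "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ");

static string Encode(byte[] hash, ReadOnlySpan<char> alphabet)
{
    // Append a zero byte so BigInteger (little-endian, signed) treats the hash as non-negative.
    var bi = new BigInteger(hash.Concat(new byte[] { 0 }).ToArray());
    var result = new StringBuilder();
    // Repeated division by the alphabet size; digits come out least-significant first.
    while (bi != 0)
    {
        bi = BigInteger.DivRem(bi, alphabet.Length, out var remainder);
        result.Append(alphabet[(int)remainder]);
    }
    // An all-zero hash produces no digits, so emit a single zero digit.
    if (result.Length == 0) { result.Append(alphabet[0]); }
    return result.ToString();
}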

This gives hash strings of 25 chars, or occasionally fewer when the last bytes of the hash contain enough zero bits (BigInteger reads the array as little-endian, so those are the high-order bits of the number). A padding solution could be added to guarantee constant length if desired. The nice thing about these hashes is that they only use very "safe" characters and are case-insensitive.
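One way the padding idea could look is sketched below. The class name, the `length` parameter, and the choice to pad with the alphabet's zero digit are assumptions for illustration, not code from the library or from this thread.

```csharp
using System;
using System.Numerics;
using System.Text;

public static class FixedLengthEncoder
{
    public const string Alphabet = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";

    // Encodes the hash in the given alphabet, padded to at least `length` characters.
    // For a 16-byte hash and a 36-character alphabet, 25 characters always suffice.
    public static string Encode(byte[] hash, string alphabet, int length)
    {
        // Copy into a buffer one byte longer; the extra zero byte keeps the
        // little-endian BigInteger value non-negative.
        var unsigned = new byte[hash.Length + 1];
        Array.Copy(hash, unsigned, hash.Length);
        var value = new BigInteger(unsigned);

        var result = new StringBuilder();
        while (!value.IsZero)
        {
            value = BigInteger.DivRem(value, alphabet.Length, out var remainder);
            result.Append(alphabet[(int)remainder]);
        }

        // Digits are emitted least-significant first, so appending the zero
        // digit at the end is the same as adding leading zeros to the number.
        while (result.Length < length) { result.Append(alphabet[0]); }
        return result.ToString();
    }
}
```

Calling `FixedLengthEncoder.Encode(md5, FixedLengthEncoder.Alphabet, 25)` then yields a 25-character name for every 16-byte hash, including the all-zero one, so the separate empty-result check is no longer needed.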

Doraku commented 2 years ago

That would actually be pretty cool (and safe) :)