bgamari / trac-to-remarkup

Moved to GitLab: https://gitlab.haskell.org/bgamari/trac-to-remarkup
BSD 3-Clause "New" or "Revised" License

Abbreviations are mis-transliterated in Wiki page names #38

Open bgamari opened 5 years ago

bgamari commented 5 years ago

A significant fraction of the pages in the Status/ wiki namespace are mis-named. Specifically, the string GHC is transliterated as gh-c. For instance, Status/GHC-7.10.3 is translated to status/gh-c-7.10.3.

I suspect this isn't worth fixing, as it's quite easy to fix up after the fact, but I thought I should at least record the infelicity.

bgamari commented 5 years ago

I am actually starting to think that maybe we do want to fix this. This mistake is made for numerous abbreviations including:

Even if we fixed these, I'm generally concerned that the change in capitalization will result in broken links. We should at the very least produce a name mapping that can be used to generate an nginx redirection table.

tdammers commented 5 years ago

I can see where this goes wrong. We use the casing library to convert these names, and specifically, we use fromHumps to parse the original name into "words". That parser expects camel case or Pascal case, and specifically the .NET flavor, where abbreviations longer than 2 characters are written like words (e.g. XmlParser, not XMLParser).

I'll see if I can come up with a better approach to parsing these.

tdammers commented 5 years ago

It seems that applying the possible parsers in the right order does the trick: instead of fromHumps, we'll use fromKebab >=> fromSnake >=> fromHumps. In this particular case, fromKebab will split the name into "GHC" and "7.10.3", and then fromHumps will operate on "GHC", which it leaves intact because no uppercase letter in it is followed by a non-uppercase one.
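To illustrate why the ordering matters, here is a minimal sketch using only base. The functions splitKebab, splitSnake, splitHumps, oldMangle and newMangle are hypothetical stand-ins written for this example; they approximate the behaviour described above and are not the casing library's actual code.

```haskell
import Control.Monad ((>=>))
import Data.Char (isUpper, toLower)
import Data.List (intercalate)

-- Split on a separator character, dropping empty pieces.
splitOnChar :: Char -> String -> [String]
splitOnChar sep s = case break (== sep) s of
  (w, [])       -> [w | not (null w)]
  (w, _ : rest) -> [w | not (null w)] ++ splitOnChar sep rest

-- Hypothetical stand-ins for the kebab-case and snake_case parsers.
splitKebab, splitSnake :: String -> [String]
splitKebab = splitOnChar '-'
splitSnake = splitOnChar '_'

-- Hypothetical stand-in for the hump parser: in a run of capitals that is
-- followed by a non-capital, the last capital starts the next word.
splitHumps :: String -> [String]
splitHumps "" = []
splitHumps s@(c : _)
  | isUpper c =
      let (ups, rest) = span isUpper s
      in if null rest
           then [ups] -- all capitals: keep as a single word
           else
             let (lows, rest') = break isUpper rest
                 headWord = init ups
                 nextWord = last ups : lows
             in [headWord | not (null headWord)] ++ [nextWord] ++ splitHumps rest'
  | otherwise =
      let (lows, rest) = break isUpper s
      in lows : splitHumps rest

-- Join words with dashes, lowercased.
toKebab :: [String] -> String
toKebab = intercalate "-" . map (map toLower)

-- Hump parsing alone reproduces the mis-transliteration.
oldMangle :: String -> String
oldMangle = toKebab . splitHumps

-- Kleisli composition in the list monad: each parser refines the
-- words produced by the previous one.
newMangle :: String -> String
newMangle = toKebab . (splitKebab >=> splitSnake >=> splitHumps)
```

Loaded into GHCi, oldMangle "GHC-7.10.3" gives "gh-c-7.10.3", whereas newMangle "GHC-7.10.3" gives "ghc-7.10.3", because the dash is consumed before the hump parser ever sees it.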

I'll push the fix once I'm done testing.

tdammers commented 5 years ago

The capitalization change, btw., is inevitable, as Gitlab will automatically lowercase everything; we inject dashes to keep it readable and comply with Gitlab's naming conventions.

If we want external links to remain functional, we will need a translation table though, or a very clever way of replicating the name mangling on the fly.

tdammers commented 5 years ago

9e5c2ba6fe023373dbb0f29fb5f356206e75fc17 fixes the name mangling so that GHC-7.10.3 becomes ghc-7.10.3 rather than gh-c-7.10.3.

We still need to emit a mapping though.

tdammers commented 5 years ago

Oh, and we need nginx rewrites anyway, because the old links will point to /trac/wikis/ghc/..., whereas the new ones need to go to /ghc/ghc/wikis.

tdammers commented 5 years ago

a467e24dd989efcfaaf7b51928821b8c73c33583 adds an nginx rewrite rule generator. Generated rules are appended to rewrite.nginx in the CWD; for a clean set of rules, one should clear out this file prior to running an import, post-process it with something like sort -u to remove duplicates, and then paste it into the nginx configuration in the appropriate location. (Actual use of these rules is untested as of yet.)
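For readers who have not looked at that commit, the idea can be sketched roughly as follows. This is not the code from the commit: rewriteRule, escapeRegex and dedupe are hypothetical names, the rule format is an assumption based on nginx's standard rewrite directive, and the two URL prefixes are the ones mentioned earlier in this thread.

```haskell
import Data.List (nub, sort)

-- Escape common regex metacharacters so that a page name such as
-- "GHC-7.10.3" matches literally (an unescaped dot matches any character).
escapeRegex :: String -> String
escapeRegex = concatMap esc
  where
    esc c
      | c `elem` ".+?*()[]^$|\\" = ['\\', c]
      | otherwise = [c]

-- Build one rewrite rule from an old Trac page name and its new wiki slug.
-- The output format is an assumption, not the generator's actual output.
rewriteRule :: String -> String -> String
rewriteRule oldName newName =
  "rewrite ^/trac/wikis/ghc/" ++ escapeRegex oldName
    ++ "$ /ghc/ghc/wikis/" ++ newName ++ " permanent;"

-- In-process equivalent of post-processing the file with sort -u.
dedupe :: [String] -> [String]
dedupe = nub . sort
```

Deduplicating in the generator rather than with sort -u would avoid the manual clean-up step, at the cost of holding all rules in memory during the import.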

bgamari commented 5 years ago

> The capitalization change, btw., is inevitable, as Gitlab will automatically lowercase everything; we inject dashes to keep it readable and comply with Gitlab's naming conventions.

Is this really true? Looking at https://gitlab.staging.haskell.org/ghc/ghc/wikis/Trac-Ticket-Import I'm a bit doubtful. In fact, I think things would be significantly more readable if we did preserve capitalization. The only thing we need to change is whitespace.
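If capitalization were preserved, the mangling could shrink to something like the sketch below. The name slugKeepCase is hypothetical, and the assumption that replacing whitespace with dashes is all GitLab requires is taken from the comment above rather than verified.

```haskell
import Data.List (intercalate)

-- Replace each run of whitespace with a single dash and leave
-- capitalization untouched.
slugKeepCase :: String -> String
slugKeepCase = intercalate "-" . words
```

Under this scheme "Trac Ticket Import" would become "Trac-Ticket-Import", and a name without whitespace such as "Status/GHC-7.10.3" would pass through unchanged.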