BurntSushi / ucd-generate

A command line tool to generate Unicode tables as source code.
Apache License 2.0
93 stars 21 forks source link

Allow outputting tables as C code instead of Rust #16

Closed thomcc closed 4 years ago

thomcc commented 4 years ago

I did this a while ago, and it works great, but I didn't upstream it because I greatly doubted that it would be accepted (and to be clear I won't be offended it is not). That said, not even submitting a PR for it isn't good form, so if you're interested:


It emits fully standards-compliant C89, and can be used with the general-category, script, script-extension, property-bool, perl-word, case-folding-simple, grapheme-cluster-break, word-break, and sentence-break subcommands so long as you don't try to output a fancy format like a trie-set, etc. Clap should handle that sort of thing but if it made it through to the writer, it will hit one of the ensure_not_c cases (which you can use to get an idea of what isn't supported).

It ended up requiring some slight tweaks to the output of the rust code as well (this could be avoided, but seemed irrelevant):

  1. The header comment uses /* */ instead of // as rust supports both and C89 supports only the former
  2. Declarations come before uses now, because Rust doesn't care, but C does.

For C code it also emits a small usage note in the header, as there's an important doc note to avoid misuse, and there didn't appear to be anywhere else for it to go. Beyond that it does sorta expect the user to inspect the output to figure it out 🤷‍♂ .


I think it might not be obvious why this could be useful here, and so if you're on the fence about supporting it, my (weak) arguments for why to merge it are:

  1. No existing tool fills this niche, which is arguably more valuable for C than for Rust -- C doesn't have the ability to just pull something off of crates.io, and libc's handling of unicode is usually completely broken (if you can say it exists at all). For cross-platform unicode data libraries the only option is something like ICU, which is very heavyweight, or generating tables yourself.

  2. New commands/features/etc aren't forced to support --emit-c, and given that much already doesn't, it wouldn't be out of place. So IMO it's not like this would increase the amount of effort needed to add new features by very much.

That said, the other side of this is, a valid argument is "maybe people should stop writing C then" which, yeah. Anyway I won't be offended if this isn't merged.

data-man commented 4 years ago

@thomcc Thank you! Can you use uint8_t, uint16_t, etc. for more compact sizes?

thomcc commented 4 years ago

Possibly for some tables but in general no, not all codepoints fit in 16 or 8 bits, and it seems bad to generate a table where an update to Unicode (which added a codepoint with more bits to some category, for example) could completely break the code using the table (which assumes some table only has 16bit or 8bit items). I suspect supporting the trie output would probably be more effective for reducing size, although I don't have any evidence to back that up.

Also, those types are from stdint.h, which isn't available in C89, although it's probably available everywhere these days, and it's not like the current code is exceptionally portable to weird systems -- it tacitly assumes unsigned int is at least 32 bits.

BurntSushi commented 4 years ago

Hmm, this is a tough one. On the one hand, I sympathize with the utility of this. On the other hand, my strongest argument against this is the maintenance burden and the very likely possibility that the --emit-c output will break. ucd-generate doesn't really have test coverage. If it did, and tests covered the C output, then I might be more amenable to this. But adding test coverage for all the various outputs is a lot of work. For the most part, I've been relying on dog-fooding to check anything that's wrong and I'm very unlikely to dogfood the C output feature.

There's also the other argument that this opens the door for more language outputs. Not so sure I want to go down that route.

Could you perhaps expand a bit more on what you'd want to use this for?

thomcc commented 4 years ago

ucd-generate doesn't really have test coverage. If it did, and tests covered the C output, then I might be more amenable to this

Yeah, I was going to add tests, but realized none of the existing stuff had it, and that it would be a pain to find a way to hook it in because of that.

If it would make the difference for whether to merge this, I could try to find time to add smoke tests which generate c code and making sure they compile. (That's certainly not a guarantee of them working in the face of other changes, but would catch the worst issues, and might provide a starting point for adding more tests, e.g. for the rust parts).

Could you perhaps expand a bit more on what you'd want to use this for?

My use for it was pretty pointless, it was ultimately a way of removing an ICU dependency from a project I wrote a while ago. It served that fine, even if it was a bit larger of a yak than I had intended to shave when I started.

I don't have any future plans for it at the moment, but would reach for it again if it came up in similar situations.

I mean ultimately the uses for outputting C are the same as for Rust -- it isn't hard to imagine that there are just as many C or C++ projects which build unicode tables with an ad-hoc python script or whatever. (I'd guess way more -- deps are a much bigger hassle for C and C++ projects so they can't just do the equivalent add a unicode-rs crate equiv (You can add ICU, but this is a big can of worms), and it's not like either C or C++ provides even remotely good unicode tooling in their stdlib the way rust does).

I'm pretty sympathetic to all the reasons you mentioned against merging it though, which is why there were a few months between when I wrote it and when I submitted the PR -- seemed like a bit of a white elephant gift, I guess.

BurntSushi commented 4 years ago

I mean ultimately the uses for outputting C are the same as for Rust

Yeah I hear that. Mostly I was after specifically what you were using this for. e.g., I might be willing to spend more effort maintaining something like this if this PR directly translated into big wins for a lot of projects, or for a very large project. If it's just feeding a niche use at the moment, then I'm kind of inclined to not accept this unfortunately. You could always maintain the patch if you wanted.

FWIW, at some point, I'd like to rewrite the output generation code to use templates. Perhaps as part of that I will come up with some kind of test harness, in which case, I'd be more amenable to accepting something like this.

thomcc commented 4 years ago

e.g., I might be willing to spend more effort maintaining something like this if this PR directly translated into big wins for a lot of projects, or for a very large project. If it's just feeding a niche use at the moment, then I'm kind of inclined to not accept this unfortunately. You could always maintain the patch if you wanted.

Makes sense, and yep, that's not the case here. At the moment I'm lucky enough that all the large projects I work with already have rust in their build system.

Yeah, I'll probably just use the branch in my fork if it comes up again or something.

FWIW, at some point, I'd like to rewrite the output generation code to use templates

That does sound better, assuming it wouldn't be limiting. If you remember, and this happens, feel free to ping this issue or something. I do think it would be nice to have as an option, but it's not surprising to me that the maintenance burden outweighs the utility for you in the current state.

--

Alright, I didn't really think this would get merged, and was mostly providing a courtesy in case it was something you happened to want. Also, maybe others who need this could find this issue and make use of the branch on my fork (although it will probably have succumb to bitrot by then).

Anyway, I'll close the issue now.