BurntSushi / ucd-generate

A command line tool to generate Unicode tables as source code.

Apache License 2.0

Add support for parsing files under `extracted/` #46

Closed inquisitivecrystal closed 2 years ago

inquisitivecrystal commented 3 years ago

Rust needs the ability to parse extracted/DerivedNumericValues.txt as part of rust-lang/rust#84056. This adds parsing support for that file and all the other files under extracted/.
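For readers unfamiliar with the file, here is a minimal sketch of what parsing one row of `extracted/DerivedNumericValues.txt` involves. It assumes the row layout `codepoint(s) ; decimal value ; ; rational value # comment`. The `NumericValue` type and `parse_line` function are hypothetical names for illustration only; this is not the ucd-parse API and not the code added by this PR.

```rust
/// One parsed row. Hypothetical type for this sketch, not a ucd-parse type.
#[derive(Debug, PartialEq)]
pub struct NumericValue {
    pub start: u32,       // first codepoint of the row (inclusive)
    pub end: u32,         // last codepoint of the row (inclusive)
    pub numerator: i64,   // numerator of the rational value
    pub denominator: i64, // denominator (1 for whole numbers)
}

/// Parse a single row, returning `None` for blank lines, comment-only
/// lines, and anything that does not match the assumed layout.
pub fn parse_line(line: &str) -> Option<NumericValue> {
    // Drop the trailing `# ...` comment and surrounding whitespace.
    let data = line.split('#').next()?.trim();
    if data.is_empty() {
        return None;
    }

    // Assumed layout: codepoints ; decimal ; (empty) ; rational
    let fields: Vec<&str> = data.split(';').map(str::trim).collect();
    if fields.len() != 4 {
        return None;
    }

    // Field 0 is either a single hex codepoint or a `start..end` range.
    let (start, end) = match fields[0].split_once("..") {
        Some((a, b)) => (
            u32::from_str_radix(a, 16).ok()?,
            u32::from_str_radix(b, 16).ok()?,
        ),
        None => {
            let cp = u32::from_str_radix(fields[0], 16).ok()?;
            (cp, cp)
        }
    };

    // Field 3 is either a whole number or a fraction such as `-1/2`.
    let (numerator, denominator) = match fields[3].split_once('/') {
        Some((n, d)) => (n.parse::<i64>().ok()?, d.parse::<i64>().ok()?),
        None => (fields[3].parse::<i64>().ok()?, 1),
    };

    Some(NumericValue { start, end, numerator, denominator })
}
```

The sketch reads the rational field rather than the decimal one so that values such as one half stay exact.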

inquisitivecrystal commented 2 years ago

@BurntSushi Sorry for the nag, but can I ask for an update on this? If this is too large, I'd be happy to separate out the portion that we actually need, extracted/DerivedNumericValues.txt, to make it easier to review.

inquisitivecrystal commented 2 years ago

Thanks so much for merging this. I really appreciate it, especially as I know things have been so busy for you. I'm glad https://github.com/rust-lang/rust/issues/84056 is finally unblocked! 🎉

BurntSushi commented 2 years ago

No problem and sorry it took so long! Incidentally, I didn't realize this was blocking work for std (although I now see you did link it in your initial comment, whoops). Is ucd-generate used to generate the Unicode tables for std? I didn't know about that.

inquisitivecrystal commented 2 years ago

Yep, the unicode-table-generator tool used to make the standard library's unicode tables uses ucd-parse for parsing. It does its own table generation, because the space requirements of the standard library are a bit bespoke.

I also mentioned the std element a few times in our Zulip conversations. It's not a huge deal though, especially as it seems like the work this was blocking may not be a good idea anyway.

BurntSushi commented 2 years ago

@inquisitivecrystal Ah interesting. I bet some of those space saving tricks would be useful for regex-syntax too. See #30 and #39 for some ideas in this direction.

> I also mentioned the std element a few times in our Zulip conversations. It's not a huge deal though, especially as it seems like the work this was blocking may not be a good idea anyway.

Ug sorry. Yeah, my brain has been a pile of mush for the past couple of years. I've only recently just started coming back up for air and getting more time to devote to projects.

BurntSushi commented 2 years ago

To elaborate a bit more on regex-syntax, basically, the tables are read when compiling a regex and not when searching. While regex compilation still needs to be reasonably fast, it would be acceptable to make using the Unicode tables slower if it allowed us to make the tables smaller. As it stands currently, I've basically invested no work or time in shrinking the tables at all. They are just sorted sequences of codepoint ranges. regex-syntax embeds a considerable amount of Unicode data (which can be disabled using Cargo features at least), but all of it is included by default. So space savings there would be a huge win.
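As a rough illustration of the "sorted sequences of codepoint ranges" representation described above, the sketch below does a membership test with a binary search. The table contents and the names `ASCII_ALNUM` and `contains` are made up for the example; they are not taken from regex-syntax or from ucd-generate's output.

```rust
use std::cmp::Ordering;

/// Hypothetical table: sorted, non-overlapping, inclusive codepoint ranges.
/// The three ranges are illustrative only, not real generated data.
pub const ASCII_ALNUM: &[(char, char)] = &[('0', '9'), ('A', 'Z'), ('a', 'z')];

/// Return true if `c` falls inside any range of `table`.
/// Runs in O(log n) comparisons because the ranges are sorted.
pub fn contains(table: &[(char, char)], c: char) -> bool {
    table
        .binary_search_by(|&(start, end)| {
            if end < c {
                // The whole range lies before `c`.
                Ordering::Less
            } else if start > c {
                // The whole range lies after `c`.
                Ordering::Greater
            } else {
                // `start <= c <= end`, so `c` is in this range.
                Ordering::Equal
            }
        })
        .is_ok()
}
```

Lookup is fast and simple in this form, but each range costs two full `char` values (8 bytes), which is the space that a more compact encoding would try to recover.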

BurntSushi commented 2 years ago

So what I'm trying to say is that if rustc has these super optimized/compressed formats for codepoint tables, it could be worth porting them to ucd-generate. With that said, it can be frustrating to rely on an external project for such a key thing inside of std. But, I wanted to throw it out there that there is almost certainly demand for the Herculean efforts being made elsewhere. :-)

thomcc commented 2 years ago

FWIW, the smallest tables I know of are in QuickJS's libunicode (https://bellard.org/quickjs/), which manages to fit all boolean properties, general categories, scripts, and script extensions in around 40 KB. Many of them require an unpacking step, but several can be modified to have an index (and the code for that is already in the repo). It's worth taking a look; the basic idea is just to use a chunked RLE on most of the tables.

It's considerably smaller than the tables that libstd uses, but also slower, even with the index.
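To make the run-length idea concrete, here is a simplified sketch of a boolean property stored as alternating run lengths with a sparse index over fixed-size chunks of runs. It is an assumption-laden illustration of the general technique, not QuickJS's actual byte format; `CHUNK`, `encode_runs`, `build_index`, and `contains_indexed` are hypothetical names.

```rust
/// Runs per index entry. Must be even so every chunk starts on an
/// "outside the set" run and parity within a chunk stays meaningful.
pub const CHUNK: usize = 2;

/// Encode sorted, non-overlapping inclusive ranges as alternating run
/// lengths: codepoints outside the set, then inside, then outside, ...
pub fn encode_runs(ranges: &[(u32, u32)]) -> Vec<u32> {
    let mut runs = Vec::new();
    let mut next = 0u32; // first codepoint not yet covered by a run
    for &(start, end) in ranges {
        runs.push(start - next); // gap before the range (outside the set)
        runs.push(end - start + 1); // the range itself (inside the set)
        next = end + 1;
    }
    runs
}

/// Record the starting codepoint of every chunk of `CHUNK` runs.
pub fn build_index(runs: &[u32]) -> Vec<u32> {
    let mut index = Vec::new();
    let mut pos = 0u32;
    for (i, &len) in runs.iter().enumerate() {
        if i % CHUNK == 0 {
            index.push(pos);
        }
        pos += len;
    }
    index
}

/// Membership test: binary search the index for the right chunk, then
/// walk only the runs inside that chunk.
pub fn contains_indexed(runs: &[u32], index: &[u32], cp: u32) -> bool {
    // Find the last chunk whose starting codepoint is <= cp.
    let chunk = match index.binary_search(&cp) {
        Ok(i) => i,
        Err(0) => return false, // only possible for an empty table
        Err(i) => i - 1,
    };
    let first = chunk * CHUNK;
    let last = (first + CHUNK).min(runs.len());
    let mut pos = index[chunk];
    for (i, &len) in runs[first..last].iter().enumerate() {
        pos += len;
        if cp < pos {
            // Odd positions within a chunk are "inside the set" runs.
            return i % 2 == 1;
        }
    }
    false // cp lies past the final run
}
```

Run lengths are usually small, so a real encoder could pack them into one or two bytes each instead of a full `u32`, which is where the space saving would come from. The index trades a little of that space back for faster lookups, in line with the size versus speed trade-off mentioned above.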