Closed inquisitivecrystal closed 2 years ago
@BurntSushi Sorry for the nag, but can I ask for an update on this? If this is too large, I'd be happy to separate out the portion that we actually need, extracted/DerivedNumericValues.txt, to make it easier to review.
Thanks so much for merging this. I really appreciate it, especially as I know things have been so busy for you. I'm glad https://github.com/rust-lang/rust/issues/84056 is finally unblocked! 🎉
No problem, and sorry it took so long! Incidentally, I didn't realize this was blocking work for std (although I now see you did link it in your initial comment, whoops). Is ucd-generate used to generate the Unicode tables for std? I didn't know about that.
Yep, the unicode-table-generator tool used to make the standard library's Unicode tables uses ucd-parse for parsing. It does its own table generation, because the space requirements of the standard library are a bit bespoke.
I also mentioned the std element a few times in our Zulip conversations. It's not a huge deal though, especially as it seems like the work this was blocking may not be a good idea anyway.
@inquisitivecrystal Ah, interesting. I bet some of those space-saving tricks would be useful for regex-syntax too. See #30 and #39 for some ideas in this direction.
> I also mentioned the std element a few times in our Zulip conversations. It's not a huge deal though, especially as it seems like the work this was blocking may not be a good idea anyway.
Ug sorry. Yeah, my brain has been a pile of mush for the past couple of years. I've only recently just started coming back up for air and getting more time to devote to projects.
To elaborate a bit more on regex-syntax: the tables are read when compiling a regex, not when searching. While regex compilation still needs to be reasonably fast, it would be acceptable to make using the Unicode tables slower if that allowed us to make the tables smaller. As it stands, I've invested basically no work or time in shrinking the tables at all. They are just sorted sequences of codepoint ranges. regex-syntax embeds a considerable amount of Unicode data, and all of it is included by default (although it can be disabled using Cargo features). So space savings there would be a huge win.
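For readers unfamiliar with the "sorted sequences of codepoint ranges" layout mentioned above, here is a minimal sketch of how such a table can be queried with a binary search. The table name `ASCII_ALNUM` and the function `contains` are hypothetical and chosen only for illustration; this is not the actual code or data from regex-syntax.

```rust
use std::cmp::Ordering;

/// A hypothetical table of sorted, non-overlapping, inclusive codepoint
/// ranges: ASCII digits, uppercase letters, and lowercase letters.
const ASCII_ALNUM: &[(u32, u32)] = &[(0x30, 0x39), (0x41, 0x5A), (0x61, 0x7A)];

/// Returns true if `c` falls inside one of the ranges in `table`.
/// The table must be sorted and non-overlapping for the search to work.
fn contains(table: &[(u32, u32)], c: char) -> bool {
    let cp = c as u32;
    table
        .binary_search_by(|&(start, end)| {
            if end < cp {
                // This range lies entirely below the codepoint.
                Ordering::Less
            } else if start > cp {
                // This range lies entirely above the codepoint.
                Ordering::Greater
            } else {
                Ordering::Equal
            }
        })
        .is_ok()
}
```

Lookups are fast with this layout, but every range costs two full integers, which is where a more compact encoding could save space.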
So what I'm trying to say is that if rustc has these super optimized/compressed formats for codepoint tables, it could be worth porting them to ucd-generate. With that said, it can be frustrating to rely on an external project for such a key thing inside of std. But I wanted to throw it out there that there is almost certainly demand for the Herculean efforts being made elsewhere. :-)
FWIW, the smallest tables I know of are in libunicode from QuickJS (https://bellard.org/quickjs/), which manages to fit all boolean properties, general categories, scripts, and script extensions in around 40 KB. Many of them require an unpacking step, but several can be modified to have an index (and the code for that is already in the repo). It's worth taking a look; the basic idea is just to use a chunked RLE on most of the tables.
It's considerably smaller than the tables that libstd uses, but also slower, even with the index.
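To make the run-length idea concrete, here is a minimal sketch of run-length encoding a boolean property. It is an assumption-laden illustration, not libunicode's actual format: it uses plain alternating run lengths with no chunking and no index, and the names `encode_runs` and `contains_rle` are hypothetical.

```rust
/// Converts sorted, non-overlapping, inclusive codepoint ranges into
/// alternating run lengths. The first run counts codepoints *without* the
/// property, the second counts codepoints *with* it, and so on.
fn encode_runs(ranges: &[(u32, u32)]) -> Vec<u32> {
    let mut runs = Vec::with_capacity(ranges.len() * 2);
    let mut next = 0u32; // first codepoint not yet covered by a run
    for &(start, end) in ranges {
        runs.push(start - next); // gap before this range
        runs.push(end - start + 1); // the range itself
        next = end + 1;
    }
    runs
}

/// Looks up `cp` by walking the runs from the beginning.
/// Runs at odd positions are the ones that have the property.
fn contains_rle(runs: &[u32], cp: u32) -> bool {
    let mut pos = 0u32; // one past the last codepoint covered so far
    for (i, &len) in runs.iter().enumerate() {
        pos += len;
        if cp < pos {
            return i % 2 == 1;
        }
    }
    // Past the final run: the property does not apply.
    false
}
```

Most run lengths are small, so they can be stored in fewer bytes than full codepoints. The cost is that a lookup has to scan from the start, which matches the trade-off described above: an index over chunks of runs would shorten that scan at the price of some extra space.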
Rust needs the ability to parse extracted/DerivedNumericValues.txt as part of rust-lang/rust#84056. This adds parsing support for that file and all the other files under extracted/.
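As a rough illustration of what parsing this file involves, here is a minimal standalone sketch. It assumes each data line has the form `codepoint(s) ; decimal value ; ; rational value # comment`, and it does not use or reproduce the ucd-parse API; the type `NumericEntry` and the function `parse_line` are hypothetical names.

```rust
/// One parsed entry: an inclusive codepoint range and its numeric value
/// expressed as a rational number.
#[derive(Debug, PartialEq)]
struct NumericEntry {
    start: u32,
    end: u32,
    numerator: i64,
    denominator: i64,
}

/// Parses one line, returning None for blank, comment-only, or malformed lines.
fn parse_line(line: &str) -> Option<NumericEntry> {
    // Drop the trailing comment, then skip lines with no data.
    let data = line.split('#').next()?.trim();
    if data.is_empty() {
        return None;
    }
    let fields: Vec<&str> = data.split(';').map(str::trim).collect();
    if fields.len() < 4 {
        return None;
    }
    // Field 0 is either a single codepoint "XXXX" or a range "XXXX..YYYY".
    let (start, end) = match fields[0].split_once("..") {
        Some((s, e)) => (
            u32::from_str_radix(s, 16).ok()?,
            u32::from_str_radix(e, 16).ok()?,
        ),
        None => {
            let cp = u32::from_str_radix(fields[0], 16).ok()?;
            (cp, cp)
        }
    };
    // Field 3 is the value as an integer "N" or a fraction "N/D".
    let (numerator, denominator) = match fields[3].split_once('/') {
        Some((n, d)) => (n.parse::<i64>().ok()?, d.parse::<i64>().ok()?),
        None => (fields[3].parse::<i64>().ok()?, 1),
    };
    Some(NumericEntry { start, end, numerator, denominator })
}
```

The sketch reads the rational field rather than the decimal one so that fractional values are kept exact instead of being rounded through floating point.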