eirproject / eir

Erlang ecosystem common IR
Apache License 2.0

Split out string interning into separate crate #14

Closed hansihe closed 5 years ago

hansihe commented 5 years ago

String interning currently lives in the eir crate. Since we only want to deal with actual string data in the frontend, this requires several other crates to depend on eir that really shouldn't. This also needs to be considered when merging the Erlang frontend.

It would probably make sense to make string interning live in a utility crate.

bitwalker commented 5 years ago

I'd suggest we take the interning bits and arena code from #12 and make them a new crate. I based the implementation on the rustc internals, as it's the most ergonomic interning API I've found yet (for a compiler anyway). It is especially useful for tokenizing source code: each symbol is only stored in memory once, so ASTs only end up storing usize integers rather than arbitrarily large strings.
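
To make the idea concrete, here is a minimal sketch of an interner in that spirit. It is not the code from #12 and not rustc's implementation; the names `Symbol` and `Interner` are assumptions for illustration, and it shares each string through `Rc<str>` rather than using an arena.

```rust
use std::collections::HashMap;
use std::rc::Rc;

/// Cheap, copyable handle to an interned string (hypothetical name).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct Symbol(usize);

/// Minimal single-threaded interner: each distinct string is allocated once
/// and shared between the lookup map and the index table.
#[derive(Default)]
pub struct Interner {
    // String contents -> symbol, used to detect already-interned strings.
    lookup: HashMap<Rc<str>, Symbol>,
    // Symbol index -> string, used to resolve a symbol back to text.
    strings: Vec<Rc<str>>,
}

impl Interner {
    pub fn new() -> Self {
        Self::default()
    }

    /// Returns the existing symbol for `s`, or allocates a new one.
    pub fn intern(&mut self, s: &str) -> Symbol {
        if let Some(&sym) = self.lookup.get(s) {
            return sym;
        }
        let sym = Symbol(self.strings.len());
        let shared: Rc<str> = Rc::from(s);
        self.strings.push(Rc::clone(&shared));
        self.lookup.insert(shared, sym);
        sym
    }

    /// Looks up the text behind a symbol produced by this interner.
    pub fn resolve(&self, sym: Symbol) -> &str {
        &self.strings[sym.0]
    }
}
```

With something like this, a token or AST node only needs to hold a `Symbol`, and comparing two identifiers becomes an integer comparison.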

archseer commented 5 years ago

There are also external implementations that I considered using previously, such as the one from servo. I built this one, but the manual work involved in defining all the constants is not great. There is also the one from nox.

Whatever is decided on, I'll probably end up using too :)

bitwalker commented 5 years ago

I think there are two use cases too: one for interning strings during compilation, and one for runtime atom tables. The design considerations/tradeoffs are different for each.

I like nam actually; I hadn't seen it before. It seems like a good fit for a runtime, but maybe not ideal for a compiler.

Performance-wise, you don't want locks at all in a compiler if you can avoid them (especially in the tokenizer). It's perfectly fine to have separate interned stores for multiple threads, as long as symbols don't cross thread boundaries without being translated/re-interned in the new thread. That is far less common than doing almost all processing for some set of files in a single thread.
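
A rough sketch of the lock-free, per-thread variant, assuming a `thread_local!` table; the names `Symbol`, `intern`, and `resolve` are hypothetical and do not come from any of the crates mentioned here. For brevity it stores each string twice (once as a map key, once in the index table).

```rust
use std::cell::RefCell;
use std::collections::HashMap;

/// Handle into the *current thread's* table (hypothetical name).
/// An index is only meaningful in the thread that produced it.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct Symbol(usize);

#[derive(Default)]
struct Table {
    lookup: HashMap<String, usize>,
    strings: Vec<String>,
}

thread_local! {
    // One independent table per thread, so no locking is needed.
    static TABLE: RefCell<Table> = RefCell::new(Table::default());
}

/// Interns `s` in the calling thread's table.
pub fn intern(s: &str) -> Symbol {
    TABLE.with(|t| {
        let mut t = t.borrow_mut();
        if let Some(&idx) = t.lookup.get(s) {
            return Symbol(idx);
        }
        let idx = t.strings.len();
        t.strings.push(s.to_owned());
        t.lookup.insert(s.to_owned(), idx);
        Symbol(idx)
    })
}

/// Resolves a symbol to an owned string using the calling thread's table.
pub fn resolve(sym: Symbol) -> String {
    TABLE.with(|t| t.borrow().strings[sym.0].clone())
}
```

To hand a symbol to another thread under this scheme, you would send the resolved text and call `intern` again on the receiving side, which is the translation step described above.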

In a runtime atom table like Erlang's, though, you do need something that is Sync and writable, without requiring exclusive access for reads, so the design in nam and others like it makes a lot of sense there.
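
For contrast, a minimal sketch of a shared atom table built only on `std::sync::RwLock`. This is not how nam or the other crates are implemented (they may well avoid locks entirely); `Atom` and `AtomTable` are hypothetical names used to show the shape of a table that is `Sync`, writable, and readable under a shared lock.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Handle to an atom in a shared table (hypothetical name).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct Atom(usize);

#[derive(Default)]
struct Inner {
    lookup: HashMap<Arc<str>, usize>,
    names: Vec<Arc<str>>,
}

/// Atom table that can be shared across threads behind an `Arc`.
#[derive(Default)]
pub struct AtomTable {
    inner: RwLock<Inner>,
}

impl AtomTable {
    pub fn new() -> Self {
        Self::default()
    }

    /// Returns the atom for `name`, inserting it if it is not present yet.
    pub fn intern(&self, name: &str) -> Atom {
        // Fast path: existing atoms only need a shared read lock.
        {
            let inner = self.inner.read().unwrap();
            if let Some(&idx) = inner.lookup.get(name) {
                return Atom(idx);
            }
        }
        // Slow path: take the write lock and re-check, since another
        // thread may have inserted the same name in the meantime.
        let mut inner = self.inner.write().unwrap();
        if let Some(&idx) = inner.lookup.get(name) {
            return Atom(idx);
        }
        let idx = inner.names.len();
        let shared: Arc<str> = Arc::from(name);
        inner.names.push(Arc::clone(&shared));
        inner.lookup.insert(shared, idx);
        Atom(idx)
    }

    /// Returns the name of an atom produced by this table.
    pub fn name(&self, atom: Atom) -> Arc<str> {
        Arc::clone(&self.inner.read().unwrap().names[atom.0])
    }
}
```

Readers still contend with writers here, which is exactly the cost you would want to keep out of a tokenizer, but it is acceptable when new atoms are created rarely compared to lookups.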

That said, I'm not super opinionated one way or the other, but I do think it may be worth making the choice separately for a compiler and runtime.

hansihe commented 5 years ago

Done in https://github.com/eirproject/eir/pull/15