Alexhuszagh / rust-lexical

Fast numeric to- and from-string conversion routines.

[FEATURE] Ignore/check base prefixes when using `format` *and* `radix` #47

Closed · Evrey closed this 3 years ago

Evrey commented 4 years ago

Problem

With the `format` and `radix` features enabled, I can e.g. parse hexadecimal floats using my language's own syntax. However, that syntax includes base prefixes, in this case 0x. To make matters worse, base prefixes usually appear between the sign and the integer digits of a number literal.

Solution

0b, 0o, 0, 0d, and 0x, as well as their upper-case variants, are all common base prefixes in programming languages: 0b for base-2, 0o for base-8, a bare leading 0 as a terrible way of saying 0o, 0d as an optional base-10 prefix for the sake of symmetry, and 0x for base-16. I suggest not treating a leading 0 as meaning base-8. That convention is just terrible, a source of countless bugs, and should thus be left to the user to work around.

While pretty much any radix is possible, I suggest only handling these four bases. I don't know of any common prefixes for other radices.

The following extensions to the format bit-packed config should be made:

This leaves 2 bytes and 4 bits reserved when using a u128 or a second u64 for the format settings.

Prerequisites

Alternatives

Currently I check the sign myself, memorise it, then skip sign and base prefix to radix-parse the number literal I got. This abuses the fact that flipping the sign of a float is a lossless operation. However, it's annoying and unergonomic.
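
A minimal sketch of that workaround, with the radix-aware parse behind a closure so that no particular lexical-core entry point is assumed (the closure would wrap whichever radix-aware parse function the `radix` feature exposes):

fn parse_prefixed_hex_float<F>(literal: &str, parse_hex: F) -> Option<f64>
where
    F: Fn(&[u8]) -> Option<f64>,
{
    // 1. Check and memorise the sign ourselves.
    let (negative, rest) = match literal.bytes().next()? {
        b'-' => (true, &literal[1..]),
        b'+' => (false, &literal[1..]),
        _ => (false, literal),
    };
    // 2. Skip the base prefix, which sits between the sign and the digits.
    let digits = rest.strip_prefix("0x").or_else(|| rest.strip_prefix("0X"))?;
    // 3. Radix-parse the unsigned remainder, then flip the sign afterwards;
    //    this relies on sign flips being lossless for floats.
    let magnitude = parse_hex(digits.as_bytes())?;
    Some(if negative { -magnitude } else { magnitude })
}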

An alternative design to support more radix prefixes would be to take a function pointer or something that maps base to a base-indicating u8 ASCII-char.

It's also worth noting that there are languages with base postfixes, like 03h in Intel x86 assembly. Should these be supported as well?

Alexhuszagh commented 3 years ago

@Evrey I like the NumberFormat flags approach, since it's:

  1. Backward compatible.
  2. Not enabled by default (only with feature = "format").
  3. Pretty trivial to implement.

In this case, I'm assuming we'd silently reset the provided radix (which might be a default value) to the one indicated by the prefix. The postfixes sound a little more difficult, but not too tricky, assuming no digit separators are allowed. Since it's mostly ISAs that allow postfixes, and those don't support digit separators, this could be a trivial check.
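
As an illustration of the "silently reset the radix" idea (a sketch only, not lexical-core internals):

fn radix_from_prefix(input: &[u8], default_radix: u8) -> (u8, usize) {
    // Returns the radix implied by a recognised prefix plus the number of
    // bytes to skip, falling back to the caller-provided radix otherwise.
    match input {
        [b'0', b'b' | b'B', ..] => (2, 2),
        [b'0', b'o' | b'O', ..] => (8, 2),
        [b'0', b'd' | b'D', ..] => (10, 2),
        [b'0', b'x' | b'X', ..] => (16, 2),
        _ => (default_radix, 0),
    }
}

A real implementation would also need to respect which prefixes the format flags actually enable.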

Also, this would only be for integers, correct? Or would you like to support floats? I'll have to check all my language scripts to see if any support base-prefixes with floats. See all the examples in here for scripts to check.

Alexhuszagh commented 3 years ago

Actually, for postfixes, we'd have to do it incrementally (which could work easily, due to how we structure our logic), since we first extract the digits of the number, validate the format, and then parse. In short, when a non-digit character appears, it would be easy to check whether that character is a valid postfix.
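
A sketch of that incremental check (hypothetical helper, not the internal API): once the digits have been extracted, a single trailing non-digit byte can be accepted if it matches the configured postfix, case-insensitively.

// Returns the digits without the suffix, plus whether the suffix was present.
fn split_base_suffix(digits: &[u8], suffix: u8) -> (&[u8], bool) {
    match digits.split_last() {
        Some((&last, rest)) if last.eq_ignore_ascii_case(&suffix) => (rest, true),
        _ => (digits, false),
    }
}

For example, split_base_suffix(b"3Fh", b'h') yields (b"3F", true), while split_base_suffix(b"3F", b'h') leaves the input untouched.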

Evrey commented 3 years ago

@Alexhuszagh The language i'm parsing does indeed have base prefixes for floats, too. I use hand-rolled code for integer parsing, because it more cleanly handles these bases, especially given that a delayed sign flip is not lossless for integers.

I know that at least C has number base indicators on floats, too, but it uses a nonsensical format that is incompatible with lexical-core. But i don't C, so… I know more languages that use these on floats exist, but i long forgot which.

I'll look into the example script tomorrow. Tonight is sleep time.

Alexhuszagh commented 3 years ago

The C (and C++17) format for hexadecimal float literals would actually be doable with the new ParseFloatOptions and NumberFormat schemes as designed.

const float f = 0x1ffp10;
const float g = 0x10.1p0;

Note that we wouldn't support float formats like the following, since the f here is a type signifier (although this would work perfectly fine with the partial parser; it just wouldn't be able to deduce the type).

const auto f = .1E4f;

I would need to add support for hex prefixes into NumberFormat, but this would be entirely doable with:

let format = NumberFormat::builder()
    // Add flags here for hex prefixes.
    // We also would need a flag for required exponents, since
    // it's **never** optional.
    .exponent_backup(b'p')
    .build()
    .unwrap();

let options = ParseFloatOptions::builder()
    .format(Some(format))
    .build()
    .unwrap();

And then we'd have a valid parser for hexadecimal float literals in C++17 and C99. We would need one more addition: a way to specify the base for the exponent, since hexadecimal float literals have hex digits for the integral and fractional components but decimal digits for the exponent, and that exponent scales the value by a power of 2.

An example demonstrating this is:

#include <iostream>

int main() {
    std::cout << 0xa.bp10 
        << std::endl
        << 0xa.bp5 
        << std::endl;
    return 0;
}

Which then outputs:

10944    // 0x2ac0, or 10.6875 * 2^10
342        // 0x156, or 10.6875 * 2^5

In short, in 0xa.bp10 we have: the 0x hex prefix, the hex digits a.b for the significand (10.6875), the exponent character p, and the decimal exponent 10, which scales the significand by 2^10.

Lexical already correctly parses 0xa.b as 10.6875; the only issue is specifying a hex literal format for it.

Alexhuszagh commented 3 years ago

So, the likely flags required would be:

So we currently need ~10 bits for just prefixes and postfixes (8 if 0xd is invalid), and 2 for exponent specifications. If so, with the release of v0.8, I'd likely change exponent_default in the ParseFloatOptions and WriteFloatOptions to exponent_decimal, making it clear exponent_backup would then be used for all non-decimal cases.

All of these prefixes and suffixes would have to be case-insensitive. The parser would also have to deal with nonsensical cases, which should be easy since we track state. It would also assume that a prefix and a postfix are mutually exclusive; that is, even if both are individually consistent, 0x3H is not a valid number. For example:

0b1010H  // invalid
0xAH     // invalid

This would also likely make the exponent backup character be used for all floats except those with base 10 (currently it is only used if the radix is too high to represent E). That is, 0b1010.1p5 would then be 10.5 * 2^5, or 336.

I'm going to try to find out if there are any exceptions to these rules, because I'd try to avoid at all costs adding an exponent base to the NumberFormat that takes more than a single bit.

Evrey commented 3 years ago

@Alexhuszagh Yeah, for parsing C++ floats we need two number bases. One for the digits, one for what the exponent actually does. So far, and i'm glad it does, lexical-core makes exponents shift the numbers one digit at a time. It's what makes most sense, after all. In C++, as you mention, hex floats are shifted one bit at a time by their exponents. Or multiplied by 2^exponent. So you get this table, of how a literal is interpreted/parsed:

In Code    | Lexical   | C++
0b10.01^1  | 0b100.1   | N/A
0d10.01^1  | 0d100.1   | 0d100.1
0x10.01^1  | 0x100.1   | 0x20.02

In other words, to support C++'s 0x3.4p7 number format, rust-lexical would need an extra parameter for the format spec. But given that the format spec is gonna be a nice builder struct, at least for sane number formats the exponent base can default to just the given number base. Like cxx_fmt.base(16).exp_base(2) or whatever.
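
To make the table above concrete, here is a plain-Rust sketch (no lexical APIs involved) of applying the same exponent under the two interpretations:

fn apply_exponent(mantissa: f64, exponent: i32, exponent_base: u32) -> f64 {
    mantissa * (exponent_base as f64).powi(exponent)
}

fn main() {
    let m = 16.0 + 1.0 / 256.0;                      // 0x10.01
    assert_eq!(apply_exponent(m, 1, 16), 256.0625);  // 0x100.1: shift by one hex digit
    assert_eq!(apply_exponent(m, 1, 2), 32.0078125); // 0x20.02: multiply by 2^1, C++ style
}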

The left column is the syntax of my programming language, btw., and that ^ for exponents made me write the »plz no global state« issue. =)

(although this would work perfectly fine with the partial parser, it just wouldn't be able to deduce the type)

All a compiler would need is parse at max precision and just return the unparsed rest.

E: Though really a compiler would've already lexed the number before parsing it and thus knows the type before invoking lexical-core.

making it clear exponent_backup would then be used for all non-decimal cases.

This is a difficult one. C++ has no binary float literals, and it only uses the p for exponents, because e is a valid hex digit. So really the exponent_backup is for when the exponent character is in the set of valid digit characters.

Technically one could have a different exponent character per number base and even make just this exponent distinguish the number base. Not that i've seen anyone actually do such a crazy thing. In my programming language, all bases use ^ for the exponent.

I shall further clarify that in my use-case, the base of a number literal is already identified by my lexer, including syntax validity, so if there is an easy/low-effort possibility to fast path lexical-core for when you already know the literal syntax to be correct, adding that to the API would be neat.

Another side note: Programming languages like Agda show that there may be interest in being able to parse Unicode numbers. However, that'd be so way different to normal literal parsing that even if lexical-core would ever do that in the far future, i'd recommend doing so using a different API. What i mean is stuff like: Half-width-no-break-space for digit separation, superscript digits for the exponent, final subscript digits for the number base, exponent digit actually written out with a multiplication sign.

So we currently need ~10 bits for just prefixes and postfixes

So, question… What if the argument to parsing-with-base was actually not just the numerical base, but another bit-packed struct? This is what i propose:

Have a base format builder with these settings:

  - Numerical digit base, e.g. 12 for duodecimal.
  - Optional exponent base for C++, where it differs from the digit base.
  - Exponent character.
  - Optional prefix base indicator character.
  - Optional postfix base indicator character.

The main format struct then implements just a subset of this for the default base-10 case.
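
A hypothetical sketch of what such a per-base format could look like; every name here is invented for illustration and is not lexical-core's API:

use std::num::NonZeroU8;

#[derive(Clone, Copy)]
struct BaseFormat {
    digit_base: u8,                   // e.g. 12 for duodecimal
    exponent_base: Option<NonZeroU8>, // None = same as digit_base; 2 for C++ hex floats
    exponent: u8,                     // exponent character, e.g. b'p' or b'^'
    base_prefix: Option<NonZeroU8>,   // e.g. b'x' for 0x-prefixed literals
    base_suffix: Option<NonZeroU8>,   // e.g. b'h' for Intel-style 03h
}

// How a C++17 hex float literal such as 0x1.8p3 might be described:
const CXX_HEX_FLOAT: BaseFormat = BaseFormat {
    digit_base: 16,
    exponent_base: NonZeroU8::new(2),
    exponent: b'p',
    base_prefix: NonZeroU8::new(b'x'),
    base_suffix: None,
};

With exponent_base left unset, the "exponent shifts by one digit of the number base" behaviour from the table above would apply.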

This has a ton of advantages:

I'm going to try to find out if there's any exceptions I can find to these rules

Here's what i know:

Alexhuszagh commented 3 years ago

»plz no global state« issue. =)

Global state is bad to begin with; it was just a hack for performance and simple APIs at the time, and it was a bad idea. Thanks for bringing it up again.

All a compiler would need is parse at max precision and just return the unparsed rest.

Yeah, sounds like a great idea. Also, C++'s user-defined literals opens a whole new can of worms I'm not eager to get into.

E: Though really a compiler would've already lexed the number before parsing it and thus knows the type before invoking lexical-core

Fair enough, and I'd assume this would be generally true.

This is a difficult one. C++ has no binary float literals, and it only uses the p for exponents, because e is a valid hex digit. So really the exponent_backup is for when the exponent character is in the set of valid digit characters.

Very true, so then we could just remove exponent_backup and only have exponent. This would save a lot of packed bits too.

Programming languages like Agda show that there may be interest in being able to parse Unicode numbers. However, that'd be so way different to normal literal parsing that even if lexical-core would ever do that in the far future, i'd recommend doing so using a different API. What i mean is stuff like: Half-width-no-break-space for digit separation, superscript digits for the exponent, final subscript digits for the number base, exponent digit actually written out with a multiplication sign.

Yeah, I'm not eager to add Unicode classification to the parser, as that would destroy performance. Although needed for correctness in a lot of cases, it's such a marginal case in float parsing that it's not worth the tradeoffs. I do have an external library (originally written for integration into serde-json) that allows you to provide a custom iterator and feed that into lexical, allowing you to provide any parsing logic you want over the input. It currently does not support custom radixes, but that would be easy to add (I'd likely do it in another library, since that one really focuses on a single codebase with no features, for blazing-fast compile times).

We could use a similar concept for Unicode numbers, with characters like U+200C and U+200D being valid digit joiners or whatever, since it would be trivial for a custom iterator either to stop when it encounters such a digit or skip over them.
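
A tiny sketch of that skipping idea using plain std iterators (not the external crate mentioned above):

fn skip_joiners(input: &str) -> impl Iterator<Item = char> + '_ {
    // Drop zero-width joiners before the characters reach the digit parser.
    input.chars().filter(|&c| c != '\u{200C}' && c != '\u{200D}')
}

fn main() {
    let cleaned: String = skip_joiners("1\u{200D}234").collect();
    assert_eq!(cleaned, "1234");
}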

Have a base format builder with these settings:

  - Numerical digit base, e.g. 12 for duodecimal.
  - Optional exponent base for C++, where it differs from the digit base.
  - Exponent character.
  - Optional prefix base indicator character.
  - Optional postfix base indicator character.

This sounds... really clean and doable. I really like this idea.

Though really, base-12 has another issue, as traditionally the digits after 9 aren't AB, but XE/TE/somegreekstuff. But that's a whole other can of worms

Yeah, this wouldn't be doable currently, and I'm not planning on supporting it, for obvious reasons. Internally, we simplify base conversions by using a single digit-to-integer table, which is then used to parse digits. Allowing a custom vocabulary would either require passing around a reference to a pre-computed table (ok, I guess, especially if it only works over the [0x30-0x5A] range and is case-insensitive) or a function pointer (absolutely not). Greek characters would therefore absolutely be out of the question, but using an adaptation of minimal lexical above with a mapping iterator would be quite nice.

https://github.com/Alexhuszagh/rust-lexical/blob/master/lexical-core/src/util/digit.rs#L7-L38

Essentially, you could have an iterator that lazily maps all the digits (as bytes or as characters) to return the expected digits (as bytes). This would allow supporting Greek characters in duodecimal notation without actually requiring any complex changes internally. It might be possible to do this in lexical proper, since to support digit separators internally we map everything to iterators anyway, but we currently use an unusual way to track the number of bytes parsed to simplify our internal logic, which would require modification to support arbitrary iterators from unknown sources.
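
For instance, a mapping iterator that lazily translates a duodecimal X/E vocabulary into the byte digits the core parser already understands could look roughly like this (illustrative only, not lexical's internals):

fn map_duodecimal(input: &str) -> impl Iterator<Item = u8> + '_ {
    input.bytes().map(|b| match b {
        b'X' | b'x' | b'T' | b't' => b'a', // ten
        b'E' | b'e' => b'b',               // eleven
        other => other,
    })
}

fn main() {
    let mapped: Vec<u8> = map_duodecimal("1X.E").collect();
    assert_eq!(mapped, b"1a.b".to_vec());
}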

The »backup exponent« is now no longer needed, because each base format can specify its own exponent character.

I really like this part, and this is a great idea.

It leaves enough extra encoding space to add a single bit that decides whether simultaneously having base prefixes and postfixes in the same literal is legal. The fewer assumptions lexical-core hard-codes, the more attractive it is for any parser creator out there. (And again, in my case the legality of this is already decided by the lexer, so i'd just tell lexical-core that i don't care.)

Removing the exponent backup character gives us a lot more places to add bits, so absolutely, this sounds doable.

What if the argument to parsing-with-base was actually not just the numerical base, but another bit-packed struct? This is what i propose:

That sounds... like a good idea. It would mean the only actual character left in the format specifier would therefore be the digit separator, which sounds normal, and then another bit-packed struct with the following layout:

For each of these, \x00 (NUL) would be the invalid character, since it is invalid ASCII. This would require ~60 bits using 8 bits per character, and ~50 bits if we pack it even closer. This... looks good. Any suggestions?

For the decimal point, see #58 for why.

As for potentially supporting more esoteric duodecimal and other notation, I would have to look at seeing if we can track the number of digits written in a way suitable for iterators while keeping the code internally clean and performant.

Alexhuszagh commented 3 years ago

As for the exponent base and exponent radix, they would be stored effectively as non-zero values, where 0 means "unset" (basically, Option<NonZeroU8>, just packed); if the values aren't provided, we default to the radix (making them optional).
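
A quick illustration of why Option<NonZeroU8> packs so well here: the forbidden zero value becomes the "unset" sentinel for free.

use std::num::NonZeroU8;

fn main() {
    // The niche left by forbidding 0 lets None occupy the zero byte,
    // so the packed field needs no extra "is set" bit.
    assert_eq!(std::mem::size_of::<Option<NonZeroU8>>(), 1);

    let unset: Option<NonZeroU8> = NonZeroU8::new(0); // 0 means "default to the radix"
    let exponent_base = NonZeroU8::new(2);            // an explicit base-2 exponent
    assert!(unset.is_none());
    assert_eq!(exponent_base.map(NonZeroU8::get), Some(2));
}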

Evrey commented 3 years ago

Also, C++'s user-defined literals opens a whole new can of worms I'm not eager to get into.

For lexical-core it just means parse at max precision and return the rest. =) When you apply a user-defined literal, your literal code has to manually handle either the text form of your float, or a pre-parsed float or double or what have you.

I do have an external library […] that allows you to provide a custom iterator and then feed that into lexical

That's a really interesting feature to have! I have no use for it, but that'd indeed make stuff like Unicode-conforming scientific number parsing with lexical-core possible, and would thus make lexical-core attractive for more scenarios, like parsing locale-aware number inputs of GUI applications.

Allowing custom vocabulary would either require passing around a reference to a pre-computed table

It wouldn't be impossible to generate such a table in a const fn given an input alphabet. But i do agree that this approach is better:

but using an adaptation of minimal lexical above with a mapping iterator would be quite nice.

This also allows redundant digits, where desired. Like allowing all of XE/TE/greekstuff/AB to mean 10 and 11 respectively. Or allowing Japanese Kanji as well as Indo-Arabic digits in I18N situations.
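
Tying the const fn remark and the redundant-digit idea together, a sketch of a compile-time digit table (names invented for illustration; this is not lexical-core's actual table, and case handling is omitted for brevity):

const fn duodecimal_table() -> [u8; 256] {
    let mut table = [0xFF; 256]; // 0xFF marks "not a digit"
    let mut i = 0u8;
    while i < 10 {
        table[(b'0' + i) as usize] = i;
        i += 1;
    }
    // Redundant spellings for ten and eleven: A/X/T and B/E are all accepted.
    table[b'A' as usize] = 10;
    table[b'X' as usize] = 10;
    table[b'T' as usize] = 10;
    table[b'B' as usize] = 11;
    table[b'E' as usize] = 11;
    table
}

const DUODECIMAL_DIGITS: [u8; 256] = duodecimal_table();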

The only question is: Can lexical-core potentially handle negative digits for balanced ternary? Not that i need that or that i ever saw anyone asking for it, but you know, if lexical-core could pretty much handle almost every sane number format imaginable in this fashion, then why the heck would anyone want to use anything but lexical-core anymore for parsing and printing numbers.

and then another bit-packed struct with the following layout:

This sounds very reasonable and quite feature-complete. Good thinking with the separate exponent radix.

For each of these, \x00 (NUL) would be the invalid character, since it is invalid ASCII.

That's actually factually incorrect, though practically »true-ish enough«. Originally, NUL is a NOP for teletypes. You can pad data with that and it's ignored. C made that »invalid«. ASCII has no invalid character, Unicode has a set of non-characters which an application can utilise for any internal purposes it wants to, and UTF-8 has truly invalid bytes 0xC0 and 0xC1. That said, using NUL is Good Enough™. If you did ASCII-only, but on 8-bit characters, then anything past 0x7F would be truly illegal. Assuming you interpret that as 7-bit-ASCII-packed-in-8-bits, not as in 8-bit ASCII where the top bit is a parity bit. In any case, 0xFE would do for 8-bit.

That's a long way of saying: »Ehh, not exactly, but NUL makes most sense for the implementation, given NonZero types etc.«

This... looks good. Any suggestions?

Apart from that nit pick? Nah, sounds really good, indeed.

As for potentially supporting more esoteric duodecimal and other notation […]

I see them as not strictly necessary nice-to-haves that would make lexical-core even more attractive for a bigger set of potential users. But not supporting these would make it no less awesome. So adaptations for that can imho comfortably sit on the »maybe, if spare time« pile.

Alexhuszagh commented 3 years ago

This has been tentatively implemented as of https://github.com/Alexhuszagh/rust-lexical-experimental/commit/e062276cbf3c8fead353a6f67213bad5c772afe4. It passes all the unittests I've thrown at it so far, including Miri and Valgrind.

I'm currently waiting on fuzz results, then will apply the experimental branch on top of the current lexical branch, and we should have this release in the next week.

Here's a code sample of it:

const FORMAT: u128 = NumberFormatBuilder::new()
    .base_prefix(num::NonZeroU8::new(b'x'))
    .base_suffix(num::NonZeroU8::new(b'h'))
    .build();
let options = Options::new();

i32::from_lexical_with_options::<FORMAT>(b"+3h", &options);     // Ok(3)
i32::from_lexical_with_options::<FORMAT>(b"+0x3", &options);    // Ok(3)
i32::from_lexical_with_options::<FORMAT>(b"+0x3h", &options);   // Ok(3)
i32::from_lexical_with_options::<FORMAT>(b"+0x3h ", &options);  // Err(InvalidDigit)
i32::from_lexical_with_options::<FORMAT>(b"+0xh", &options);    // Err(InvalidDigit)
i32::from_lexical_with_options::<FORMAT>(b"+h", &options);      // Err(InvalidDigit)
i32::from_lexical_with_options::<FORMAT>(b"+0x", &options);     // Err(Empty)

It currently supports both floats and integers, and the checks to skip/ignore base prefixes and suffixes are only done if the features are specified in the format packed struct.

Alexhuszagh commented 3 years ago

Implemented as of lexical v6.0.0 and lexical-core v0.8.0, using the API above.