Closed Evrey closed 3 years ago
@Evrey I like the NumberFormat
flags approach, since it's:
feature = "format"
).In this case, I'm assuming we'd silently reset the provided radix (which might be a default value) to the one shown by the prefix. The post-fixes sound a little more difficult, but not too tricky, assuming that no digit separators are allowed. Since I'm assuming it's only ISAs that allow postfixes, which do not support digit separators, this could be a trivial check.
Also, this would only be for integers, correct? Or would you like to support floats? I'll have to check all my language scripts to see if any support base-prefixes with floats. See all the examples in here for scripts to check.
Actually, for postfixes, we'd have to do it incrementally (which could work, easily, due to how we structure our logic) since we first extract the digits of the number, validate the format, then parse. In short, it would be easy to check if we have a non-digit character that appears, if that digit is a valid postfix.
@Alexhuszagh The language i'm parsing does indeed have base prefixes for floats, too. I use hand-rolled code for integer parsing, because it more cleanly handles these bases, especially given that a delayed sign flip is not lossless for integers.
I know of at least C to have number base indicators on floats, too, but it uses a nonsensical format that is incompatible to lexical-core
. But i don't C, so… I know more languages using these on floats exist, but i long forgot which.
I'll look into the example script tomorrow. Tonight is sleep time.
The C format (and C++17) format for hexadecimal literals would actually be doable with the new ParseFloatOptions and the NumberFormat schemes designed.
const float f = 0x1ffp10;
const float g = 0x10.1p0;
Note that we wouldn't supported float formats like, since the f
here is a type signifier (although this would work perfectly fine with the partial parser, it just wouldn't be able to deduce the type).
const auto f = .1E4f;
I would need to add support for hex prefixes into NumberFormat, but this would be entirely doable with:
let format = NumberFormat.builder()
// Add flags here for hex prefixes.
// We also would need a flag for required exponents, since
// it's **never** optional.
.exponent_backup(b'p')
.build()
.unwrap();
let options = ParseFloatOptions::builder()
.format(Some(format))
.build()
.unwrap();
And then we'd have a valid parser for float hexadecimal literals in C++17 and C99. We would need one more addition: a way to specify the base for the exponent: hexadecimal float literals have hex digits for the integral and fractional components, and then decimal digits for the exponent, which signifies an exponent with base 2.
An example demonstrating this is:
#include <iostream>
int main() {
std::cout << 0xa.bp10
<< std::endl
<< 0xa.bp5
<< std::endl;
return 0;
}
Which then outputs:
10944 // 0x2ac0, or 10.6875 * 2^10
342 // 0x156, or 10.6875 * 2^5
In short, we have in 0xa.bp10
:
0xa
, or 10
0xb
or 11
p10
, or 2^10
Lexical already correctly parses 0xa.b
as 10.6875
, the only issue is specifying a hex literal format for it.
So, the likely flags required would be:
0b
prefix (integer, float)0o
prefix (integer, float)0d
prefix (integer, float)?0x
prefix (integer, float)H
postfix (integer, float)So we currently need ~10 bits for just prefixes and postfixes (8 if 0xd
is invalid), and 2 for exponent specifications. If so, with the release of v0.8, I'd likely change exponent_default
in the ParseFloatOptions
and WriteFloatOptions
to exponent_decimal
, making it clear exponent_backup
would then be used for all non-decimal cases.
All of these prefixes and suffixes would have to be case-insensitive. It would also have to deal with non-sensical cases, which should be easy due to use tracking state. It would also assume that a prefix and a postfix would be mutually exclusive, that is, even if both are consistent, 0x3H
is not a valid number. For example:
0b1010H // invalid
0xAH // invalid
This would also likely make the exponent backup character be used for all floats except those with base 10 (currently it is only used if the radix is too high to represent E
). That is, 0b1010.1p5
would then be 10.5 * 2^5
, or 336
.
I'm going to try to find out if there's any exceptions I can find to these rules, because I'd try to avoid at all costs adding a exponent base to the NumberFormat
that takes more than a single bit.
@Alexhuszagh Yeah, for parsing C++ floats we need two number bases. One for the digits, one for what the exponent actually does. So far, and i'm glad it does, lexical-core
makes exponents shift the numbers one digit at a time. It's what makes most sense, after all. In C++, as you mention, hex floats are shifted one bit at a time by their exponents. Or multiplied by 2^exponent. So you get this table, of how a literal is interpreted/parsed:
In Code | Lexical | C++ |
---|---|---|
0b10.01^1 |
0b100.1 |
N/A |
0d10.01^1 |
0d100.1 |
0d100.1 |
0x10.01^1 |
0x100.1 |
0x20.02 |
In other words, to support C++'s 0x3.4p7
number format, rust-lexical
would need an extra parameter for the format spec. But given that the format spec is gonna be a nice builder struct, at least for sane number formats the exponent base can default to just the given number base. Like cxx_fmt.base(16).exp_base(2)
or whatever.
The left column is the syntax of my programming language, btw., and that ^
for exponents made me write the »plz no global state« issue. =)
(although this would work perfectly fine with the partial parser, it just wouldn't be able to deduce the type)
All a compiler would need is parse at max precision and just return the unparsed rest.
E: Though really a compiler would've already lexed the number before parsing it and thus knows the type before invoking lexical-core
.
making it clear
exponent_backup
would then be used for all non-decimal cases.
This is a difficult one. C++ has no binary float literals, and it only uses the p
for exponents, because e
is a valid hex digit. So really the exponent_backup
is for when the exponent
character is in the set of valid digit characters.
Technically one could have a different exponent character per number base and even make just this exponent distinguish the number base. Not that i've seen anyone actually do such a crazy thing. In my programming language, all bases use ^
for the exponent.
I shall further clarify that in my use-case, the base of a number literal is already identified by my lexer, including syntax validity, so if there is an easy/low-effort possibility to fast path lexical-core
for when you already know the literal syntax to be correct, adding that to the API would be neat.
Another side note: Programming languages like Agda show that there may be interest in being able to parse Unicode numbers. However, that'd be so way different to normal literal parsing that even if lexical-core
would ever do that in the far future, i'd recommend doing so using a different API. What i mean is stuff like: Half-width-no-break-space for digit separation, superscript digits for the exponent, final subscript digits for the number base, exponent digit actually written out with a multiplication sign.
So we currently need ~10 bits for just prefixes and postfixes
So, question… What if the argument to parsing-with-base was actually not just the numerical base, but another bit-packed struct? This is what i propose:
Have a base format builder with these settings:
The main format struct then implements just a subset of this for the default base-10 case.
This has a ton of advantages:
012
, and there's balanced ternary, where the digits are 0±1
, or 01i
or 0pn
or Wikipedia's 01T
or whatever one names them. Put another way: There may be demand for another base-dependent parsing function that takes e.g. a &[char]
or &[u8]
to name the digits of a number system, plus an offset where the zero digit is. Or for speed a function pointer that turns any char
into an Option<i8>
for the digit, None
indicating an error.lexical-core
hard-codes, the more attractive it is for any parser creator out there. (And again, in my case the legality of this is already decided by the lexer, so i'd just tell lexical-core
that i don't care.)I'm going to try to find out if there's any exceptions I can find to these rules
Here's what i know:
lexical-core
can just safely go: »I don't care, i'm just figuring out whether to skip this.«0
prefix to indicate octal. This could be handled as a special case when setting the base prefix character to 0
. Though on the other hand, if the parser already figured out that we'll be parsing base-8 here, to lexical-core
this 0
prefix can just be a leading zero. So no/optional prefix for octal.»plz no global state« issue. =)
Global state is bad to begin with, it was just a hack for performance and simple APIs at the time and it was a bad idea. Thanks for bringing it up again.
All a compiler would need is parse at max precision and just return the unparsed rest.
Yeah, sounds like a great idea. Also, C++'s user-defined literals opens a whole new can of worms I'm not eager to get into.
E: Though really a compiler would've already lexed the number before parsing it and thus knows the type before invoking lexical-core
Fair enough, and I'd assume this would be generally true.
This is a difficult one. C++ has no binary float literals, and it only uses the p for exponents, because e is a valid hex digit. So really the exponent_backup is for when the exponent character is in the set of valid digit characters.
Very true, so then we could just remove exponent_backup
and only have exponent
. This would save a lot of packed bits too.
Programming languages like Agda show that there may be interest in being able to parse Unicode numbers. However, that'd be so way different to normal literal parsing that even if lexical-core would ever do that in the far future, i'd recommend doing so using a different API. What i mean is stuff like: Half-width-no-break-space for digit separation, superscript digits for the exponent, final subscript digits for the number base, exponent digit actually written out with a multiplication sign.
Yeah, I'm not eager to add Unicode classification into the parser, as that would destroy performance. Although needed for correctness in a lot of cases, it's such a marginal case in float parsing it's not worth the tradeoffs. I do have an external library (that was originally written for integration into serde-json) that allows you to provide a custom iterator and then feed that into lexical, allowing you to provide any parsing logic you want as long as the input is. It currently does not support custom radixes, but that could be easily amenable (I'd likely do that in another library, since this really focuses on a single codebase with no features for blazing fast compile times).
We could use a similar concept for Unicode numbers, with characters like U+200C and U+200D being valid digit joiners or whatever, since it would be trivial for a custom iterator either to stop when it encounters such a digit or skip over them.
Have a base format builder with these settings:
Numerical digit base, e.g. 12 for duodecimal. Optional exponent base for C++, where it differs from the digit base. Exponent character. Optional prefix base indicator character. Optional postfix base indicator character.
This sounds... really clean and doable. I really like this idea.
Though really, base-12 has another issue, as traditionally the digits after 9 aren't AB, but XE/TE/somegreekstuff. But that's a whole other can of worms
Yeah, this wouldn't be doable currently and I'm not planning on supporting it, for obvious reasons. Internally, we simplify base conversions by using a single digit-to-integer table, which then is used to parse digits. Allowing custom vocabulary would either require passing around a reference to a pre-computed table (ok, I guess, especially if it only works [0x30-0x5A]
range and is case-insensitive) or a function pointer (absolutely not). Greek characters would therefore absolutely be out of the question, but using an adaptation of minimal lexical above with a mapping iterator would be quite nice.
https://github.com/Alexhuszagh/rust-lexical/blob/master/lexical-core/src/util/digit.rs#L7-L38
Essentially, you could have an iterator that then maps lazily all the digits (as bytes or as characters) to return the expected digits (as bytes). This would allow supporting Greek characters in duodecimal notation, without actually requiring any complex changes internally. It might be possible to do this in lexical proper, since to support digit separators internally we map everything to iterators anyway, but we currently use an unusual way to track the number of bytes parsed to simplify our internal logic, which would require modification to support arbitrary iterators from unknown sources.
The »backup exponent« is now no longer needed, because each base format can specify its own exponent character.
I really like this part, and this is a great idea.
It leaves enough extra encoding space to add a single bit that decides whether simultaneously having base prefixes and postfixes in the same literal is legal. The fewer assumptions lexical-core hard-codes, the more attractive it is for any parser creator out there. (And again, in my case the legality of this is already decided by the lexer, so i'd just tell lexical-core that i don't care.)
Removing the exponent backup character gives us a lot more places to add bits, so absolutely, this sounds doable.
What if the argument to parsing-with-base was actually not just the numerical base, but another bit-packed struct? This is what i propose:
That sounds... like a good idea. It would mean the only actual character left in the format specifier would therefore be the digit separator, which sounds normal, and then another bit-packed struct with the following layout:
For each of these, \x00
(NUL) would be the invalid character, since it is invalid ASCII. This would require ~60 bits using 8 bits per character, and ~50 bits if we pack it even closer. This... looks good. Any suggestions?
For the decimal point, see #58 for why.
As for potentially supporting more esoteric duodecimal and other notation, I would have to look at seeing if we can track the number of digits written in a way suitable for iterators while keeping the code internally clean and performant.
As far as the character for the exponent base and exponent radix, they would be stored effectively as a non-zero value, where 0 means "unset" (basically, Option<NonZeroU8>
, just packed), so then if the values aren't provided, we default to the radix (making it optional).
Also, C++'s user-defined literals opens a whole new can of worms I'm not eager to get into.
For lexical-core
it just means parse at max precision and return the rest. =) When you apply a user-defined literal, your literal code has to manually handle either the text form of your float, or a pre-parsed float
or double
or what have you.
I do have an external library […] that allows you to provide a custom iterator and then feed that into lexical
That's a really interesting feature to have! I have no use for it, but that'd indeed make stuff like Unicode-conforming scientific number parsing with lexical-core
possible, and would thus make lexical-core
attractive for more scenarios, like parsing locale-aware number inputs of GUI applications.
Allowing custom vocabulary would either require passing around a reference to a pre-computed table
It wouldn't be impossible to generate such a table in a const fn
given an input alphabet. But i do agree that this approach is better:
but using an adaptation of minimal lexical above with a mapping iterator would be quite nice.
This also allows redundant digits, where desired. Like allowing all, XE/TE/greekstuff/AB to mean 10
and 11
respectively. Or allowing Japanese Kanji as well as indo-arabic digits in I18N situations.
The only question is: Can lexical-core
potentially handle negative digits for balanced ternary? Not that i need that or that i ever saw anyone asking for it, but you know, if lexical-core
could pretty much handle almost every sane number format imaginable in this fashion, then why the heck would anyone want to use anything but lexical-core
anymore for parsing and printing numbers.
and then another bit-packed struct with the following layout:
This sounds very reasonable and quite feature-complete. Good thinking with the separate exponent radix.
For each of these,
\x00
(NUL) would be the invalid character, since it is invalid ASCII.
That's actually factually incorrect, though practically »true-ish enough«. Originally, NUL
is a NOP
for teletypes. You can pad data with that and it's ignored. C made that »invalid«. ASCII has no invalid character, Unicode has a set of non-characters which an application can utilise for any internal purposes it wants to, and UTF-8 has truly invalid bytes 0xC0
and 0xC1
. That said, using NUL
is Good Enough™. If you did ASCII-only, but on 8-bit characters, then anything past 0x7F
would be truly illegal. Assuming you interpret that as 7-bit-ASCII-packed-in-8-bits, not as in 8-bit ASCII where the top bit is a parity bit. In any case, 0xFE
would do for 8-bit.
That's a long way of saying: »Ehh, not exactly, but NUL makes most sense for the implementation, given NonZero
types etc.«
This... looks good. Any suggestions?
Apart from that nit pick? Nah, sounds really good, indeed.
As for potentially supporting more esoteric duodecimal and other notation […]
I see them as not strictly necessary nice-to-haves that would make lexical-core
even more attractive for a bigger set of potential users. But not supporting these would make it no less awesome. So adaptations for that can imho comfortably sit on the »maybe, if spare time« pile.
This has been tentatively implemented as of https://github.com/Alexhuszagh/rust-lexical-experimental/commit/e062276cbf3c8fead353a6f67213bad5c772afe4. It passes all the unittests I've thrown at it so far, including Miri and Valgrind.
I'm currently waiting on fuzz results, then will apply the experimental branch on top of the current lexical branch, and we should have this release in the next week.
Here's a code sample of it:
const FORMAT: u128 = NumberFormatBuilder::new()
.base_prefix(num::NonZeroU8::new(b'x'))
.base_suffix(num::NonZeroU8::new(b'h'))
.build();
let options = Options::new();
i32::from_lexical_with_options::<FORMAT>(b"+3h", &options); // Ok(3)
i32::from_lexical_with_options::<FORMAT>(b"+0x3", &options); // Ok(3)
i32::from_lexical_with_options::<FORMAT>(b"+0x3h", &options); // OK(3)
i32::from_lexical_with_options::<FORMAT>(b"+0x3h ", &options); // Err(InvalidDigit)
i32::from_lexical_with_options::<FORMAT>(b"+0xh", &options); // Err(InvalidDigit)
i32::from_lexical_with_options::<FORMAT>(b"+h", &options); // Err(InvalidDigit)
i32::from_lexical_with_options::<FORMAT>(b"+0x", &options); // Err(Empty)
It currently supports both floats and integers, and the checks to skip/ignore base prefixes and suffixes are only done if the features are specified in the format packed struct.
Implemented as of lexical v6.0.0 and lexical-core v0.8.0, using the API above.
Problem
With
format
andradix
enabled, I can e.g. parse hexadecimal floats using my proglang-specific syntax. However, that syntax includes base prefixes, in this case0x
. To make matters worse, base prefixes usually appear between the sign and the integer digits of a number literal.Solution
0b
,0o
,0
,0d
,0x
, as well as upper-case variants, are all common base prefixes in programming languages with0b
for base-2,0o
for base-8,0
— as in leading zero — as a terrible way of saying0o
,0d
as an optional base-10 for the sake of symmetry, and0x
for base-16 numbers. I suggest ignoring leading0
to mean base-8. That's just terrible, a source of countless bugs, and should thus be up to the user to work around.While pretty much any radix is possible, I suggest only handling these four bases. I don't know of any common radices for others.
The following extensions to the
format
bit-packed config should be made:0b
,0o
,0d
,0x
.0d
.b'd'
. If only upper-case was allowed in a format, this would beb'D'
. Leading0
is implied. If the format of the current radix has optional base-indicators, then all leading zeros behave normally.This leaves 2 bytes and 4 bits reserved when using a
u128
or a secondu64
for the format settings.Prerequisites
0.7.*
correct
,format
,radix
Alternatives
Currently I check the sign myself, memorise it, then skip sign and base prefix to radix-parse the number literal I got. This abuses the fact that flipping the sign of a float is a lossless operation. However, it's annoying and unergonomic.
An alternative design to support more radix prefixes would be to take a function pointer or something that maps
base
to a base-indicatingu8
ASCII-char.It's also worth noting that there are languages with base postfixes, like
03h
in Intel x86 assembly. Should these be supported as well?