dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.97k stars 4.66k forks source link

Support JsonNumberHandling in JsonElement & JsonNode #29152

Open a11e99z opened 5 years ago

a11e99z commented 5 years ago

many JSON looks like this: {"num":"123","str":"hello"} Number can be in quotes too

JsonElement.GetInt64() throws Exception that "we need Number but we have a String". I am sure that in quotes we have a Number, so code (GetInt64 and others) should ignore quotes and try parse number. and only in case when cannot parse Number(invalid chars) throw some Exception

UPD Oops! I used unclear title. better ..JsonElement should parse Numbers in qoutes

SUMMATION: 1) GetInt64() and etc should parse Numbers(long,uint,decimal..) for String elements: ignore first and last quotes. 2) add .AsSpan() method to JsonElement for cases when we want to do something with elements manually. we cannot access to element data and do something useful or work around with some problem (.GetString(), .ToString(), .GetRawString() - stress GC and they are not solution)

a11e99z commented 5 years ago

in case that u suppose solution long.Parse( elem.GetString() ) - well, we lose all the advantage of not allocating memory. JsonDocument loose any advantages in case when JSON has many numbers(long,int,decimal) in quotes and better to use Json.NET.

a11e99z commented 5 years ago

2) I can parse Number in quotes by myself but I cannot to get span of element or JSON-position (like in Utf8JsonReader.Position) so, add method to JsonElement.AsSpan() or something. It simple and let us to get around a lot of problems cause we can do something with the elements manually.

ahsonkhan commented 5 years ago

cc @bartonjs

bartonjs commented 5 years ago

add method to JsonElement.AsSpan() or something.

That's something we're explicitly keeping out of the API. JsonDocument and JsonElement can apply over UTF-8 or UTF-16 data, exposing the span removes that abstraction.

Number can be in quotes too

The easy question is "why is the number in a string instead of just being a number?". Once strings start being parsed there become questions of acceptable formats, and culture sensitivity, et cetera. e.g. the number "123,456" is either bigger than 1000 (en-US) or between 100 and 1000 (en-UK).

If I understand the flow correctly, JValue's int-conversion only uses the "N" format, removing the culture and format problems.

I don't suppose you have any sort of data suggesting how popular it is to transmit numbers as JSON strings instead of JSON numbers?

in case that u suppose solution long.Parse( elem.GetString() ) - well, we lose all the advantage of not allocating memory.

It's still significantly lower allocation than the equivalent in JValue, since this would allocate the number-string into gen0, you'd parse it, you'd lose the reference, and it'll avoid getting promoted to gen1. In JsonTextReader to JValue the (sub)string got allocated during document traversal (probably getting promoted to gen1) along with lots of other short strings, and then much later is parsed (while still staying as a string held alive by the JValue).

a11e99z commented 5 years ago

sometimes you've got data from server where u cannot change anything (Bitcoin exchange Binance for example). // timestamp: no quotes. price change: have quotes [{"e":"24hrTicker","E":1554206291885,"s":"ETHBTC","p":"-0.00230300"... and many more numbers with quotes. API description https://github.com/binance-exchange/binance-official-api-docs/blob/master/web-socket-streams.md

well, we have data and we should to do something with this. when objects dont allow do something useful - such objects will not be used. we live in the world where have many colors not just black and white. so, my request is not to do worse, is to do better with things that we have. all job that needed is couple lines of code. (I am not sure for that coz many method invocations inside)

write now my codegenerator looks like (used Roslyn Scripting): cod.AppendLine( $"if (nod.TryGetProperty( \"{fld.Name}\", out prop )) " ); if (ftyp == typeof( sbyte ) || ftyp == typeof( short ) || ftyp == typeof( int ) || ftyp == typeof( long )) cod.AppendLine( $" res.{fld.Name} = ( System.{ftyp.Name} )(prop.Type == System.Text.Json.JsonValueType.String ? long.Parse( prop.GetString()) : prop.GetInt64());" ); // long.Parse( prop.GetString()) - "failed" use object.. better to use Json.NET and loose couple of milliseconds. last line instead that I expected to work well (line where I dont expect Exception) cod.AppendLine( $" res.{fld.Name} = ( System.{ftyp.Name} )prop.GetInt64());" );

with prop.AsSpan() I can (try to) parse span manually without any allocation. with prop.GetInt64() that parses Numbers in quotes I have simple working code without any allocation.

well, GetXXX() that parses numbers in quotes solve my problem and dont need add AsSpan(). but AsSpan() will allow to use some hacks/optimization/access to JSON-element data in scenarios that we cant see right now.

my point is that we have numbers inside quotes in JSON and we can't change this. so we need some methods that allow us to parse those numbers without any string allocations. I dont know (dont care for sure) how to do this, maybe change current methods maybe add new ones GetStringAsInt64(), xxxUInt64, xxxDouble, xxxDecimal will be enough, no need for float, byte, int, ushort..

a11e99z commented 5 years ago

OFFTOPIC:

well, I can try to use Utf8JsonReader for my task and (I didnt try yet but IMO) it will increase my code twice and harder code-logic maybe 5 times.

maybe later I will want my code even faster and rewrite it (or will hire u to rewrite it :) ) for Utf8JsonReader but now I want simple MVP. next level will be native codegeneration (Roslyn, LLVM..), next - FPGA & HFT..

in trading you dont need any allocations even in gen0 or Edens. u just cant stop (all) thread(s) for 10-50ms.

we can save 5ms in one type message from exchange vs Json.NET.. we have 10 message types, 100 instruments/securities, 10 another exchanges with 5 of them with numbers in quotes.. so u can save for 1second of real time >1second of CPU (many threads) of parse/GCcollect work that we can to spend to more useful work for real problem solution but not to fight with developer tasks.. cent saves the dollar.. millisecond saves the second

Gnbrkm41 commented 5 years ago

I don't suppose you have any sort of data suggesting how popular it is to transmit numbers as JSON strings instead of JSON numbers?

+1, I'm no JSON expert, but from what I can see I'm not sure if it is a common scenario to serialise numbers as string type. https://stackoverflow.com/a/15368259/10484567 A little bit of search seem to suggest that it's not a number once it's been quoted.

I'm against adding supports for non-standard usages of common protocols. I think the benefits of implementing this in BCL are not worth the amount of efforts needed to do so.

Gnbrkm41 commented 5 years ago

@bartonjs by "between 100 and 1000 (en-UK)", I believe you've meant 'en-GB' 😀. The UK also uses full stop as decimal points, so '123,456' would be "one hundred and twenty three thousand, four hundred and fifty six". The point still holds in majority of European countries though. (e.g. 'fr-FR', 'de-DE'...)

Gnbrkm41 commented 5 years ago

dotnet/corefx#36639 Could this solve your problem (if implemented)?

bartonjs commented 5 years ago

by "between 100 and 1000 (en-UK)", I believe you've meant 'en-GB' grinning

Yep, oops 🤭. Though, looking at an old culture document I built it looks like en-UK doesn't (didn't?) exist, and en-GB used comma-and-period the same as en-US. de-DE, OTOH, reverses them (like I thought "en-UK" did). -- But the write time on that doc is 2009, so maybe things have changed in the last 10 years :smile:.

Wildenhaus commented 5 years ago

... from what I can see I'm not sure if it is a common scenario to serialise numbers as string type ...

When working with financial data, it is an industry standard to stringify numeric types in order to avoid rounding errors. In fact, this is part of the reason that the Decimal type has been introduced.

It would make perfect sense for Utf8JsonReader to at the very least allow the parsing of strings when calling GetDecimal, if not for other floating point types as well.

bartonjs commented 5 years ago

When working with financial data, it is an industry standard to stringify numeric types in order to avoid rounding errors

But JSON numbers are already stringified. The question is, does the financial industry send JSON like

{ "amount": "18.13" } or { "amount": 18.13 }. The former is a JsonValueType.String, the latter is a JsonValueType.Number, which is a still a string on the wire.

Wildenhaus commented 5 years ago

But JSON numbers are already stringified.

True, but that doesn't consider platform-specific type handling. For instance, the JavaScript spec defines floating point numbers as 64-bit IEEE 754. Developers will use libraries like decimal.js or BigInteger in order to avoid the precision limitations set by the spec.

Consider what would happen if a JavaScript client tried to parse the number 0.987765432109876543210 from JSON. If it were in a number format, it would be rounded to 0.98776543210987655, whereas if it were in string format, it could be handled differently and sent to a library supporting a larger precision. In the grand scheme of things, the way floating point numbers are handled is largely dependent on the platform consuming the serialized data.

Coinbase, which is the leading cryptocurrency exchange based in the United States, states that they have adopted the practice of stringifying floating point numbers for this exact reason:

Decimal numbers are returned as strings to preserve full precision across platforms. When making a request, it is recommended that you also convert your numbers to strings to avoid truncation and precision errors. Integer numbers (like trade id and sequence) are unquoted.

The question is, does the financial industry send JSON like { "amount": "18.13" } or { "amount": 18.13 }

Actually, I'm almost certain that the most prevalent technique is to just ditch floating point numbers altogether in favor of 64-bit integers with separate decimal place tracking, but that can cause issues if the number happens to exceed a 64-bit integer's range.

It looks like JSON Schema has decided on this feature, but hasn't implemented it yet. Might be worth keeping an eye on.

Gnbrkm41 commented 5 years ago

That looks fair to me.

manigandham commented 5 years ago

Javascript is limited to 53 bits of precision so all 64-bit integers will be truncated. We have to use strings in JSON so that these values are transmitted correctly. This is standard industry practice because there's no other option.

Having the ability to both parse and write 64-bit integers as strings is a must, especially for JSON-based APIs.

NickCraver commented 5 years ago

I have have hit this as well, but didn't find this issue (I'm using the .Parse<T>() API). Raised here: https://github.com/dotnet/corefx/issues/39473

manigandham commented 5 years ago

Now possible using the new Json converters on 3.0 preview-7

    public class LongToStringSupport : JsonConverter<long>
    {
        public override long Read(ref Utf8JsonReader reader, Type type, JsonSerializerOptions options)
        {
            if (reader.TokenType == JsonTokenType.String && Int64.TryParse(reader.GetString(), out var number))
                return number;

            return reader.GetInt64();
        }

        public override void Write(Utf8JsonWriter writer, long value, JsonSerializerOptions options)
        {
            writer.WriteStringValue(value.ToString());
        }
    }
services.AddControllersWithViews()
   .AddJsonOptions(options =>
   {
        options.JsonSerializerOptions.Converters.Add(new LongToStringSupport());
   });
zelyony commented 5 years ago
if (reader.TokenType == JsonTokenType.String && Int64.TryParse(reader.GetString(), out var number))
                return number;

Point of this issue is NOT to use GetString() for JsonTokenType.String that is really decimal or long.

manigandham commented 5 years ago

@zelyony

Yea, it can be better. Here's an updated version to use the same Utf8Parser code:

public class LongToStringConverter : JsonConverter<long>
{
    public override long Read(ref Utf8JsonReader reader, Type type, JsonSerializerOptions options)
    {
        if (reader.TokenType == JsonTokenType.String)
        {
            ReadOnlySpan<byte> span = reader.HasValueSequence ? reader.ValueSequence.ToArray() : reader.ValueSpan;
            if (Utf8Parser.TryParse(span, out long number, out int bytesConsumed) && span.Length == bytesConsumed)
                return number;
        }

        return reader.GetInt64();
    }

    public override void Write(Utf8JsonWriter writer, long value, JsonSerializerOptions options)
    {
        writer.WriteStringValue(value.ToString());
    }
}
ahsonkhan commented 5 years ago

@manigandham, @zelyony - note that the latest converter sample that was shared won't work for numbers that are escaped since you are using the ValueSpan property to get the raw bytes (the previous sample using GetString() would work though). In case that scenario matters for your use case.

Say, the digit 2 was instead escaped as \\u0032.

const string json = "\"123456789010\\u0032\"";

var options = new JsonSerializerOptions();
options.Converters.Add(new LongToStringConverter());

// throws System.InvalidOperationException : Cannot get the value of a token type 'String' as a number
// since Utf8Parser.TryParse returned false
long val = JsonSerializer.Deserialize<long>(json, options);
Assert.Equal(1234567890102, val);

string jsonSerialized = JsonSerializer.Serialize(val, options);
Assert.Equal("\"1234567890102\"", jsonSerialized);
manigandham commented 5 years ago

@ahsonkhan Got it, thanks for the note. I guess both statements can be combined to handle that situation as well:

public class LongToStringConverter : JsonConverter<long>
{
    public override long Read(ref Utf8JsonReader reader, Type type, JsonSerializerOptions options)
    {
        if (reader.TokenType == JsonTokenType.String)
        {
            ReadOnlySpan<byte> span = reader.HasValueSequence ? reader.ValueSequence.ToArray() : reader.ValueSpan;
            if (Utf8Parser.TryParse(span, out long number, out int bytesConsumed) && span.Length == bytesConsumed)
                return number;

            if (Int64.TryParse(reader.GetString(), out number))
                return number;
        }

        return reader.GetInt64();
    }

    public override void Write(Utf8JsonWriter writer, long value, JsonSerializerOptions options)
    {
        writer.WriteStringValue(value.ToString());
    }
}
shaig4 commented 4 years ago

I Have same issue with booleans, decimals and dates, should be added a flag to enable quoted values.

manigandham commented 4 years ago

@shaig4 - There's a comment about that option in the other issue: https://github.com/dotnet/corefx/issues/39473#issuecomment-526245607

Maybe these issues should be merged to avoid duplicate discussions...

steveharter commented 4 years ago

Linking https://github.com/dotnet/corefx/issues/40120 as supporting quoted numbers in the the deserializer for dictionaries is a feature ask.

huoyaoyuan commented 4 years ago

Please do this at low level if possible - definitely we don't want to allocate temporary strings like naïve converters will do.

eiriktsarpalis commented 2 years ago

A few updates on this issue:

var options = new JsonSerializerOptions { NumberHandling = JsonNumberHandling.AllowReadingFromString };
var node = JsonSerializer.Deserialize<JsonNode>(@"{ ""num"" : ""1""}", options);
node["num"].GetValue<int>(); // System.InvalidOperationException: An element of type 'String' cannot be converted to a 'System.Int32'.

@steveharter is this something we should consider addressing in a future release?

steveharter commented 2 years ago

I do think having pushing these simpler serializer-only features like quoted numbers and JsonIgnoreCondition down to node and element is goodness. Some may even make sense at the reader\writer level.

Since JsonNode is layered on JsonElement during deserialization, to support the various serializer-related features for node we would need to: