Closed ericwj closed 3 years ago
I have tried setting JsonSerializerOptions.AllowTrailingCommas
but that didn't make a difference. If it would do what I expect it still wouldn't help if the JSON isn't followed by a comma.
@ericwj can you share your input JSON payload that is causing the issue and also the .NET object model you are deserializing to?
Also, what version of dotnet sdk are you using (share the output of dotnet --info
)? Can you try with preview 9 bits (or latest nightly)?
Please try the nightly SDK/runtime with the fix to verify it works (https://github.com/dotnet/core-sdk#installers-and-binaries), or wait till preview 9 ships. Alternatively, you could reference the latest S.T.Json NuGet package. For example:
<ItemGroup>
<PackageReference Include="System.Text.Json" Version="4.6.0-preview9.19413.13" />
<PackageReference Include="System.Text.Encodings.Web" Version="4.6.0-preview9.19413.13" />
</ItemGroup>
If your scenarios is the following (and you want to just read the first json object), then that is not supported today. We only support reading a single JSON object:
File containing more than one payload isn't supported:
{ ... some json...}, { ... another json ... }
This is also how the underlying Utf8JsonReader
behaves (which is where the exception you observed is coming from).
string json = "{},"
byte[] input = Encoding.UTF8.GetBytes(jsonString);
var reader = new Utf8JsonReader(input,
new JsonReaderOptions() { AllowTrailingCommas = true });
while (reader.Read())
;
// After EndObject throws System.Text.Json.JsonReaderException :
',' is invalid after a single JSON value. Expected end of data. LineNumber: 0 | BytePositionInLine: 2
Yes I am trying to read in a streaming fashion, so there is JSON following other JSON. The files can become fairly big.
I think I might be able to use a Pipe
and an Utf8JsonReader
to do the job.
Although it is quite a lot of code overhead, working with SequencePosition
is awkward and JsonSerializer.Deserialize<T>(ref reader, ...)
will throw exceptions as long as the buffer isn't valid yet while I search for the next '}'
. Wouldn't it be better if there was a bool TryDeserialize<T>
anyway?
Wouldn't it be better if there was a
bool TryDeserialize<T>
anyway?
What would you expect TryDeserialize to do/return in your scenario where you have multiple json payloads within the file and it fails to deserialize? The deserialize call today throws, so I don't see returning false in the Try* API would help all that much. cc @steveharter
I think I might be able to use a
Pipe
and anUtf8JsonReader
to do the job. Although it is quite a lot of code overhead, working withSequencePosition
is awkward
Yep, that's true. Let me think about the scenario you presented and get back to you (I want to see how much effort is required if a motivated dev wants to support this case via the low-level reader). If you already have a pipe/utf8jsonreader-based implementation that works, please share.
A quick and dirty implementation that works as long as the input is okay is here.
What would you expect TryDeserialize to do/return in your scenario where you have multiple json payloads within the file and it fails to deserialize?
I would like it to return false
or some error status and not throw, because exceptions are wicked expensive. Something like this:
class JsonSerializer {
public static bool TryDeserialize<T>(
ref Utf8JsonReader reader,
out T result,
JsonSerializerOptions options);
}
The pipe reader code will search for [
, {
and their closing counterparts, where any unbalanced object tags will cause an exception calling JsonSerializer.Deserialize<T>
. A naive way is just to try to deserialize for any }
found, hoping it will match the first opening brace in the JSON remaining at that point, at the cost of a try...catch
if the braces aren't balanced. Less naive might be to balance {
's with }
's, but that comes at the cost of more calls to ReadOnlySequence<byte>.PositionOf(byte)
or yet higher code complexity and more awkward SequencePosition
math and comparisons.
I don't immediately see the need for SequencePosition
in favor of plain int
s or long
s at all, can the tracking of the _object
field not be avoided by just having overloads that take some integral offset? Or it has to become something proper that is comparable and has suitable operators defined. But then still I don't see why I have to deal with them and drag that _object
field along.
I seriously don't think using the Utf8JsonReader
directly - that is, without calling into JsonSerializer
- is beneficial except if performance is supercritical and you have to deserialize a lot of JSON. I don't think I might ever go that route except in a rare case when the schema is very, very limited and the volume to read/write justifies the effort.
Given this is currently by-design, and requires some thinking/design work to support this feature, moving to 5.0.
I ran into this issue myself and ended up working around it by making a custom Stream
class that has some minimal understanding of JSON syntax and, provided the JSON is valid, can determine when it has reached the end of a JSON value. One thing I ran into, though, is that in some cases, the end of the JSON value is detectable only by reading a character that isn't part of the value, and if Deserialize
simply returned at that point, that character would be lost, since generic Stream
doesn't have a way to "push back" a character to allow a future call to reread it. My custom Stream
takes this into account, storing this potential "pushed back" character. But, this is a really messy solution, and I don't see a way to integrate partial reads of the source data into the existing model for Deserialize
without running into this problem. But, a thought that occurred to me is that perhaps JsonSerializer
could present a method specifically for deserializing a JSON array into an IEnumerable
or IAsyncEnumerable
, yielding each item as it is deserialized. Lacking a full async deserialization infrastructure, each of those items will have to be fully deserialized in order to be returned (e.g. even if it contains a nested array, that nested array cannot itself be returned piecemeal), but that is its own extremely difficult problem to solve in a way that I believe doing this at the top level is not. A method as proposed would basically need to itself recognize [
, ]
, and ,
, and then would need internally to have a Deserialize
method that could return a "push back" character and also use one passed back in from a previous call, instead of requiring that the end of the value coincide with the end of the stream.
Actually if it were only being used by the enumerate-array-elements method it wouldn't need to support receiving a pushback character, as the character in question would always be either ,
or ]
and would be consumed by the outer method.
In case it is not obvious, specifically, if the value being deserialized is a string, an object or an array, then the character that tells you you are at the end of the JSON value is a part of the value itself and no further character needs to be read. But, if it is a number, you only know that you're past the end of the number once you read the first character that isn't part of the number. Technically speaking, when reading true
, false
or null
, the last letter of the word, with well-formed JSON, unambiguously marks the end of value, but if the value is read up until a clear "end of identifier", that end is also past the end of the JSON value.
@logiclrd - You're making things hard on yourself. Assuming you've got some sort of delegating stream which is buffering the contents, you don't bother to rewind the stream, you just don't move the offending characters into the buffer (or rather, don't mark them as part of the buffer).
,
followed immediately by ,
).{
) or start-of-array ([
) is found.
CopyTo
considers it index 0)This unfortunately iterates over the stream an extra time (but only once), but is simple to implement and doesn't require messing around with states of readers.
This unfortunately iterates over the stream an extra time (but only once)
CanSeek
is not always true. This is not a general solution. In my specific case, the underlying stream is a TCP connection. The entire purpose of my implementation is to avoid buffering and yield elements of the array as they arrive, and I think that would be a very useful thing to have in general. :-)
I have had a similar issue. I solved it by deleting the file before writing serialized JSON in it. If not do the deletion, the System.Text.Json.JsonSerializer
will overwrite the file and leave the previous extra data unchanged.
I hope my experience could solve the issue for someone else.
It's good to point this out in case people with corrupt data end up on this issue by searching for the error message, but I want to make it clear that this issue isn't about corrupt data, but about a design limitation in System.Text.Json that makes it impossible to intentionally decode a JSON value from the middle of a stream.
Note that there is another approach you can use: call stream.SetLength(stream.Position)
after serializing to it but before closing the file.
I ran into a similar issue with my json: {"StorageIdentifier":"aafb1460-8a30-49f4-89be-5ccf2651248d","Subscription":"c728de4b-07f3-488f-b9eb-58508281525b"}
I can't figure out what is the problem
Duplicate of #33030, specifically the discussion about SupportMultipleContent
-like functionality.
This does not seem like a duplicate of #33030. #33030 is about reading multiple independent JSON documents in succession from a stream, while this one is about reading a part of a single JSON document.
Wouldn't a SupportMultipleContent
-like feature serve to address both use cases though?
At the very least the other one is the duplicate - it is three quarters of a year newer than this issue.
Looks like about the same although I think SupportMultipleContent
is misleading. Can you just make it stop parsing when it's done please?
At the very least the other one is the duplicate - it is three quarters of a year newer than this issue.
I should point out that me closing this or the other issue is not assigning credit -- we just need to keep a focused backlog. It just happens that the other issue was triaged ahead of time and as such contains more recent information. I would recommend upvoting the other issue or contributing to the conversation there.
Looks like about the same although I think SupportMultipleContent is misleading. Can you just make it stop parsing when it's done please?
Why do you think it's misleading? The default behavior is to fail on trailing data and we're not changing that. We can discuss adding an optional flag for disabling this, just like Json.NET is doing.
I should point out that me closing this or the other issue is not assigning credit
Thats fine.
Why do you think it's misleading? The default behavior is to fail on trailing data and we're not changing that.
I think the name SupportMultipleContent
is. I totally understand most use cases need to fail for trailing bytes. My point is that I started reading at some random point in a stream and it didn't care about that either. It also shouldn't care what follows, JSON which I read as 'multiple content', newline, or emoji, or 0xdeadbeef
or 0xcc
, or an access violation.
Sure, this is not an API proposal, it's merely pointing out the name of the feature in Json.NET
I still feel like these are fundamentally not the same issue.
The other issue is about allowing parsing in situations where the source data is not syntactically valid JSON, because it is smushing multiple documents together. This issue is about making the parsing of JSON arrays piecewise -- it's still a single, valid document, I just don't want to parse the whole thing in one call. But, I want the parsing state preserved so that I can go back to it and get the next piece.
These are not the same functionality.
EDIT: I thought I was commenting on a related issue that I had encountered. Rereading the top of this thread, @ericwj is not coming at this from quite the same place I am, and my issue was a comment further down the thread.
Just looking through how this works, I also think this issue should be reopened, over this request: TryDeserialize
overloads.
Although neither the ,
nor the end-of-lines in #33030 are really a problem when the reading is from a span and the objects being read are known to be fairly small and can hence be definitely fully in the span such that Utf8JsonReader
will happily succeed calling JsonSerializer.Deserialize<T>(ref Utf8JsonReader)
, stuff gets very cumbersome when the reading is from a stream like in this original issue's comment. Most of these scenario's get even more complicated when reading arbitrary objects, asynchronously. And one reason why I struggle to get anything custom implemented is because all I can do is try and catch and that just stinks for being...smelly, and for performing very badly for no reason.
In the former case, with small objects, in spans, or single read results, @logiclrd is also helped using the pair Utf8JsonReader.CurrentState
and new UTf8JsonReader(Span/Sequence, bool, JsonReaderState)
. But this is seriously unworkable as a general solution over the lack of TryDeserialize
on JsonSerializer
and it is also just a lot of code - which would lead to #33030, but I can imagine whatever solution you create for that, someone will have some special requirements and still want to hand roll something and hence stumble on the limited performance of catches/second possible.
EDIT: I thought I was commenting on a related issue that I had encountered. Rereading the top of this thread, @ericwj is not coming at this from quite the same place I am, and my issue was a comment further down the thread.
@logiclrd is this presumably related to https://github.com/dotnet/runtime/issues/30405#issuecomment-739567361? I'm not sure I entirely understand the first part of the post, so I would recommend creating a new issue containing a reproduction of the behavior so we can make a recommendation. Regarding the second part, .NET 6 does ship with a DeserializeAsyncEnumerable
method for streaming root-level JSON arrays.
@ericwj a related request was discussed in https://github.com/dotnet/runtime/issues/54557#issuecomment-869836263. Per design guidelines Try-
methods are only permitted to return false
for one and only one failure mode. So a method that's equivalent to doing catch { return false; }
would most likely not be accepted since it's collapsing a large number of failure modes.
Mm, DeserializeAsyncEnumerable
I believe does cover my usage. It only applies to arrays at the root, but that's precisely my situation.
@ericwj a related request was discussed (...) design guidelines (...)
This too is not an API proposal.
I'd be fine with error codes, preferably the most reasonable ones defined in an enum while there is a whole series of errors for which its fine to still throw in this case, too - like out of memory, etc. but also say invalid characters for the current JavaScript decoder.
Other than that, for the use case at the start of the article, all that would be needed is bool TryDeserialize<T>
which returns false only in one or two specific cases, primarily when there is not enough data available.
The benefit of the former variant would be that much higher levels of customizations can be achieved at very high performance. Would still need to instruct it to succeed without error if all that is wrong is that data follows the JSON, whatever how long the valid part is.
Given the scenario where you just want a single value, I'd probably take the approach of passing in a JsonPath (e.g. "$.MyObject.MyPropertyToReturn" to a helper like:
static async Task<byte[]> GetJsonFromPath(Stream utf8Json, string jsonPath, JsonSerializerOptions options);
// Return the raw bytes to be deserialized in any way (serializer, node, element)
or a helper that directly deserializes those bytes using the serializer\node\element which will avoid an extra byte[] allocation:
static async Task<TValue> GetValueFromPath(Stream utf8Json, string jsonPath, JsonSerializerOptions options);
// Return the value from the serializer.
Similar JsonPath functionality was proposed with https://github.com/dotnet/runtime/issues/31068 and also in https://github.com/dotnet/runtime/issues/55827 where I said I'll provide a sample to do this with V6 code, however Stream+SingleValue wasn't mentioned (which complicates things).
Some options for Stream:
All options would have minimal processing of the unneeded values -- e.g. values that are not the target JsonPath will not be deserialized.
I can work on the sample now if there's interest, but would like to know if "drain" is acceptable or not.
No thank you @steveharter, repeatedly parsing is unacceptable for the problems described in this issue and in #33030.
OK I'll work on a prototype; since it involves async+Stream the prototype may need to change internal STJ code instead of just an add-on sample.
I am reading a single complex object from a
FileStream
usingJsonSerializer.ReadAsync<T>(stream, options, token)
and I seriously just want the one value, but I am getting this exception.The JSON in the file is preceded by bytes which I have skipped or already processed and it is followed by more bytes which I can skip or process separately. However I get stuck by this exception and don't get the value that is presumably correctly read.
I have no idea how long the value is. Pre-parsing the JSON at a lower level to just get the byte length is a bad solution.
Wouldn't it be better to not throw in this situation and let the consumer of the API choose to check for end of stream?
Similar to dotnet/runtime#29947.