GoogleCodeExporter opened 9 years ago
Good point, Chad. The JSON serializer was converted from an existing similar serializer (the stax/xml one, maybe), so it inherited a rather care-free parsing approach. I'll try to fix that -- catching these problems is important; that's why code to check that serialized objects round-trip ok was added (but obviously not tested with much variation).
Original comment by tsaloranta@gmail.com
on 25 Jun 2009 at 5:32
Ok: I improved the json and stax/xml serializers so that they now properly handle variable-length lists. For JSON this means using an array/list structure (the existing code actually produced invalid JSON, too, with duplicate keys); for XML it means keeping track of the actual start element names.
It might make sense to re-run the tests to check whether these flaws had a significant effect on the numbers.
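To make the duplicate-key problem concrete, here is a small illustration (Python for brevity; the field name "item" is made up and this is not the benchmark's actual code) of why emitting a repeated field as duplicate keys loses data, while an array round-trips cleanly:

```python
import json

# Duplicate keys: standard parsers keep only one of the values,
# silently dropping the rest of the "list".
broken = '{"item": "a", "item": "b", "item": "c"}'
assert json.loads(broken) == {"item": "c"}  # only the last value survives

# Encoding the variable-length list as a JSON array instead round-trips.
fixed = json.dumps({"item": ["a", "b", "c"]})
assert json.loads(fixed) == {"item": ["a", "b", "c"]}
```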
Original comment by tsaloranta@gmail.com
on 25 Jun 2009 at 6:40
Thanks for looking into this issue. The work you did handles one class of problems, but I am afraid that there are others. Judging by the Thrift and protobuf IDLs, many of the fields are intended to be optional. However, I don't think the JSON deserializer is acting as if any of the fields are optional (and others may also have similar problems). If you add a field that isn't currently being serialized, deserialization will break. If you remove a field that is currently being serialized, deserialization will break. This means that you need to add more metadata to the JSON (field names and/or field IDs), and that metadata needs to be checked by the deserializer to see which field is being read. Anything less than that means the deserializer is acting with knowledge about the data that it simply can't have.
If you assume (as the Avro tests do) that some general schema information has been transmitted out-of-band, there needs to be, at a very minimum, some way of marking optional fields as either present or absent in the particular packet of serialized bytes being deserialized (and also some indicator of which pre-transmitted schema should be used).
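A rough sketch of what this would look like (hypothetical field names and defaults; Python for brevity, not code from the benchmark): the deserializer is driven by field names found in the payload, so added or removed optional fields degrade gracefully instead of breaking the read:

```python
import json

# Hypothetical schema: the known fields and their defaults when absent.
DEFAULTS = {"id": 0, "name": "", "tags": []}

def deserialize(payload):
    data = json.loads(payload)
    # Keyed by field name: missing optional fields fall back to defaults,
    # and unknown extra fields are simply ignored.
    return {field: data.get(field, default)
            for field, default in DEFAULTS.items()}

result = deserialize('{"name": "x", "extra": 1}')
assert result == {"id": 0, "name": "x", "tags": []}
```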
Original comment by chad_wal...@yahoo.com
on 25 Jun 2009 at 7:35
Chad, while there are questions of proper handling, I think you may be assuming things that are not true. For example, whether handling of extra or missing fields should be included depends on the usage scenario, and it is not generally true that serializers should handle those in any way other than, say, throwing an exception. The same applies to many other aspects -- whether the ordering of fields is fixed, how much flexibility there is, and so on. Or whether it matters if one can deduce whether a field is missing or has a null value, etc.
This is not to say that there isn't room for improvement: the code still makes some assumptions that real-world systems would not. And improvements are always welcome. For the case of optional fields, the solution is quite straightforward, just more work: see what is included, then deserialize (or, if one wants to serialize only non-null fields, do the same).
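A sketch of that straightforward approach (hypothetical field names; Python for brevity, not the project's code): emit only the non-null fields, and on the way back in, check what is actually present rather than assuming:

```python
import json

def serialize(obj):
    # Emit only the fields that actually have a value.
    return json.dumps({k: v for k, v in obj.items() if v is not None})

payload = serialize({"id": 5, "name": None, "email": "a@example.com"})
data = json.loads(payload)
assert "name" not in data  # absent from the payload, not serialized as null
assert data == {"id": 5, "email": "a@example.com"}
```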
My understanding of the test is that it tries to give a rough but reasonable test for one use case. And having a priori or out-of-band information may be perfectly legal -- after all, for Protocol Buffers (for example) to work at all, one must have used an external schema to produce binding code. Whether that is considered fair or not is subjective.
I do not know what the Avro tests are supposed to do, since I did not write them. Perhaps someone else can comment on that one.
Original comment by tsra...@gmail.com
on 26 Jun 2009 at 4:20
Perhaps I am assuming too much about what this project is attempting to compare. Let's try to elaborate a few things and see if we can come to a shared understanding around that.
You said above that the tests are attempting to capture only one use case, but it isn't clear to me from the project Wiki or documents what that use case is. I think that clearly identifying what use case the tests are attempting to capture would go a long way towards resolving some of my concerns.
Is there anywhere a clear statement of what the minimum supported feature set of the serializers should be? If not, maybe we should add one so that people can understand what class of use cases is being benchmarked.
If we wanted to get more ambitious and extend beyond that, we could even identify multiple use cases that require additional levels of functionality, and tweak some of the serializers to have different "flavors", to try to demonstrate the cost of layering on additional functionality.
WRT out-of-band transmission of information, I have no general issue with that as long as there is a clear understanding of what can legitimately be transmitted out-of-band and what can't be.
I think we both agreed that having the size of a field defined as a variable-length list in the schema for a particular object instance travel out-of-band was probably too much of an assumption for almost any reasonable use case (hence the patch discussed above).
On the other hand, the Avro tests are likely fine for most potential use cases. I have the impression that those tests might assume it's OK that no schema identifier needs to be passed in the per-object serialized packet, which would rule out some use cases.
WRT optional fields, I was operating under the assumption that support for optional fields was necessary, since the Thrift and Protobuf IDLs have optional fields in them. Not to keep picking on the JSON serializer, but it does seem to assume that particular fields, from among a number of fields marked optional in those IDLs, will be present in a very definite order. If that is OK for the use case we are interested in, then we should probably tighten up the IDLs to reflect the use case more accurately.
Original comment by chad_wal...@yahoo.com
on 28 Jun 2009 at 2:08
Good points, Chad; let's continue the discussion on the mailing list.
Original comment by eis...@gmail.com
on 29 Jun 2009 at 7:35
Chad, I think we agree on all the important points. So let's keep on improving the test cases -- and yes, much of the intended use case is only implied. And I like all the suggestions presented.
On a related note, I continued cleaning up the json converter, and will do the same for the xml one. They will look more like production serializers I have written, the main difference being that there is a little less error diagnostics, to keep the code more readable. I understand the JSON converter is used just as an example, given how similar many of the streaming converters are (due to being based on the initial xml one).
Original comment by tsaloranta@gmail.com
on 30 Jun 2009 at 5:52
Original issue reported on code.google.com by
chad_wal...@yahoo.com
on 13 May 2009 at 8:36