Open RobThree opened 10 months ago
I would get right of 0xFE
and identify a null value by a zero length byte sequence read from the previous to the next 0xFF
.
This can even simplify the spec by having only 2 special bytes:
0xFF
for End Of Value
0xFE
for End Of Row
No, that won't work because you won't be able to store an empty string (which is not the same as a null value).
I agree with Rob's suggestion and would like to add one additional reason. I would like to interpret 0xFF as "end of value that is present", and 0xFE as "end of value that is not-present/disabled". The algorithm the becomes to "read bytes up to an end mark" in either case (it's just the current case is that all nulls are empty strings, but loops in my intended use case (or maybe it is a debugging use-case?) where values can be toggled on/off by changing one byte (changing 0xFF to 0xFE) while keeping the file otherwise unmodified.
I understand it is probably outside of the original scope of RSV, but just thought it would be better to speak up now and potentially avoid a confusion-inducing RSV fork for our oddball case.
I think having multiple separators complicates decoding and is a minor optimization for most use cases. The way it is, in some languages you can use split(string, 0xFD) or split(string, 0xFF) to split the rows and fields and use iterators to operate on the results... this is nice.
A split()
will typically require you to pass the entire data which makes it horribly inefficiënt and, essentially, limits you to whatever memory you have before having to hit the swapfile etc. A split()
is cool for simple files, but as they get larger you'll want a reader that is streaming. That way you can read bytes and, as you encounter a separator or null
value, emit each field / row. The limiting factor then becomes the field size and not the entire file.
Also, without a null
indicator you won't be able to differentiate between null
and an empty value, which is, again, a huge limiting factor.
Very true regarding streaming, though split could be set up to stream as well. I haven't examined very many examples in this repo, but the ones I have don't look to be set up for streaming.
I'm not advocating removing the null
indicator, that clearly needs to be there.
What I was getting at and didn't explain well is that adding decisions in the code adds complexity to the decoding processes. My understanding of the point of RSV was simplicity. I realize having multiple field separators which change a field to null or not does not add great complexity, but those sorts of decisions can make an elegant solution much less so quickly. That sort of functionality shouldn't be in RSV, add a filter up a layer.
@dcoai I'm still not sure I understand. You're advocating keeping the field separator after a null value instead of what I proposed in this issue (getting rid of the field separator after a null value)? Because, yes, that would make sense when using split()
etc. and I can see that point. But you don't address what happens when files get bigger and where (common) split()
implementations will run out of memory as opposed to a streaming reader which will have, essentially, no limitations.
I think also the (common) split()
will suffer a lot more from Shlemiel the painter’s problem and be much slower because of many (re)allocations where a streaming algorithm can just start reading and emit fields and rows as it goes. I'm not saying a streaming split()
can't be done, but I am saying that in most languages it isn't as far as I'm aware.
You're advocating keeping the field separator after a null value instead of what I proposed in this issue (getting rid of the field separator after a null value)?
Yes, this is what I'm advocating and the main point I was trying to get at. I should also add, I have very little investment in RSV at this point, so my opinion doesn't/shouldn't carry much weight. It is just my opinion.
split()
, was an offhand attempt at "for example:" why I thought a single separator was better. I am in total agreement with the issues you bring up for split()
. Clearly for large data-sets a streaming solution is preferable. Unfortunately the split example was not very good and diverted from the point I was trying to make. So to be clear, I'm not trying to defend split as a good approach. I made a quick 4 line encode and 4 line decode in Elixir to try it out using split, so it was on my mind.
To briefly restate what I was getting at, hopefully more clearly:
The way I see it, 0xff
is a field separator, and 0xfd
is a row separator. 0xfe
is a bit of data that means null
. So I like keeping 0xfe
as strictly data. Maybe in the future someone would think it was good to add 0xfc
to mean that a field had the same value as the previous. (just an example, not a proposal :). Ultimately it is just a preference, either approach works.
@Stenway Maybe this issue should be transferred to https://github.com/Stenway/RSV-Specification now that the spec has a separate repository?
Hi there, thanks for your discussion about this. The null-byte terminator was also a hot topic in the comments section of the video, so I've added a frequently asked questions section (FAQ) to the spec repository: https://github.com/Stenway/RSV-Specification#why-is-the-null-value-byte-terminated-with-a-value-terminator-byte also concerning the differentiation between empty strings and null values: https://github.com/Stenway/RSV-Specification#why-does-rsv-differentiate-between-empty-strings-and-null-values
Split is definitely not ideal. And sometimes there isn't a builtin byte-array based split method in a programming language. But when it is available, it usually leads to less lines of code for the decoder. I've implemented versions in Python, Go and C# for testing purposes. The Python implementation is actually quite nice I think: https://github.com/Stenway/RSV-Challenge/blob/main/Python/rsv.py#L70
So sacrificing this might harm the adoption.
Nevertheless I've added an Experiments folder to the repository, and added an implementation without the terminator byte for the null byte, for comparison: https://github.com/Stenway/RSV-Challenge/tree/main/Experiments/NullByteWithoutTerminator
Cheers Stefan
Why would you require a field separator (
0xFF
) after anull
value (0xFE
)? Why not have0xFE
signal both the null value and "end of field"? That would make reading RSV much easier because when reading a0xFE
you can immediately emit thenull
value and don't need any more shenanigans to wait for reading the0xFF
, looking back in the buffer if there's one and only one byte in it and it isnull
etc. That way you can make it 'streaming' much easier IMHO.