Stenway / RSV-Challenge

RSV = Rows of String Values
https://www.stenway.com
Other
89 stars 15 forks source link

Why require a fieldseparator after a null value? #2

Open RobThree opened 10 months ago

RobThree commented 10 months ago

Why would you require a field separator (0xFF) after a null value (0xFE)? Why not have 0xFE signal both the null value and "end of field"? That would make reading RSV much easier because when reading a 0xFE you can immediately emit the null value and don't need any more shenanigans to wait for reading the 0xFF, looking back in the buffer if there's one and only one byte in it and it is null etc. That way you can make it 'streaming' much easier IMHO.

txgruppi commented 10 months ago

I would get right of 0xFE and identify a null value by a zero length byte sequence read from the previous to the next 0xFF.

This can even simplify the spec by having only 2 special bytes:

RobThree commented 10 months ago

No, that won't work because you won't be able to store an empty string (which is not the same as a null value).

Osndok commented 10 months ago

I agree with Rob's suggestion and would like to add one additional reason. I would like to interpret 0xFF as "end of value that is present", and 0xFE as "end of value that is not-present/disabled". The algorithm the becomes to "read bytes up to an end mark" in either case (it's just the current case is that all nulls are empty strings, but loops in my intended use case (or maybe it is a debugging use-case?) where values can be toggled on/off by changing one byte (changing 0xFF to 0xFE) while keeping the file otherwise unmodified.

I understand it is probably outside of the original scope of RSV, but just thought it would be better to speak up now and potentially avoid a confusion-inducing RSV fork for our oddball case.

dcoai commented 10 months ago

I think having multiple separators complicates decoding and is a minor optimization for most use cases. The way it is, in some languages you can use split(string, 0xFD) or split(string, 0xFF) to split the rows and fields and use iterators to operate on the results... this is nice.

RobThree commented 10 months ago

A split() will typically require you to pass the entire data which makes it horribly inefficiënt and, essentially, limits you to whatever memory you have before having to hit the swapfile etc. A split() is cool for simple files, but as they get larger you'll want a reader that is streaming. That way you can read bytes and, as you encounter a separator or null value, emit each field / row. The limiting factor then becomes the field size and not the entire file.

Also, without a null indicator you won't be able to differentiate between null and an empty value, which is, again, a huge limiting factor.

dcoai commented 10 months ago

Very true regarding streaming, though split could be set up to stream as well. I haven't examined very many examples in this repo, but the ones I have don't look to be set up for streaming.

I'm not advocating removing the null indicator, that clearly needs to be there.

What I was getting at and didn't explain well is that adding decisions in the code adds complexity to the decoding processes. My understanding of the point of RSV was simplicity. I realize having multiple field separators which change a field to null or not does not add great complexity, but those sorts of decisions can make an elegant solution much less so quickly. That sort of functionality shouldn't be in RSV, add a filter up a layer.

RobThree commented 10 months ago

@dcoai I'm still not sure I understand. You're advocating keeping the field separator after a null value instead of what I proposed in this issue (getting rid of the field separator after a null value)? Because, yes, that would make sense when using split() etc. and I can see that point. But you don't address what happens when files get bigger and where (common) split() implementations will run out of memory as opposed to a streaming reader which will have, essentially, no limitations.

I think also the (common) split() will suffer a lot more from Shlemiel the painter’s problem and be much slower because of many (re)allocations where a streaming algorithm can just start reading and emit fields and rows as it goes. I'm not saying a streaming split() can't be done, but I am saying that in most languages it isn't as far as I'm aware.

dcoai commented 10 months ago

You're advocating keeping the field separator after a null value instead of what I proposed in this issue (getting rid of the field separator after a null value)?

Yes, this is what I'm advocating and the main point I was trying to get at. I should also add, I have very little investment in RSV at this point, so my opinion doesn't/shouldn't carry much weight. It is just my opinion.

split(), was an offhand attempt at "for example:" why I thought a single separator was better. I am in total agreement with the issues you bring up for split(). Clearly for large data-sets a streaming solution is preferable. Unfortunately the split example was not very good and diverted from the point I was trying to make. So to be clear, I'm not trying to defend split as a good approach. I made a quick 4 line encode and 4 line decode in Elixir to try it out using split, so it was on my mind.

To briefly restate what I was getting at, hopefully more clearly:

The way I see it, 0xff is a field separator, and 0xfd is a row separator. 0xfe is a bit of data that means null. So I like keeping 0xfe as strictly data. Maybe in the future someone would think it was good to add 0xfc to mean that a field had the same value as the previous. (just an example, not a proposal :). Ultimately it is just a preference, either approach works.

RobThree commented 10 months ago

@Stenway Maybe this issue should be transferred to https://github.com/Stenway/RSV-Specification now that the spec has a separate repository?

Stenway commented 10 months ago

Hi there, thanks for your discussion about this. The null-byte terminator was also a hot topic in the comments section of the video, so I've added a frequently asked questions section (FAQ) to the spec repository: https://github.com/Stenway/RSV-Specification#why-is-the-null-value-byte-terminated-with-a-value-terminator-byte also concerning the differentiation between empty strings and null values: https://github.com/Stenway/RSV-Specification#why-does-rsv-differentiate-between-empty-strings-and-null-values

Split is definitely not ideal. And sometimes there isn't a builtin byte-array based split method in a programming language. But when it is available, it usually leads to less lines of code for the decoder. I've implemented versions in Python, Go and C# for testing purposes. The Python implementation is actually quite nice I think: https://github.com/Stenway/RSV-Challenge/blob/main/Python/rsv.py#L70

So sacrificing this might harm the adoption.

Nevertheless I've added an Experiments folder to the repository, and added an implementation without the terminator byte for the null byte, for comparison: https://github.com/Stenway/RSV-Challenge/tree/main/Experiments/NullByteWithoutTerminator

Cheers Stefan