Open ronag opened 5 years ago
it turns out... that when the revid is 32 characters long couchdb assumes it's a hex string...
I think if the hex parse fails it should fallback to just doing l2b
?
Are you really making your own _rev tokens? That's pretty far outside the supported API.
parse_revid(RevId) when size(RevId) =:= 32 ->
RevInt = erlang:list_to_integer(?b2l(RevId), 16),
<<RevInt:128>>;
parse_revid(RevId) when length(RevId) =:= 32 ->
RevInt = erlang:list_to_integer(RevId, 16),
<<RevInt:128>>;
parse_revid(RevId) when is_binary(RevId) ->
RevId;
parse_revid(RevId) when is_list(RevId) ->
?l2b(RevId).```
making your own _rev values is quite unusual. What are you doing?
That's pretty far outside the supported API.
It is? How so?
What are you doing?
Adding tracking such as modified time & modified by
what is the consensus about this issue? Are custom _rev tokens allowed or not? if not, any invalid _rev token should be rejected with proper error message.
The current state is very unclear.
I am using custom suffix on _rev value to track merged and resolved conflicts suffix "M" - merged (example: 15-4b7c15ab6fa3a8271935ea91eddf8eb1M) suffix "R" - resolved - deleted conflicted revision (example: 11-45637d8602d84d82e395a6fa55d12f80R)
The two uses, "Adding tracking such as modified time & modified by", are best done with json key/value attributes rather than co-opting the _rev value.
Tracking merged and resolved conflicts is also not common (I've never heard of anyone do this before you mentioned it four hours ago). Conflicts are resolved by deleting the revisions that "lose", even if you have to make a new revision that is a combination of others. I'm curious as to what value you get out of this that isn't already present in our MVCC system.
Custom _rev values are supported but it is not common to do it, hence the presence of bugs or unexpected behaviour. The historical reason we assume that 32 char rev strings can be decoded to integers escapes me but would need rooting out before we could determine if it was safe to remove those clauses.
Also worth noting that replication will not adopt your custom _rev approaches and will make the normal _rev strings.
the == 32
clauses dated back to 2009 when deterministic _rev was first introduced. I'm inferring it was done as a space/time optimization (integers are smaller than the binary text expression of them), so perhaps a fallback to just using the rev if the decode fails would be acceptable after all.
Given the future foundationdb transition and its constraints on key/value sizes, this is probably a good time to make a project decision on whether users can generate their own rev values and, if so, what constraints we might apply to them (a maximum length seems very likely to be one of them).
For now... isn't it simply a case of trying to parse as MD5 and if that fails fallback to l2b?
I understand this is more of a fundamental issue and this might be removed/deprecated in future versions. But previous versions where it is supported shouldn't we at least avoid these kinds of "magical" fail cases?
"I'm curious as to what value you get out of this that isn't already present in our MVCC system."
Easy: This allows you to distinguish between intentionally deleted document (without R) and deleted revision (with R).
And you are right, this still can be handled by a key in the document itself. But revision is always available without need to download the whole document (include_docs=false)
EDIT: ... and also in revision log (_revisions: {"ids":..., "start":...}}
I still think the decoding as integer is about memory usage in the erlang runtime system. We "know" we can decode revs of that length to a number because we created them. A rev value itself has no meaning, it's only for comparison at update time. The deterministic rev system confuses matters by adding a meaning to gain an advantage (the same update to two separate couchdb systems generate the same new _rev, so that replication between them later does not create a conflict).
I've invited others to comment (and obviously this thread is public) but my view at present is that we should not remove the '== 32' clauses or add a fallback if those don't decode correctly. At most, I suggest we add a line in the docs that user-generated rev values are discouraged and a stipulation that if those values are exactly 32 characters long they are required to be encoded integers.
The determining factor here will be whatever we are going to define for CouchDB 4.0. If we're supporting custom _rev values in that release, we will have to provide documentation on what rules apply (maximum/minimum length, allowed characters, etc) and then, as you've requested, we'll consider any rev value that doesn't follow those rules as an error and reject them, with a readable error message.
I would like to argue against the stance to not to add a fallback.
Custom revs are supported and no limitations have been specified. Especially considering this limitation is a bit "random" from an API user standpoint. Hence, there might be a lot of users out in the wild who are generating customs revs and their databases can randomly "fail". This is what happened to us. We've been running for 4 years like this at multiple clients and it worked fine until it didn't, in the middle of a live broadcast :/
I understand that a fallback is a slow path, but it's a lot better than what seems like a "random" failure.
@rnewson My $0.02:
@wohali agree that this has been the behaviour since forever (since 1.0 itself) and so this is definitely not a regression or should be a surprise to anyone making bespoke rev values.
for 4.0, I have a mild preference for saying that only couchdb gets to generate these values but the reality is that pouchdb generates rev values using a different algorithm and we (couchdb) should also be free to change (i.e, improve) this in future. I note, for full disclosure, that prior to the "deterministic rev" work that _rev used to be a random uuid. Replication works so long as the rev stays the same if the doc does, and changes if the doc changes. Anything beyond that is enhancement.
So... in 4.0 I'd be fine with us defining what the valid format for a _rev string is and explicitly supporting any string that matches that (and dropping the integer conversion when length is 32 chars). 3.0 could maybe drop the '== 32' clauses as a stepping stone?
Other voices should chime in, I'm not comfortable defining this here. (cc @davisp @kocolosk @chewbranca @daleharvey, a response from all of you would help, of course other voices are equally welcome).
(I am clearly very on the fence here and could tip either way. I do 100% agree that this halfway situation is not ok. You can define your own _rev strings but there are weird undocumented gotchas.).
Since I know the gotcha this is not really a problem for me. I just add an extra char at the end to make it 33 chars when it would be 32 chars otherwise. However, I am a bit confused about the reluctance to just add a fallback? I can only see advantages and no disadvantage. Is it because you don't want to have to create new patch releases for 1.x and 2.x?
This behaviour comes from ~2009 so it's been this way for a very long time.
Well... we've been running since 2014 and had no problems until just the other week... so I'm not quite sure what this comment tries to point out? It being a problem is very unlikely... but it can happen none the less
It's not a reluctance to make releases (we would not make a 1.x release for anything but a security related issue anyway).
This behaviour has existed since 1.0 and the consequences of changing it are not fully known. I appreciate you don't see a disadvantage but that isn't sufficient.
By deciding what we'll support in 4.0 we can plan what 3.0 might do and give more people a chance to chime in with hitherto unforeseen problems that might arise (or, we hope, more confirmation that there's no problems).
Summary: It's been this way for 10 years, we're not changing it without a lot more input.
Re: https://github.com/apache/couchdb/issues/2015#issuecomment-487321849
but the reality is that pouchdb generates rev values using a different algorithm
There are also other replicator compatible data stores that generate rev IDs differently to (at least current) CouchDB e.g. the Cloudant sync libraries. We don't do anything particularly unusual, in fact, we use a random UUID similarly to CouchDB prior to deterministic revs. I do think that maintaining backwards compatibility with those types of revs probably ought to be a goal here as trying to migrate rev formats in existing datastores would be painful.
If there have to be some sort of forced rev format/algorithm then I think there needs to be clearly defined versioning for replicators such that Couch N would always be able to replicate with rev ids that were supported by the replicator from at least Couch N-1, for example.
I haven't got any issue adding a fallback that just reuses the provided rev even if its doesn't parse as hex. I'd agree that its basically just space savings for both RAM and disk usage.
The only issue I have with people creating "meaningful" revisions is that there's always the possibility that we evolve our own "differently meaningful" revisions in the future which may end up being incompatible. However, given that we're currently not versioning them that's also on us to be thoughtful of if we were to ever do that (not that I've ever heard of any ideas around doing that, just noting the possibility exists).
Also, fun fact but we also have weirdness in the opposite direction. If someone were to create a custom revision that is 16 bytes long, it'll get hex-ified [1] on the way out which is also something to think about.
[1] https://github.com/apache/couchdb/blob/master/src/couch/src/couch_doc.erl#L70
Also, fun fact but we also have weirdness in the opposite direction. If someone were to create a custom revision that is 16 bytes long, it'll get hex-ified [1] on the way out which is also something to think about.
Not sure how to solve that... even with a fallback... don't use custom revid that are 16 chars long :)
A controversial suggestion... but if you don't use this optimisation in CouchDB 1-3 (i.e. no "smart" parse to 128 bit integer) and then re-add it in e.g. CouchDB 4 (where it is clearly documented)... then the currently documented/assumed behaviours will all work... with a performance regression of course (revid will take 16 bytes extra memory)... but how big is that regression in reality?
(we would not make a 1.x release for anything but a security related issue anyway)
just to clear a minor point up: 1.x is even past security releases.
This PR looks relevant https://github.com/apache/couchdb/pull/1897/files
@rnewson @davisp @janl In light of 4.0 development, we should decide one way or the other on this.
In short: are we going to allow user-defined _rev
values in 4.0? Or no?
bump on this. @garrensmith @rnewson Are we going to prevent user-defined _rev
values in fdb-based CouchDB? This issue makes it clear it could be a "sharp edge" that users could get caught on.
If so I want to mark this in the documentation for 3.x ASAP.
As a developer who uses it, I selected couchdb in part because the _rev is opaque and CAN be controlled from outside couchdb. Using md5(erlang-data-representation) in those _revs is far from ideal for deterministic versioning, so it's not just a nice point of flexibility. But it is nice, for sure.
Would push replicators from other versions/implementations (including and not limited to PouchDB) have to pull the new _rev and write it back to disk, when pushing to a server that ignores provided _rev's? If so, that would be a minor ick.
Impact assessment: if one generates revs that have a 32-byte non-int, one would get an error, search and find this issue and then add or remove a byte from their _rev representation. I'm not seeing that as "sharp", myself. Did I miss something?
I would suggest, as the simplest change, to continue the existing semantic of "support provided _rev's opaquely", while correcting the unintended crashahem, error-on-32-char-rev's-not-parsable. I might also add any needed guidance for integrators and authors of custom replicators that they SHOULD generate deterministic _rev's if they use an algorithm different from couchdb's default internal mechanism.
Might it be helpful to strengthen any language for proxies and replication implementers to treat observed _rev's as opaque at baseline? while possibly noting that custom replicators are allowed to probe for any special meanings for some optimization purposes (just as the int-parsing logic does currently), but that they MUST be robust to opaque _rev's that don't fit their hopes and dreams. I think that's another way of indicating the high-level outcome that it seems @ricellis drives at.
@davisp It's possible to suggest people SHOULD NOT use flags in the revs for indicating such facts as deletion or update-times, although given that _rev's are opaque values, it's not clear to me that these bits of meaning even could be harmful to anyone (reminding myself that if one does such things, they can also evaluate and test against any potentially-colliding changes made in later releases of the server) - essentially as Paul said. I'd leave off this bit, myself.
There's some additional perspective from the field.
Can good 32-byte-hex-ints continue to get an optimized code path, while other 32-byte _revs get a working result as with non-32-byte _rev's?
I’m in favour of allowing external revs, and I’d be okay with documenting clearly that 32-byte-non-int revs are an obtuse edge-case that we’re not changing, so implementers will have to do some trickery to work around that.
In addition, whatever 4.0 limitations exist, those need documenting then too.
For the record, I don't care one way or the other, but if this is a use case we're going to explicitly support, it deserves to be documented correctly, along with any limitations. I'll help with that if someone can explain the 3.x/4.x behaviour to me thoroughly enough.
-Joan
Does revid have some undocumented limitation? couchdb 1.7.1