Alternatively, we just continue to use the offsets.
Dealing with `split_at` and using offsets is somewhat of a pain.
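To make the pain point concrete, here is a toy, std-only sketch (plain encoded strings, not the `jsonptr` types): splitting at a byte offset only makes sense when the offset lands exactly on a separator, so every caller has to carry that boundary bookkeeping around.

```rust
// Toy illustration (not the jsonptr API): splitting an encoded JSON Pointer
// at a byte offset requires the caller to land exactly on a '/' boundary,
// which is what makes offset bookkeeping error-prone.
fn split_at_offset(ptr: &str, offset: usize) -> Option<(&str, &str)> {
    // only offsets that point at a separator are valid split points
    if ptr.as_bytes().get(offset) == Some(&b'/') {
        Some((&ptr[..offset], &ptr[offset..]))
    } else {
        None
    }
}

fn main() {
    let ptr = "/foo/bar";
    // offset 4 lands on the second '/', so the split succeeds...
    assert_eq!(split_at_offset(ptr, 4), Some(("/foo", "/bar")));
    // ...but offset 3 is mid-token, and the caller has to handle that
    assert_eq!(split_at_offset(ptr, 3), None);
}
```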
I'm still not sure what the right answer is here. Some of the API treats the `Pointer` as a string, and in a lot of cases this is what you'd want, but it also behaves like a collection. If we double down on the `Pointer` being a collection, then we'd need to re-evaluate the methods which treat it as a string.
It wasn't horrible to write the iterator below, but I think it could be made a lot cleaner if `Component` were something like:

```rust
enum Component<'a> {
    Root,
    Component {
        /// full path to this component
        path: &'a Pointer,
        offset: usize,
        token: Token<'a>,
        token_pointer: &'a Pointer,
    },
}
```
Here's the iterator:
```rust
#[derive(Debug)]
pub struct WalkTo<'p, 'v> {
    cursor: &'v Value,
    original_value: &'v Value,
    full_path: &'p Pointer,
    offset: Option<NonZeroUsize>,
}

impl<'p, 'v> WalkTo<'p, 'v> {
    pub fn new(value: &'v Value, to: &'p Pointer) -> Self {
        Self {
            cursor: value,
            original_value: value,
            full_path: to,
            offset: None,
        }
    }

    fn handle_err(
        &self,
        _err: ResolveError,
        path: &'p Pointer,
        _resolvable: &'p Pointer,
    ) -> Option<Result<(&'p Pointer, &'v Value), ResolveError>> {
        // TODO: determine appropriate offsets and update the error rather than
        // re-resolving; for now, we are just punting to `Resolve` to recreate it
        Some(Err(self.original_value.resolve(path).unwrap_err()))
    }
}
```
```rust
impl<'p, 'v> Iterator for WalkTo<'p, 'v> {
    type Item = Result<(&'p Pointer, &'v Value), ResolveError>;

    fn next(&mut self) -> Option<Self::Item> {
        // we need the offset to determine where we are in the pointer
        let offset = if let Some(offset) = self.offset {
            // the offset has previously been set, so we use it
            offset.get()
        } else {
            // A `None` offset means we are at the beginning of the walk.
            // The target path being root is a special case which must be
            // handled separately, because we later account for tokens,
            // and an empty path has none.
            if self.full_path.is_root() {
                // the target path is root; setting the offset to 1 ensures we
                // do not send repeats, due to the bounds check on offset below
                self.offset = NonZeroUsize::new(1);
                return Some(Ok((Pointer::root(), self.cursor)));
            }
            // if the offset was not previously set, we start at 0
            0
        };
        // make sure we are not out of bounds
        if offset > self.full_path.len() {
            return None;
        }
        // split the path at the offset, where everything to the left
        // is the full path of the current value to be sent and everything
        // to the right is the remaining path to be resolved
        let (path, remaining) = self
            .full_path
            .split_at(offset)
            .unwrap_or((self.full_path, Pointer::root()));
        if let Some(next) = remaining.first() {
            // if there is a next token, we advance the offset by the next
            // token's length plus 1 to account for the slash
            self.offset = NonZeroUsize::new(offset + next.encoded().len() + 1);
        } else {
            // otherwise we intentionally push the offset out of bounds
            self.offset = NonZeroUsize::new(offset + 1);
        }
        // we want the last token as a `&Pointer` so that we can use the
        // resolve logic. The path is either splittable (contains a token)
        // or is empty (root).
        //
        // If it is splittable, we use the token's length to determine the
        // token's offset and split the path there; otherwise we use the
        // root pointer.
        let resolvable = path
            .last()
            .map(|t| path.split_at(path.len() - t.encoded().len() - 1).unwrap().1)
            .unwrap_or(Pointer::root());
        // attempt to resolve the value
        let result = self.cursor.resolve(resolvable);
        let value = match result {
            Ok(ok) => ok,
            Err(err) => return self.handle_err(err, path, resolvable),
        };
        // move the cursor to the resolved value
        self.cursor = value;
        Some(Ok((path, value)))
    }
}
```
https://github.com/chanced/grill/blob/main/grill-core/src/source/walk/to.rs
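The sequence of partial pointers the iterator pairs with each resolved value can be sketched std-only, using plain encoded strings rather than the `jsonptr` types. This toy helper is only an illustration of the yielded path prefixes, not of the crate's API:

```rust
// Toy sketch of the "partial pointer per step" idea: for a target pointer,
// produce each ancestor path from the root down to the target. The WalkTo
// iterator above pairs each of these with the value resolved at that step.
fn ancestor_paths(ptr: &str) -> Vec<&str> {
    let mut out = vec![""]; // the root pointer is the empty string
    for (i, b) in ptr.bytes().enumerate() {
        // every separator after the first opens a new token, so the bytes
        // before it form a complete ancestor path
        if b == b'/' && i != 0 {
            out.push(&ptr[..i]);
        }
    }
    if !ptr.is_empty() {
        out.push(ptr); // the target itself is the final step
    }
    out
}

fn main() {
    assert_eq!(
        ancestor_paths("/foo/bar/0"),
        ["", "/foo", "/foo/bar", "/foo/bar/0"]
    );
    // a root target yields only the root step
    assert_eq!(ancestor_paths(""), [""]);
}
```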
I think a Cursor API, like the experimental one in `BTreeMap`, would be awesome.
I sort of figured indices weren't the right abstraction for this. And I've had to deal with simultaneous traversal of a Value and Pointer too. Cursor feels like the right abstraction for this. 👍
Ah, that is interesting. I can't quite get my head around how it'd feel interacting with it though. It may be too complex for some use cases while incredibly useful for others. The ghost elements are definitely unusual though.
One thing that would have cut down the complexity of the iterator a bit is a `resolve_token` method on `Resolve` / `Pointer`. It would have eliminated this:

```rust
let resolvable = path
    .last()
    .map(|t| path.split_at(path.len() - t.encoded().len() - 1).unwrap().1)
    .unwrap_or(Pointer::root());
```
It's a simple addition but would be a breaking change.
I think we'd need some sort of constructed pointer type if the idea is to use the cursor to traverse a value. `BTreeMap` has the advantage of owning both key and value.

I guess the `Cursor` could have an internal `PointerBuf` that it manipulates and then hands out `&'cursor Pointer`.

edit: nah, this would be a bad idea.
Had a closer look at your example above; help me understand what you're trying to achieve there.
Seems to me that a big chunk of complexity stems from the fact you wish to produce a partial pointer with every iteration, together with the value you're pointing at. That's a bit niche/unusual, and I suspect that's a big part of why it ends up being complex. Do you really need that partial pointer? If not you could get by with this:
```rust
#[derive(Debug)]
pub struct WalkTo<'p, 'v> {
    value: &'v Value,
    tokens: Option<Tokens<'p>>,
}

impl<'p, 'v> WalkTo<'p, 'v> {
    pub fn new(value: &'v Value, to: &'p Pointer) -> Self {
        Self {
            value,
            tokens: Some(to.tokens()),
        }
    }
}

// placeholder
#[derive(Debug)]
pub struct ResolveError;

impl<'p, 'v> Iterator for WalkTo<'p, 'v> {
    type Item = Result<&'v Value, ResolveError>;

    fn next(&mut self) -> Option<Self::Item> {
        let tokens = self.tokens.as_mut()?;
        let curr_value = self.value;
        if let Some(next) = tokens.next() {
            match self.value {
                Value::Null | Value::Bool(_) | Value::Number(_) | Value::String(_) => {
                    // cannot descend further
                    self.tokens.take();
                    return Some(Err(ResolveError));
                }
                Value::Array(vec) => {
                    let idx = match next.to_index() {
                        Ok(idx) => match idx.for_len(vec.len()) {
                            Ok(idx) => idx,
                            Err(_) => return Some(Err(ResolveError)),
                        },
                        Err(_) => return Some(Err(ResolveError)),
                    };
                    self.value = &vec[idx];
                }
                Value::Object(map) => {
                    self.value = match map.get(next.decoded().as_ref()) {
                        Some(val) => val,
                        None => return Some(Err(ResolveError)),
                    };
                }
            }
        } else {
            // we've reached the target node
            self.tokens.take();
        }
        Some(Ok(curr_value))
    }
}
```
If you do need the partial pointers, I think we could just add a slicing API so you could do `ptr[..idx]` (where `idx` is the token index, not a character index) and get a subset of the path easily. To do it without allocation I think we can't really lean on `Tokens`, unfortunately, and maybe that's where you're getting to, but it'd still not be too bad to implement on our end.
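The allocation-free part is doable on encoded strings because a token-index slice is just a scan for the right separator. A toy sketch (plain `&str`, hypothetical helper name, not the crate's API):

```rust
// Hypothetical sketch of token-index slicing: `prefix(ptr, n)` returns the
// encoded prefix containing the first `n` tokens, without allocating, by
// scanning for the (n+1)-th separator byte.
fn prefix(ptr: &str, n: usize) -> &str {
    let mut seen = 0;
    for (i, b) in ptr.bytes().enumerate() {
        if b == b'/' {
            if seen == n {
                // the n-th separator (0-based) closes the n-token prefix
                return &ptr[..i];
            }
            seen += 1;
        }
    }
    ptr // fewer than n+1 separators: the whole pointer is the prefix
}

fn main() {
    assert_eq!(prefix("/foo/bar/baz", 0), "");
    assert_eq!(prefix("/foo/bar/baz", 2), "/foo/bar");
    // out-of-range token indices saturate to the full pointer in this sketch;
    // a real API would likely return Option instead
    assert_eq!(prefix("/foo/bar/baz", 5), "/foo/bar/baz");
}
```

This works on the encoded form because `/` inside a token is escaped as `~1`, so every raw `/` byte is a token boundary.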
I'm inclined to make `split_at` also take a token index instead of an offset. It harmonises better with the `Tokens` API, and is safer to use. I know we've talked about this before, but I feel more strongly about it now.
The niche comment likely answers a question I was going to present, which is whether providing a walk impl like the above would make any sense. I actually have it partially implemented, with `WalkTo` being done for json. I was going to submit a pull request when I had `WalkToMut` and `WalkFrom` done. I haven't ported `WalkFrom` yet.
I was able to clean it up, or at least that line, by using `split_back`. I also deleted the `original_value` from the iterator.
My situation may be niche though, indeed. I need the full path and value of each node along a path.
@asmello I was mistaken on the `split_back`. I had to revert it. I submitted a draft PR #83 that has the `WalkTo` impl. The codebase is a complete mess right now, apologies in advance. I pulled in both my `WalkTo` and `WalkFrom` impls and was porting them.
@asmello I was on mobile earlier and hadn't read all of your reply - I just wanted to push up the source I had mentioned before responding to the rest.
Believe it or not, I was contemplating ranges as well. My initial inclination was to change the signature of this:
```rust
impl Pointer {
    pub fn get(&self, index: usize) -> Option<Token> {}
}
```
to
```rust
impl Pointer {
    // this is incorrect - there should be generics but I have no idea what they'd be
    pub fn get<I: std::slice::SliceIndex>(&self, index: I) -> Option<I::Output> {}
}
```
and leave `split_at` using offsets. However, after a bit of thought, that won't fly. `split_at` should still mirror `slice` / `Vec` behavior here. I also don't know enough about the `std::slice::SliceIndex` API to know whether or not we can use it to mirror slice's `get` behavior. If it isn't possible, I assume the behavior could be replicated, but I have no idea what that lift would be.
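For what it's worth, `std::slice::SliceIndex` is a sealed trait, so a crate can't implement it for its own types; the usual workaround is a crate-owned index trait with impls for `usize` and ranges, which is how a generic `get` could mirror `str::get`. A toy sketch (all names hypothetical, and `Output` is owned here to sidestep lifetime plumbing a real impl would need):

```rust
use std::ops::Range;

struct Ptr(String); // toy encoded pointer, e.g. "/foo/bar"

// Crate-owned analogue of std::slice::SliceIndex (which is sealed).
trait PtrIndex {
    type Output;
    fn get(self, ptr: &Ptr) -> Option<Self::Output>;
}

// A single token index yields that token.
impl PtrIndex for usize {
    type Output = String;
    fn get(self, ptr: &Ptr) -> Option<String> {
        ptr.tokens().get(self).map(|t| t.to_string())
    }
}

// A token range yields a sub-pointer.
impl PtrIndex for Range<usize> {
    type Output = Ptr;
    fn get(self, ptr: &Ptr) -> Option<Ptr> {
        let tokens = ptr.tokens();
        let slice = tokens.get(self)?;
        Some(Ptr(slice.iter().map(|t| format!("/{t}")).collect()))
    }
}

impl Ptr {
    fn tokens(&self) -> Vec<&str> {
        if self.0.is_empty() {
            vec![]
        } else {
            self.0[1..].split('/').collect()
        }
    }
    // mirrors str::get: one method, multiple index types
    fn get<I: PtrIndex>(&self, index: I) -> Option<I::Output> {
        index.get(self)
    }
}

fn main() {
    let ptr = Ptr("/foo/bar/baz".to_string());
    assert_eq!(ptr.get(1), Some("bar".to_string()));
    assert_eq!(ptr.get(0..2).map(|p| p.0), Some("/foo/bar".to_string()));
    assert_eq!(ptr.get(5), None);
}
```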
I've been thinking about this more, and I don't really get the use case for `WalkTo` in general. I tried looking for how it's used in `grill`, but I couldn't find any references to it?
In general I'm well aligned on the usefulness of traversing a `serde_json::Value` using a `jsonptr::Pointer`. This is the primary use case JSON Pointers were designed for, and why we have `Resolve` and `ResolveMut`. What I'm not really visualising is how we'd use "partials" (as in, the partial path and the subtree of the JSON `Value`) while descending to the target node. Maybe it'd be useful for inspecting the types along the way, in a way that's more robust than just inspecting the pointer itself, but that doesn't really explain why we'd emit the partial path too. I'd like to understand that use case first before we talk about the specifics of the API here.
Separately, I think we can just go ahead and add a slicing API, I think that's actually useful on its own. Will do some experimentation with it.
Ah, sorry! Yea, it's a mess over in that repo at the moment. I'm in the middle of rewriting it to use the iterator but I have to swap things over there.
This is going to be probably more explanation than you're hoping for but I'll try and keep it short.
With JSON Schema 2019+ (refs existed in 07 but didn't do anything), you can reference a foreign schema by URI. If that URI contains a JSON Pointer, you have to resolve that location and compile only it.
However, ancestors of the schema can set the dialect (`$schema`), change the base URI via `$id`, introduce anchors, etc., so while they may not need to be compiled, there is information that has to be checked to see if it influences the target.
What you see there is me trying to compile up from the target. Basically, I was going to check one level up to see if it contained this schema as a subschema; if it did, index it (not compile it), and then continue searching up for the schema which had it as a subschema.
Ultimately, I decided that just wasn't worth it. For one, each of those checks requires re-resolving the schema being scanned. Second, the logic was going to be messy and error prone. So I'm going to go back to my initial design (sort of, cleaned up, hopefully), and just scan from the root all the way up to the target. I suspect this may change in the future but it's the easiest and safest way for me to finish.
Basically I'll scan the root, check if it contains a sub schema that has a partial pointer that matches my target, and then skip to that sub schema and repeat. If the schema I'm scanning doesn't contain a subschema with a partial path of my schema, I move to the next and check it until I find one that does or until I hit my target.
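The core check in that scan, whether a subschema's partial path leads toward the target, can be sketched as a token-aligned prefix test on encoded pointers. A toy, std-only illustration (hypothetical helper, not from `grill` or `jsonptr`):

```rust
// Hedged sketch of the prefix check described above: a candidate subschema
// path leads toward the target if it is a token-aligned prefix of the target
// pointer. A plain starts_with is not enough: "/foo" must not match "/foobar".
fn is_token_prefix(prefix: &str, target: &str) -> bool {
    if !target.starts_with(prefix) {
        return false;
    }
    // aligned only if the prefix is the whole target or the next byte is '/'
    target.len() == prefix.len() || target.as_bytes()[prefix.len()] == b'/'
}

fn main() {
    assert!(is_token_prefix("/foo", "/foo/bar"));
    assert!(is_token_prefix("/foo", "/foo"));
    assert!(!is_token_prefix("/foo", "/foobar"));
    assert!(is_token_prefix("", "/foo")); // root is a prefix of everything
}
```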
In either case, I need the full path as I take it and create a URI with the base URI `$id` (or whatever they resolved with) and the path as the fragment. Each location within a source is saved as an `InternalLink` which is persisted in `Sources`.
I keep all of the root documents as `Arc<Value>`, and locations within them have links. The actual execution / validation of values with the schemas does not use the `Sources`. All necessary data needed to evaluate is replicated within the keyword's instance - or will be, once I get back to it. Aside from compiling, `Sources` exists so the values of schemas are present for analysis and eventually serde of the entire `Interrogator` instance.
This is at least my second time writing the lib - the previous, most complete, version was a hot mess of OOP mentality, slammed full of `dyn`, `anymap`, etc.
I should mention that the scenario described above explains `WalkTo` but doesn't explain the need for `WalkFrom`.
In the case where an external schema is referenced by an unknown anchor (e.g. `example.com/schema#some_anchor`), things get a bit more messy.
For unknown anchors, I need to start from the root document and scan it and all descendant schemas. As part of the scan, each schema has one or more links created for it, based on the path and any anchors. After scanning the whole doc, I check `Sources` for the URI with the anchor. If it has been linked, I have my target.
The reason it isn't simply a walk is because in JSON Schema, any schema which has an `$id` is considered a document root. As such, I may need to start deeply embedded in the actual value and traverse, scanning all possible schemas within it.
* Confusingly enough, I also have edits above the edit header. Terribly sorry if you read it as an email only to discover I have edited it without clear distinction. I need to get better at proofreading my replies, especially when I'm on mobile.
The more I think about this, the more convinced I am that the walk iterators don't belong in this crate, at least not in their current form. In fact, I may not end up using `WalkFrom`, as I'd burn extra allocations for skipped entries (non-schemas).
I'm certain that a range API is the way to go. It'll allow consumers to craft their own bespoke traversal strategies a great deal more efficiently (at least in terms of dev expenditures).
However, if we are going to double down on it being a collection type, which I think is the right path, we should probably revisit `len` as well. That's going to be a nasty surprise for anyone that's using it and doesn't read the release notes though - I regret rushing that addition through.
> However, if we are going to double down on it being a collection type, which I think is the right path, we should probably revisit `len` as well. That's going to be a nasty surprise for anyone that's using it and doesn't read the release notes though - I regret rushing that addition through.
It's a good point. I don't think the choice we've made was unreasonable. `str` uses the same definition, and has a similar duality: it's conceptually an ordered collection of `char` items, but `str::len` returns the size in bytes. I don't think there's a 'right' answer here regardless of how we proceed with the rest of the API.
We can introduce a `Pointer::count` as a shortcut to `Pointer.tokens().count()`, though I'm not sure it's worth it. The latter is quite readable and explicit, even if it takes a couple of method calls.
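The duality under discussion can be shown with a toy, std-only example on an encoded pointer string; the same split exists in `str` between byte length and character count:

```rust
// Toy illustration (not the jsonptr API): on an encoded pointer string,
// the byte length and the token count are different quantities. Counting
// separators works because '/' inside a token is escaped as "~1".
fn token_count(ptr: &str) -> usize {
    if ptr.is_empty() {
        0
    } else {
        ptr.bytes().filter(|&b| b == b'/').count()
    }
}

fn main() {
    let ptr = "/foo/bar";
    assert_eq!(ptr.len(), 8); // byte length, O(1)
    assert_eq!(token_count(ptr), 2); // tokens, O(n) unless cached
    // the analogous duality in str:
    let s = "héllo";
    assert_eq!(s.len(), 6); // bytes
    assert_eq!(s.chars().count(), 5); // chars
}
```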
Also, perhaps one point in favour of `len` being the byte length: generally developers expect `len` to be O(1). In `std`, I think that's true for all collection types (even `LinkedList`).
That said, I suppose we could make it constant time if we kept a token count separately (as the original implementation did), at a small cost in bookkeeping. Will give this some thought, doesn't seem obviously better, but it's worth considering.
Yea, this is a tough one. I'm not sure what the right call is. Inversely, `.as_str().len()` gets the byte count.
Let's stew on `len`; it doesn't need to be decided now.
Here's a radical idea: let's get rid of `len` altogether.

`std::path::Path` doesn't have it.

We can introduce `Pointer::size` to replace it. This is a much better name.

Then if one wishes to count tokens, they can use `.tokens().count()` explicitly. I don't think that's bad. And we can always add `Pointer::count` later if it proves a pain point.

We can deprecate `len` for the next release (or the one after) to avoid an immediate break.
Let's sit on this for a while, though. 👍
I'm game for removing it. I think that's the answer. `Path` setting the precedent seals the deal for me.

Even if we end up adding it back, I think it can be added after it goes through the deprecation cycle. By that point, it'll have been at least 2 minors. Any problems that arise will be due to jumping several minor releases.
Added a note in #84 about the complexity of token-based indexing.
We need a way to slice / split a pointer by range, using the token index.