Allow for splitting a `Pointer` by `Token` position (index)

chanced commented 4 months ago

We need a way to slice / split a token by range, using the index.

/foo/bar/baz
↑   ↑   ↑
0   1   2

chanced commented 4 months ago

Alternatively, we just continue to use the offsets.

chanced commented 1 month ago

Dealing with split_at and using offsets is somewhat of a pain.

I'm still not sure what the right answer is here. Some of the API treats the Pointer as a string, and in a lot of cases this is what you'd want, but it also behaves like a collection.

If we double down on the Pointer being a collection, then we'd need to re-evaluate the methods which treat it as a string.

chanced commented 1 month ago

It wasn't horrible to write the iterator below, but I think it could be made a lot cleaner. If Component were something like

enum Component<'a> {
    Root,    
    Component {
        /// full path to this component
        path: &'a Pointer, 
        offset: usize,
        token: Token<'a>,
        token_pointer: &'a Pointer,
    }
}

Here's the iterator:

#[derive(Debug)]
pub struct WalkTo<'p, 'v> {
    cursor: &'v Value,
    original_value: &'v Value,
    full_path: &'p Pointer,
    offset: Option<NonZeroUsize>,
}
impl<'p, 'v> WalkTo<'p, 'v> {
    pub fn new(value: &'v Value, to: &'p Pointer) -> Self {
        Self {
            cursor: value,
            original_value: value,
            full_path: to,
            offset: None,
        }
    }

    fn handle_err(
        &self,
        _err: ResolveError,
        path: &'p Pointer,
        _resolvable: &'p Pointer,
    ) -> Option<Result<(&'p Pointer, &'v Value), ResolveError>> {
        // TODO: determine appropriate offsets and update error rather than re-resolving
        // for now, we are just punting to `Resolve` to recreate the error
        Some(Err(self.original_value.resolve(path).unwrap_err()))
    }
}

impl<'p, 'v> Iterator for WalkTo<'p, 'v> {
    type Item = Result<(&'p Pointer, &'v Value), ResolveError>;
    fn next(&mut self) -> Option<Self::Item> {
        // we need to get the offset to determine where we are in the pointer
        let offset = if let Some(offset) = self.offset {
            // the offset has previously been set, so we use that
            offset.get()
        } else {
            // An empty offset means we are at the beginning of the walk.

            // this is a special case where the target path is root
            // we need to handle this separately because we later account
            // for tokens, which an empty path has none of.
            if self.full_path.is_root() {
                // the target path is root, so we set the offset to 1 ensures we
                // do not send repeats due to the bounds check on offset below.
                self.offset = NonZeroUsize::new(1);
                return Some(Ok((Pointer::root(), self.cursor)));
            }
            // if the offset was not previously set, we start at 0
            0
        };

        // checking to make sure we are not out of bounds
        if offset > self.full_path.len() {
            return None;
        }

        // split the path at the offset, where everything to the left
        // is the full path of the current value to be sent and everything
        // to the right is the remaining path to be resolved.
        let (path, remaining) = self
            .full_path
            .split_at(offset)
            .unwrap_or((self.full_path, Pointer::root()));

        if let Some(next) = remaining.first() {
            // if there is a next token, we set the offset to the next token's length
            // plus 1 to account for the slash.
            self.offset = NonZeroUsize::new(offset + next.encoded().len() + 1);
        } else {
            // otherwise we intentionally push the offset out of bounds
            self.offset = NonZeroUsize::new(offset + 1)
        }

        // we want the last token as a `&Pointer` so that we can use the resolve logic
        // the path is either splittable (contains a token) or is empty (root).
        //
        // If it is splittable, we either use the token's length to determine the token's
        // offset and split the path there or we use the root pointer.
        let resolvable = path
            .last()
            .map(|t| path.split_at(path.len() - t.encoded().len() - 1).unwrap().1)
            .unwrap_or(Pointer::root());

        // we attempt to resolve the value
        let result = self.cursor.resolve(resolvable);

        let value = match result {
            Ok(ok) => ok,
            Err(err) => return self.handle_err(err, path, resolvable),
        };

        // moving the cursor value to the resolved value
        self.cursor = value;

        Some(Ok((path, value)))
    }
}

https://github.com/chanced/grill/blob/main/grill-core/src/source/walk/to.rs

asmello commented 1 month ago

I think a Cursor API, like the experimental one in BTree, would be awesome.

I sort of figured indices weren't the right abstraction for this. And I've had to deal with simultaneous traversal of a Value and Pointer too. Cursor feels like the right abstraction for this. 👍

chanced commented 1 month ago

Ah, that is interesting. I can't quite get my head around how it'd feel interacting with it though. It may be too complex for some use cases while incredibly useful for others. The ghost elements are definitely unusual though.

chanced commented 1 month ago

One thing that would have cut down the complexity of the iterator a bit is a resolve_token method on Resolve / Pointer. It would have eliminated this:

let resolvable = path
    .last()
    .map(|t| path.split_at(path.len() - t.encoded().len() - 1).unwrap().1)
    .unwrap_or(Pointer::root());

It's a simple addition but would be a breaking change.

chanced commented 1 month ago

I think we'd need some sort of constructed pointer type if the idea is to use the cursor to traverse a value. BTreeMap has the advantage of owning both key and value.

chanced commented 1 month ago

I guess the Cursor could have an internal PointerBuf that it manipulates and then hands out &'cursor Pointer

edit: nah, this would be a bad idea.

asmello commented 3 weeks ago

Had a closer look at your example above, help me understand what you're trying to achieve there.

Seems to me that a big chunk of complexity stems from the fact you wish to produce a partial pointer with every iteration, together with the value you're pointing at. That's a bit niche/unusual, and I suspect that's a big part of why it ends up being complex. Do you really need that partial pointer? If not you could get by with this:

#[derive(Debug)]
pub struct WalkTo<'p, 'v> {
    value: &'v Value,
    tokens: Option<Tokens<'p>>,
}

impl<'p, 'v> WalkTo<'p, 'v> {
    pub fn new(value: &'v Value, to: &'p Pointer) -> Self {
        Self {
            value,
            tokens: Some(to.tokens()),
        }
    }
}

// placeholder
#[derive(Debug)]
pub struct ResolveError;

impl<'p, 'v> Iterator for WalkTo<'p, 'v> {
    type Item = Result<&'v Value, ResolveError>;
    fn next(&mut self) -> Option<Self::Item> {
        let tokens = self.tokens.as_mut()?;
        let curr_value = self.value;
        if let Some(next) = tokens.next() {
            match self.value {
                Value::Null | Value::Bool(_) | Value::Number(_) | Value::String(_) => {
                    // cannot descend further
                    self.tokens.take();
                    return Some(Err(ResolveError));
                }
                Value::Array(vec) => {
                    let idx = match next.to_index() {
                        Ok(idx) => match idx.for_len(vec.len()) {
                            Ok(idx) => idx,
                            Err(_) => return Some(Err(ResolveError)),
                        },
                        Err(_) => return Some(Err(ResolveError)),
                    };
                    self.value = &vec[idx];
                }
                Value::Object(map) => {
                    self.value = match map.get(next.decoded().as_ref()) {
                        Some(val) => val,
                        None => return Some(Err(ResolveError)),
                    };
                }
            }
        } else {
            // we've reached the target node
            self.tokens.take();
        }
        Some(Ok(curr_value))
    }
}

If you do need the partial pointers, I think we could just add a slicing API so you could do ptr[..idx] (where idx is the token index, not a character index) and get a subset of the path easily. To do it without allocation I think we can't really lean in on Tokens, unfortunately, and maybe that's where you're getting to, but it'd still not be too bad to implement on our end.

I'm inclined to make split_at also take a token index instead of an offset. It harmonises better with the Tokens API, and is safer to use. I know we've talked about this before, but I feel stronger now.

chanced commented 3 weeks ago

The niche comment likely answers a question I was going to present, which is whether providing a walk impl like the above would make any sense. I actually have it partially implemented, with WalkTo being done for json. I was going to submit a pull request when I had WalkToMut and WalkFrom done. I haven't ported WalkFrom yet.

I was able to clean it up, or at least that line, by using split_back. I also deleted the original_value from the iterator.

My situation may be niche though, indeed. I need the full path and value of eqch node along a path.

chanced commented 3 weeks ago

@asmello I was mistaken on the split_back. I had to revert it. I submitted a draft PR #83 that has the WalkTo impl. The codebase is a complete mess right now, apologies in advance. I pulled in both my WalkTo and WalkFrom impl and was porting them.

chanced commented 3 weeks ago

@asmello I was on mobile earlier and hadn't read all of your reply - I just wanted to push up the source I had mentioned before responding to the rest.

Believe it or not, I was contemplating ranges as well. My initial inclination was to change the signature of this:

impl Pointer {
    pub fn get(&self, index: usize) -> Option<Token> {}
}

to

impl Pointer {
   // this is incorrect - there should be generics but i have no idea what they'd be
    pub fn get<I: std::slice::SliceIndex>(&self, index: I) -> Option<I::Output> {}
}

and leave the split_at using offsets. However, after a bit of thought, that won't fly. split_at should still mirror slice / Vec behavior here. I also don't know enough about the std::slice::SliceIndex API to know whether or not we can use it mirror slice's get behavior. If it isn't possible, I assume the behavior could be replicated but no idea on that lift.

asmello commented 3 weeks ago

I've been thinking about this more, and I don't really get the use case for WalkTo in general. I tried looking for how it's used in grill, but I couldn't find any references to it?

In general I'm well aligned on the usefulness of traversing a serde_json::Value using a jsonptr::Pointer. This is the primary use case JSON Pointers were designed for, and why we have Resolve and ResolveMut. What I'm not really visualising is how we'd use "partials" (as in, the partial path, and the subtree of the JSON Value) while descending to the target node. Maybe it'd be useful for inspecting the types along the way, in a way that's more robust than just inspecting the pointer itself, but that doesn't really explain why we'd emit the partial path too. I'd like to understand that use case first before we talk about the specifics of the API here.

Separately, I think we can just go ahead and add a slicing API, I think that's actually useful on its own. Will do some experimentation with it.

chanced commented 3 weeks ago

Ah, sorry! Yea, it's a mess over in that repo at the moment. I'm in the middle of rewriting it to use the iterator but I have to swap things over there.

This is going to be probably more explanation than you're hoping for but I'll try and keep it short.

With json schema 2019+(refs existed in 07 but didn't do anything), you can reference a foreign schema by uri. If that URI contains a json pointer, you have to resolve that location and compile only it.

However, ancestors of the schema can set the dialect ($schema), change the base uri via $id, introduce anchors, etc. so while they may not need to be compiled, there is information that has to be checked to see if it influences the target.

What you see there is me trying to compile up from the target. Basically I was going to try and check one level up, see if it contained this schema as a sub schema, if it did, index it (not compile) and then search up for the schema which had it as a sub schema.

Ultimately, I decided that just wasn't worth it. For one, each of those checks requires re-resolving the schema being scanned. Second, the logic was going to be messy and error prone. So I'm going to go back to my initial design (sort of, cleaned up, hopefully), and just scan from the root all the way up to the target. I suspect this may change in the future but it's the easiest and safest way for me to finish.

Basically I'll scan the root, check if it contains a sub schema that has a partial pointer that matches my target, and then skip to that sub schema and repeat. If the schema I'm scanning doesn't contain a subschema with a partial path of my schema, I move to the next and check it until I find one that does or until I hit my target.

In either case, I need the full path as I take it and create a URI with the base uri $id (or whatever they resolved with) and the path as the fragment. Each location within a source is saved as an InternalLink which is persisted in Sources.

I keep all of the root documents as Arc<Value> and locations within it have links. The actual execution / validation of values with the schemas do not use the Sources. All necessary data needed to evaluate is replicated within the keyword's instance - or will be, once I get back to it. Aside from compiling, Sources exists so the values of schemas are present for analysis and eventually serde of the entire Interrogator instance.

This is at least my second time writing the lib - the previous, most complete, version was a hot mess of oop mentality, slammed full of dyn, anymap, etc.

edit

I should mention that the scenario described above explains WalkTo but doesn't explain the need for WalkFrom.

In the case where an external schema is referenced by an unknown anchor(eg example.com/schema#some_anchor), things get a bit more messy.

For unknown anchors, I need to start from the root document and scan it and all descendant schemas. As part of the scan, each schema has one or more links created for it, based on the path and any anchors. After scanning the whole doc, I check Sources for the uri with the anchor. If it has been linked, I have my target.

The reason it isn't simply walk is because in json schema, any schema which has an $id is considered a document root. As such, I may need to start deeply embedded in the actual value and traverse, scanning all possible schemas within it.

* confusingly enough, I also have edits above the edit header. Terribly sorry if you read it as an email only to discover I have edited it without clear distinction. I need to get better at proof reading my replies, especially if I'm on mobile.

chanced commented 3 weeks ago

The more I think about this, the more convinced I am that the walk iterators don't belong in this crate, at least not in their current form. In fact, I may not end up using WalkFrom as I'd burn extra allocations for skipped entries (non-schemas).

I'm certain that a range API is the way to go. It'll allow consumers to craft their own bespoke traversal strategies a great deal more efficiently (at least in terms of dev expenditures).

However, if we are going to double down on it being a collection type, which I think is the right path, we should probably revisit len as well. That's going to be a nasty surprise for anyone that's using it and doesn't read the release notes though - I regret rushing that addition through.

asmello commented 3 weeks ago

However, if we are going to double down on it being a collection type, which I think is the right path, we should probably revisit len as well. That's going to be a nasty surprise for anyone that's using it and doesn't read the release notes though - I regret rushing that addition through.

It's a good point. I don't think the choice we've made was unreasonable. str uses the same definition, and has a similar duality: it's conceptually an ordered collection of char items, but str::len returns the size in bytes. I don't think there's a 'right' answer here regardless of how we proceed with the rest of the API.

We can introduce a Pointer::count as a shortcut to Pointer.tokens().count(), though not sure it's worth it. The latter is quite readable and explicit, even if it takes a couple method calls.

asmello commented 3 weeks ago

Also, perhaps one point in favour of len being the byte length: generally developers expect len to be a O(1). In std, I think that's true for all collection types (even LinkedList).

That said, I suppose we could make it constant time if we kept a token count separately (as the original implementation did), at a small cost in bookkeeping. Will give this some thought, doesn't seem obviously better, but it's worth considering.

chanced commented 3 weeks ago

Yea, this a tough one. I'm not sure what the right call is. Inversely, .as_str().len() gets the byte count.

Lets stew on len, it doesn't need to be decided now.

asmello commented 3 weeks ago

Here's a radical idea: let's get rid of len altogether.

std::path::Path doesn't have it.

We can introduce Pointer::size to replace it. This is a much better name.

Then if one wishes to count tokens, they can use .tokens().count() explicitly. I don't think that's bad. And we can always add Pointer::count later if it proves a pain point.

We can deprecate len for next release (or the one after) to avoid an immediate break.

Let's sit on this for a while, though. 👍

chanced commented 3 weeks ago

I'm game for removing it. I think that's the answer.Path setting the precedent seals the deal for me.

chanced commented 3 weeks ago

Even if we end up adding it back, I think it can be added after it goes through the deprecation cycle. By that point, it'll have been at least 2 minors. Any problems that arise will be due to jumping several minor releases.

asmello commented 3 weeks ago

Added a note in #84 about the complexity of token-based indexing.

chanced / jsonptr

Allow for splitting a `Pointer` by `Token` position (index) #42

edit