bblfsh / sdk

Babelfish driver SDK
GNU General Public License v3.0
23 stars 27 forks source link

xpath query is not validated #332

Open kuba-- opened 5 years ago

kuba-- commented 5 years ago

The issue is related to:

Generally SDK uses 3rd-party package (github.com/antchfx/xpath) to compile xpath queries. Unfortunately lot of incorrect xpaths passes, e.g.:

    x, err = Compile("//*[@role=\"Return\"]")
    if err != nil {
        t.Fatalf("should be correct but got error %s", err)
    }
    t.Logf("%v\n", x)

    x, err = Compile("BAD")
    if err != nil {
        t.Fatalf("should be correct but got error %s", err)
    }
    t.Logf("%v\n", x)

    x, err = Compile("this is a totally crap. For sure not valid @xpath")
    if err != nil {
        t.Fatalf("should be correct but got error %s", err)
    }
    t.Logf("%v\n", x)
dennwc commented 5 years ago

Yes, the library isn't ideal. We should fix it upstream.

But also there were few discussions that XPath might not be the best language going forward, so maybe we will just use a new language? Will need to decide at some point.

kuba-- commented 5 years ago

Ok. I'm wondering if graphql would be a good choice here.

dennwc commented 5 years ago

Real GraphQL needs a srits schema and we don't have one by design.

But I've implemented a GraphQL endpoint with dynamic schema for Cayley, so in practice it can be done.

The main problem is that GraphQL still cannot change the shape of a response. If you are querying multiple nested nodes, but only want few value extracted from them, you will end up requesting a skeleton of that subtree with few values in there. We need to check if there are any alternatives that can allow querying exactly what the user needs.

smola commented 5 years ago

Keep in mind that we do support, recommend and demo XPath today. Feel free to open a separate issue and start discussing proposals for the future, but this is a real issue today.

juanjux commented 5 years ago

Since the errors on the test code always say the same, so it's not clear to me what of the examples are supossed to be right or wrong, but I've tested with an online XPath validator. I've also tested using the client-python v3 (which uses the SDK and thus the library).

  1. The first example works on the web (and produce results) even if it uses double quotes. It also works with the Python client.
  2. The second example doesn't produce any error so it seems to be valid XPath. Doesn't fail either with the client, just produce empty results on both places.
  3. Third example doesn't validate with the web but doesn't produce any error with the client, so this one really seems to be a bug.
bzz commented 5 years ago

@kuba-- @ajnavarro if you guys have any thoughts on results of @juanjux preliminary research - please feel free to share

kuba-- commented 5 years ago

For me it's all about optimization. If we can reject wrong query earlier instead of waiting till the end to say "no results" (because query is no valid) then it's better. In other words we'll take what LA team give us ;) The open question is - do we wanna change 3rd library to parse xpath, or stick to what we use so far? @ajnavarro - correct me if I'm wrong.

ajnavarro commented 5 years ago

@kuba-- totally agree.

I think the problem here is that almost any string is a valid XPath query. Maybe we should add some extra validation for our use case, like the supported node names. Per example, if the user tries to query //uast:Idenitfier instead of //uast:Identifier, in my opinion, we should throw an error before executing the query.

kuba-- commented 5 years ago

We may introduce waterfall of validators. XPath library should validate just xpath. So, bblfsh will be sure that got valid xpath and can focus on validating own syntax and reject e.g. non-uast queries: //ast:Paradigm

dennwc commented 5 years ago

This will require us to have a strict schema exposed to XPath. And before even trying this we will need to refactor XPath library to allow such hooks to be installed.

kuba-- commented 5 years ago

No worries, I can help ;) But frankly, this is just a wish. Actually we already have a kind of schema, but not written down, to let us generate validators. It's like having yacc file for our xpath. I believe, any syntax tree cannot be universal if it doesn't have universal validators. It's like a programming language where in theory you can use any syntax but sometimes it will return something and sometimes not. It should't work like this. If you use fn instead of func it should tell you that this is wrong, even if both cases it's a valid literal.

dennwc commented 5 years ago

In general, I think it's a good idea, but we are not ready for it right now.

For example "our" XPath is well defined for Semantic nodes (uast:*), but we know nearly nothing about nodes unique to each programming language (cpp:*). I have plans to derive this schema later, but for now, it's only possible to assert for uast:-related queries.

kuba-- commented 5 years ago

So let's do it in "baby steps". Personally, I prefer many small features/tools which we can add one by one instead of one big 🎉 effect.

dennwc commented 5 years ago

Totally agree! We can start by making an XPath library better. It's a bit messy right now.

kuba-- commented 5 years ago

Talking about "baby steps", how about changing the current xpath library (github.com/antchfx/xpath) to the gopkg.in/xmlpath.v2?

At least in following test it can find the last invalid xpath:

package main

import (
    . "gopkg.in/xmlpath.v2"
    // . "github.com/antchfx/xpath"
)
func main() {
    MustCompile("//*[@role=\"Return\"]")
    MustCompile("BAD")

    MustCompile("this is a totally crap. For sure not valid @xpath") // <----
}
dennwc commented 5 years ago

I checked the library briefly, it looks better but has one downside: it cannot work with virtual nodes, only with full XML tree.

Still, it may be an option to fork the library and make it work with some node interface instead of a specific type.