JanusGraph / janusgraph

JanusGraph: an open-source, distributed graph database
https://janusgraph.org
Other
5.3k stars 1.17k forks source link

Concatenated full-text search queries not working #1379

Open FlorianHockmann opened 5 years ago

FlorianHockmann commented 5 years ago

This works as expected:

gremlin> g.V().has('Email', 'Address', textContainsRegex('example.com')).
            has('Address', textContainsPrefix('root')).
            values('Address')
==>root@example.com

which uses only predicates for full-text search. If the second predicate is replaced by a string search predicate, then it doesn't work any more:

gremlin> g.V().has('Email', 'Address', textContainsRegex('example.com')).
            has('Address', textPrefix('root')).
            values('Address')
// no result

It works for some reason again if a limit() is added between the two has() steps:

gremlin> g.V().has('Email', 'Address', textContainsRegex('example.com')).
            limit(5).
            has('Address', textPrefix('root')).
            values('Address')
==>root@example.com

The index was created with Mapping.TEXT.asParameter() and a custom analyzer. However the analyzer shouldn't be important here as the second has() can't be answered with the index as it would need a mapping of STRING instead of TEXT.

Versions and backends: JanusGraph: 0.3.1 ScyllaDB: 2.1 Elasticsearch: 5.6.3

porunov commented 5 years ago

@FlorianHockmann could you please provide also the whole command how you are creating the mixed index? How many there addresses with example.com?

FlorianHockmann commented 5 years ago

could you please provide also the whole command how you are creating the mixed index?

Sure, it should have been created like this (we use a script that is auto-generated to create the index, I tried to reconstruct it in a few lines, but there could be syntax errors now):

address = mgmt.makePropertyKey("Address").dataType(String.class).
                cardinality(Cardinality.SINGLE).make()
email = mgmt.makeVertexLabel("Email").make()
mgmt.buildIndex("EmailByAddress, Vertex.class).
    addKey(address, Mapping.TEXT.asParameter(), new Parameter<String>('EmailByAddress', 'email')).
    indexOnly(email).buildMixedIndex("search")

I'm not sure if that's also relevant, but we then use an index template to define how this property should be indexed in Elasticsearch:

{
  "template": "<%= janusgraphKeyspace %>_emailbyaddress*",
  "order": 1,
  "settings": {
    "analysis": {
      "filter": {
        "emailfilter": {
          "type": "pattern_capture",
          "preserve_original": 1,
          "patterns": ["([^@]+)", "(\\\\p{L}+)", "(\\\\d+)", "@(.+)", "([^-@]+)"]
        }
      },
      "analyzer": {
        "emailanalyzer": {
          "tokenizer": "emailtokenizer",
          "filter": ["emailfilter", "lowercase", "unique"]
        }
      },
      "tokenizer": {
        "emailtokenizer": {
          "type": "uax_url_email"
        }
      }
    }
  },
  "version": 1
}

How many there addresses with example.com?

9 right now:

gremlin> g.V().has('EmailMeta', 'Address', textContainsRegex('example.com')).count()
==>9
porunov commented 5 years ago

@FlorianHockmann thanks for the information. I will try to research the issue this week or the next one if no one resolves it before. Not sure if I will find the problem, but if I find something interesting, I will post it here.

Also, I have a question not relevant to the issue, but what is the purpose of using new Parameter<String>('EmailByAddress', 'email')? Where this parameter will be used?

FlorianHockmann commented 5 years ago

I will try to research the issue this week or the next one if no one resolves it before.

That would of course be great. If you need any more information, just ask.

Also, I have a question not relevant to the issue, but what is the purpose of using new Parameter('EmailByAddress', 'email')? Where this parameter will be used?

I'm not 100% sure, but I think we use it to specify the name of the index with field mapping so we can match on that name with the index template I posted above.

porunov commented 5 years ago

@FlorianHockmann I can't create the same template. Getting the error:

{"error":{"root_cause":[{"type":"pattern_syntax_exception","reason":"Illegal repetition near index 3\n(\\p{L}+)\n ^"}],"type":"pattern_syntax_exception","reason":"Illegal repetition near index 3\n(\\p{L}+)\n ^"},"status":400}

I am using the next command:

curl -X PUT "192.168.56.50:9200/_template/janusgraph_emailbyaddress_template" -H 'Content-Type: application/json' -d'
{
  "template": "janusgraph_emailbyaddress*",
  "order": 1,
  "settings": {
    "analysis": {
      "filter": {
        "emailfilter": {
          "type": "pattern_capture",
          "preserve_original": 1,
          "patterns": ["([^@]+)", "(\\\\p{L}+)", "(\\\\d+)", "@(.+)", "([^-@]+)"]
        }
      },
      "analyzer": {
        "emailanalyzer": {
          "tokenizer": "emailtokenizer",
          "filter": ["emailfilter", "lowercase", "unique"]
        }
      },
      "tokenizer": {
        "emailtokenizer": {
          "type": "uax_url_email"
        }
      }
    }
  },
  "version": 1
}
'
FlorianHockmann commented 5 years ago

I don't have access to our setup right now, but it could be that the \ are escaped twice because we use them from an Ansible playbook. Could you try it again with the patterns line like this instead?

"patterns": ["([^@]+)", "(\\p{L}+)", "(\\d+)", "@(.+)", "([^-@]+)"]

I think we got the template mostly from this SO answer which also has the patterns defined like that. Otherwise I'll check on Monday when I'm back in office how our template looks exactly when it's applied to Elasticsearch. Sorry for the inconvenience!

porunov commented 5 years ago

Confirming the issue. I have the same bug without Elasticsearch template.

FlorianHockmann commented 5 years ago

So, you are able to reproduce the issue? That's great to hear as I tried it with a minimal setup but wasn't able to for some reason.

porunov commented 5 years ago

@FlorianHockmann do you know the difference between Text.textContainsPrefix and Text.textPrefix?

When executing the next query:

List<Object> result = g.traversal().V().has("t_Email", "t_Address", Text.textContainsRegex("example.com")).
                has("t_Address", Text.textContainsPrefix("root")).
                values("t_Address").toList();

JanusGraph executes the next query against ElasticSearch:

{
    "from": 0,
    "query": {
        "bool": {
            "must": [
                {
                    "regexp": {
                        "t_Address": "example.com"
                    }
                },
                {
                    "prefix": {
                        "t_Address": "root"
                    }
                }
            ]
        }
    },
    "size": 50
}

When executing the next query:

result = g.traversal().V().has("t_Email", "t_Address", Text.textContainsRegex("example.com")).
                limit(5).
                has("t_Address", Text.textPrefix("root")).
                values("t_Address").toList();

or this one:

result = g.traversal().V().has("t_Email", "t_Address", Text.textContainsRegex("example.com")).
                has("t_Address", Text.textPrefix("root")).
                values("t_Address").toList();

JanusGraph executes the next query against ElasticSearch:

{
    "from": 0,
    "query": {
        "regexp": {
            "t_Address": "example.com"
        }
    },
    "size": 50
}
porunov commented 5 years ago

@FlorianHockmann Yes, I am able to reproduce this issue.

My setup is the next:

private void init(){

    JanusGraphManagement mgmt = graph.openManagement();

    PropertyKey address = mgmt.makePropertyKey("t_Address").dataType(String.class).
            cardinality(Cardinality.SINGLE).make();
    VertexLabel email = mgmt.makeVertexLabel("t_Email").make();
    mgmt.buildIndex("t_emailbyaddress", Vertex.class).
            addKey(address, Mapping.TEXT.asParameter(), new Parameter("t_EmailByAddress", "t_email")).
                    indexOnly(email).buildMixedIndex("search");

    mgmt.commit();

    try {
        ManagementSystem.awaitGraphIndexStatus(graph, "t_emailbyaddress").call();
    } catch (InterruptedException e) {
        e.printStackTrace();
    }

    mgmt = graph.openManagement();

    try {
        mgmt.updateIndex(mgmt.getGraphIndex("t_emailbyaddress"), SchemaAction.REINDEX).get();
    } catch (InterruptedException | ExecutionException e) {
        e.printStackTrace();
    }

    mgmt.commit();

    for(int i=0;i<9;i++){
        String root;
        if(i==0){
            root = "root";
        } else {
            root = "aroot"+i;
        }
        graph.addVertex(T.label, "t_Email", "t_Address", root+"@example.com");
    }

    graph.tx().commit();
}

Testing queries are the next:

private void queryTests(){
    List<Object> result = g.traversal().V().has("t_Email", "t_Address", Text.textContainsRegex("example.com")).
            has("t_Address", Text.textContainsPrefix("root")).
            values("t_Address").toList();

    result = g.traversal().V().has("t_Email", "t_Address", Text.textContainsRegex("example.com")).
            limit(5).
            has("t_Address", Text.textPrefix("root")).
            values("t_Address").toList();

    result = g.traversal().V().has("t_Email", "t_Address", Text.textContainsRegex("example.com")).
            has("t_Address", Text.textPrefix("root")).
            values("t_Address").toList();
}

The first two queries return a list with one element root@example.com. The 3rd query returns an empty list. JanusGraph version: 0.3.1 ElasticSearch version: 6.5.4 ScyllaDB version: 2.3.1

FlorianHockmann commented 5 years ago

do you know the difference between Text.textContainsPrefix and Text.textPrefix?

textContainsPrefix (like all textContains* predicates) is a full-text search predicate which means that it works on a tokenized version of the string, e.g, on individual words. This can be used on indices that were created with Mapping.TEXT. textPrefix on the other hand is a string search predicate which works on the string as is, without any tokenization. It is available for indices with Mapping.STRING.

The index here is created with Mapping.TEXT which means that only predicates starting with textContains (the full-text predicates) can use the index. Therefore, it is correct that JanusGraph won't include the textPrefix predicate in the query to Elasticsearch. Instead, it needs to filter the results returned from the first textContainsRegex in-memory itself.

Yes, I am able to reproduce this issue.

Great, so it's not specific somehow to our setup. That will hopefully make it easier to find the problem.

porunov commented 5 years ago

P.eq(value) also don't work in this scenario. I.e. The query returns a list with the element root@example.com:

result = g.traversal().V().has("t_Email", "t_Address", Text.textContainsRegex("example.com")).
        limit(5).
        has("t_Address", P.eq("root@example.com")).
        values("t_Address").toList();

And this query doesn't return anything:

result = g.traversal().V().has("t_Email", "t_Address", Text.textContainsRegex("example.com")).
        has("t_Address", P.eq("root@example.com")).
        values("t_Address").toList();

I've noticed that JanusGraph actually uses textPrefix in the lucene query but it doesn't use it in the ElasticSearch query. I think like JanusGraph thinks that all has steps should be combined when they are used against an index backend. But I am not sure.

I've just noticed in the constructor of JanusGraphStep that when we use in your query .limit(5) then it generates next GraphCentricQuery here: ~label = t_Email AND t_Address textContainsRegex example.com

When we don't use .limit(5) the next query is generated: ~label = t_Email AND t_Address textContainsRegex example.com AND t_Address textPrefix root

Not sure if I am digging in the right direction but maybe it will help somehow.

Update: ElasticSearchIndex gets the next IndexQuery: [(t_Address textContainsRegex example.com)]:t_emailbyaddress

ChenZhaobin commented 3 years ago

@farodin91 @FlorianHockmann @porunov Hi , guys ,how long can we get this issue fixed? it seems very important and should be a very common used feature anyway.

ChenZhaobin commented 3 years ago

or do we have an alternative solution for this?

FlorianHockmann commented 3 years ago

@ChenZhaobin JanusGraph is completely maintained by an open community of volunteers so it's hard to give any estimates on when a certain issue will be fixed. Someone would need to look into these issues to investigate them further and then submit a PR with a fix.

If these issues are important to you and you have some time, then you could also try to familiarize yourself with the relevant parts of the JanusGraph codebase to try to find a fix. That would definitely be a very welcome contribution to the project!

ChenZhaobin commented 3 years ago

@FlorianHockmann @porunov I'd be absolutely delighted to contribute to this project, could you please give some instruction on how to run and debug this project ? such as when I invoke a janusgraph query ,how do I trace and find the source query in ES and cassandra? Actucally ,for now ,I only familar with how to config and run the docker container,I think in order to get this issue fixed ,I should dig into the code level to debug what's really matters.

FlorianHockmann commented 3 years ago

Sure, although it's a bit hard to give general recommendations on how to debug something in JanusGraph. I would just start by checking out the project, opening it in your favorite Java IDE and then try to execute some tests that seem relevant to this issue. You can then try to add another test case that reproduces the issue. That should be a good starting point to use a debugger to better understand what's happening in JanusGraph. I would for example add some breakpoints in relevant parts of the code base to see in action what's happening there when your problematic query gets executed. If you have specific problems or questions on how to test or debug JanusGraph, then you can also ask directly in our community channels for development: