Open FlorianHockmann opened 5 years ago
@FlorianHockmann could you please provide also the whole command how you are creating the mixed index?
How many there addresses with example.com
?
could you please provide also the whole command how you are creating the mixed index?
Sure, it should have been created like this (we use a script that is auto-generated to create the index, I tried to reconstruct it in a few lines, but there could be syntax errors now):
address = mgmt.makePropertyKey("Address").dataType(String.class).
cardinality(Cardinality.SINGLE).make()
email = mgmt.makeVertexLabel("Email").make()
mgmt.buildIndex("EmailByAddress, Vertex.class).
addKey(address, Mapping.TEXT.asParameter(), new Parameter<String>('EmailByAddress', 'email')).
indexOnly(email).buildMixedIndex("search")
I'm not sure if that's also relevant, but we then use an index template to define how this property should be indexed in Elasticsearch:
{
"template": "<%= janusgraphKeyspace %>_emailbyaddress*",
"order": 1,
"settings": {
"analysis": {
"filter": {
"emailfilter": {
"type": "pattern_capture",
"preserve_original": 1,
"patterns": ["([^@]+)", "(\\\\p{L}+)", "(\\\\d+)", "@(.+)", "([^-@]+)"]
}
},
"analyzer": {
"emailanalyzer": {
"tokenizer": "emailtokenizer",
"filter": ["emailfilter", "lowercase", "unique"]
}
},
"tokenizer": {
"emailtokenizer": {
"type": "uax_url_email"
}
}
}
},
"version": 1
}
How many there addresses with example.com?
9 right now:
gremlin> g.V().has('EmailMeta', 'Address', textContainsRegex('example.com')).count()
==>9
@FlorianHockmann thanks for the information. I will try to research the issue this week or the next one if no one resolves it before. Not sure if I will find the problem, but if I find something interesting, I will post it here.
Also, I have a question not relevant to the issue, but what is the purpose of using new Parameter<String>('EmailByAddress', 'email')
? Where this parameter will be used?
I will try to research the issue this week or the next one if no one resolves it before.
That would of course be great. If you need any more information, just ask.
Also, I have a question not relevant to the issue, but what is the purpose of using new Parameter
('EmailByAddress', 'email')? Where this parameter will be used?
I'm not 100% sure, but I think we use it to specify the name of the index with field mapping so we can match on that name with the index template I posted above.
@FlorianHockmann I can't create the same template. Getting the error:
{"error":{"root_cause":[{"type":"pattern_syntax_exception","reason":"Illegal repetition near index 3\n(\\p{L}+)\n ^"}],"type":"pattern_syntax_exception","reason":"Illegal repetition near index 3\n(\\p{L}+)\n ^"},"status":400}
I am using the next command:
curl -X PUT "192.168.56.50:9200/_template/janusgraph_emailbyaddress_template" -H 'Content-Type: application/json' -d'
{
"template": "janusgraph_emailbyaddress*",
"order": 1,
"settings": {
"analysis": {
"filter": {
"emailfilter": {
"type": "pattern_capture",
"preserve_original": 1,
"patterns": ["([^@]+)", "(\\\\p{L}+)", "(\\\\d+)", "@(.+)", "([^-@]+)"]
}
},
"analyzer": {
"emailanalyzer": {
"tokenizer": "emailtokenizer",
"filter": ["emailfilter", "lowercase", "unique"]
}
},
"tokenizer": {
"emailtokenizer": {
"type": "uax_url_email"
}
}
}
},
"version": 1
}
'
I don't have access to our setup right now, but it could be that the \
are escaped twice because we use them from an Ansible playbook. Could you try it again with the patterns
line like this instead?
"patterns": ["([^@]+)", "(\\p{L}+)", "(\\d+)", "@(.+)", "([^-@]+)"]
I think we got the template mostly from this SO answer which also has the patterns
defined like that.
Otherwise I'll check on Monday when I'm back in office how our template looks exactly when it's applied to Elasticsearch. Sorry for the inconvenience!
Confirming the issue. I have the same bug without Elasticsearch template.
So, you are able to reproduce the issue? That's great to hear as I tried it with a minimal setup but wasn't able to for some reason.
@FlorianHockmann do you know the difference between Text.textContainsPrefix
and Text.textPrefix
?
When executing the next query:
List<Object> result = g.traversal().V().has("t_Email", "t_Address", Text.textContainsRegex("example.com")).
has("t_Address", Text.textContainsPrefix("root")).
values("t_Address").toList();
JanusGraph executes the next query against ElasticSearch:
{
"from": 0,
"query": {
"bool": {
"must": [
{
"regexp": {
"t_Address": "example.com"
}
},
{
"prefix": {
"t_Address": "root"
}
}
]
}
},
"size": 50
}
When executing the next query:
result = g.traversal().V().has("t_Email", "t_Address", Text.textContainsRegex("example.com")).
limit(5).
has("t_Address", Text.textPrefix("root")).
values("t_Address").toList();
or this one:
result = g.traversal().V().has("t_Email", "t_Address", Text.textContainsRegex("example.com")).
has("t_Address", Text.textPrefix("root")).
values("t_Address").toList();
JanusGraph executes the next query against ElasticSearch:
{
"from": 0,
"query": {
"regexp": {
"t_Address": "example.com"
}
},
"size": 50
}
@FlorianHockmann Yes, I am able to reproduce this issue.
My setup is the next:
private void init(){
JanusGraphManagement mgmt = graph.openManagement();
PropertyKey address = mgmt.makePropertyKey("t_Address").dataType(String.class).
cardinality(Cardinality.SINGLE).make();
VertexLabel email = mgmt.makeVertexLabel("t_Email").make();
mgmt.buildIndex("t_emailbyaddress", Vertex.class).
addKey(address, Mapping.TEXT.asParameter(), new Parameter("t_EmailByAddress", "t_email")).
indexOnly(email).buildMixedIndex("search");
mgmt.commit();
try {
ManagementSystem.awaitGraphIndexStatus(graph, "t_emailbyaddress").call();
} catch (InterruptedException e) {
e.printStackTrace();
}
mgmt = graph.openManagement();
try {
mgmt.updateIndex(mgmt.getGraphIndex("t_emailbyaddress"), SchemaAction.REINDEX).get();
} catch (InterruptedException | ExecutionException e) {
e.printStackTrace();
}
mgmt.commit();
for(int i=0;i<9;i++){
String root;
if(i==0){
root = "root";
} else {
root = "aroot"+i;
}
graph.addVertex(T.label, "t_Email", "t_Address", root+"@example.com");
}
graph.tx().commit();
}
Testing queries are the next:
private void queryTests(){
List<Object> result = g.traversal().V().has("t_Email", "t_Address", Text.textContainsRegex("example.com")).
has("t_Address", Text.textContainsPrefix("root")).
values("t_Address").toList();
result = g.traversal().V().has("t_Email", "t_Address", Text.textContainsRegex("example.com")).
limit(5).
has("t_Address", Text.textPrefix("root")).
values("t_Address").toList();
result = g.traversal().V().has("t_Email", "t_Address", Text.textContainsRegex("example.com")).
has("t_Address", Text.textPrefix("root")).
values("t_Address").toList();
}
The first two queries return a list with one element root@example.com
. The 3rd query returns an empty list.
JanusGraph version: 0.3.1
ElasticSearch version: 6.5.4
ScyllaDB version: 2.3.1
do you know the difference between Text.textContainsPrefix and Text.textPrefix?
textContainsPrefix
(like all textContains*
predicates) is a full-text search predicate which means that it works on a tokenized version of the string, e.g, on individual words. This can be used on indices that were created with Mapping.TEXT
.
textPrefix
on the other hand is a string search predicate which works on the string as is, without any tokenization. It is available for indices with Mapping.STRING
.
The index here is created with Mapping.TEXT
which means that only predicates starting with textContains
(the full-text predicates) can use the index. Therefore, it is correct that JanusGraph won't include the textPrefix
predicate in the query to Elasticsearch. Instead, it needs to filter the results returned from the first textContainsRegex
in-memory itself.
Yes, I am able to reproduce this issue.
Great, so it's not specific somehow to our setup. That will hopefully make it easier to find the problem.
P.eq(value)
also don't work in this scenario. I.e.
The query returns a list with the element root@example.com
:
result = g.traversal().V().has("t_Email", "t_Address", Text.textContainsRegex("example.com")).
limit(5).
has("t_Address", P.eq("root@example.com")).
values("t_Address").toList();
And this query doesn't return anything:
result = g.traversal().V().has("t_Email", "t_Address", Text.textContainsRegex("example.com")).
has("t_Address", P.eq("root@example.com")).
values("t_Address").toList();
I've noticed that JanusGraph actually uses textPrefix
in the lucene query but it doesn't use it in the ElasticSearch query. I think like JanusGraph thinks that all has
steps should be combined when they are used against an index backend. But I am not sure.
I've just noticed in the constructor of JanusGraphStep
that when we use in your query .limit(5)
then it generates next GraphCentricQuery
here:
~label = t_Email AND t_Address textContainsRegex example.com
When we don't use .limit(5)
the next query is generated:
~label = t_Email AND t_Address textContainsRegex example.com AND t_Address textPrefix root
Not sure if I am digging in the right direction but maybe it will help somehow.
Update: ElasticSearchIndex gets the next IndexQuery:
[(t_Address textContainsRegex example.com)]:t_emailbyaddress
@farodin91 @FlorianHockmann @porunov Hi , guys ,how long can we get this issue fixed? it seems very important and should be a very common used feature anyway.
or do we have an alternative solution for this?
@ChenZhaobin JanusGraph is completely maintained by an open community of volunteers so it's hard to give any estimates on when a certain issue will be fixed. Someone would need to look into these issues to investigate them further and then submit a PR with a fix.
If these issues are important to you and you have some time, then you could also try to familiarize yourself with the relevant parts of the JanusGraph codebase to try to find a fix. That would definitely be a very welcome contribution to the project!
@FlorianHockmann @porunov I'd be absolutely delighted to contribute to this project, could you please give some instruction on how to run and debug this project ? such as when I invoke a janusgraph query ,how do I trace and find the source query in ES and cassandra? Actucally ,for now ,I only familar with how to config and run the docker container,I think in order to get this issue fixed ,I should dig into the code level to debug what's really matters.
Sure, although it's a bit hard to give general recommendations on how to debug something in JanusGraph. I would just start by checking out the project, opening it in your favorite Java IDE and then try to execute some tests that seem relevant to this issue. You can then try to add another test case that reproduces the issue. That should be a good starting point to use a debugger to better understand what's happening in JanusGraph. I would for example add some breakpoints in relevant parts of the code base to see in action what's happening there when your problematic query gets executed. If you have specific problems or questions on how to test or debug JanusGraph, then you can also ask directly in our community channels for development:
This works as expected:
which uses only predicates for full-text search. If the second predicate is replaced by a string search predicate, then it doesn't work any more:
It works for some reason again if a
limit()
is added between the twohas()
steps:The index was created with
Mapping.TEXT.asParameter()
and a custom analyzer. However the analyzer shouldn't be important here as the secondhas()
can't be answered with the index as it would need a mapping ofSTRING
instead ofTEXT
.Versions and backends: JanusGraph: 0.3.1 ScyllaDB: 2.1 Elasticsearch: 5.6.3