Robbie1977 opened 4 years ago
Can you give me a full example of the target string that you want to see in KB and in PDB?
In PDB I'd be happy with no escaping, but it should probably match the other JSON snippets. The parameters in question are on the edges returned by `MATCH (n:Template)-[r]->(n) RETURN r LIMIT 25`:
"extent": "{\"X\":1023,\"Y\":511,\"Z\":217}"
"voxel": "{\"X\":0.62208,\"Y\":0.62208,\"Z\":0.62208}",
"center": "{\"X\":511,\"Y\":255,\"Z\":108}"
So we're going to remove the double quotes from the string just to get it loaded for now.
As a general principle, I think we should avoid having to quote strings in the DB for the pipeline to run. If at all possible, the pipeline should deal gracefully with quoting as needed during loading, leaving the strings in databases and triplestores unquoted. Anything else leads to mismatches in quoting needs between pipeline steps, making the whole system brittle.
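For illustration, the loading-time approach could be as small as this (a minimal sketch; the helper name is hypothetical, and it just applies standard RFC 4180 CSV quoting when the intermediate CSV is written, leaving stored values untouched):

```java
class CsvWriterUtil {
    // Values stay raw in the DB/triplestore; quoting is applied only when
    // writing the intermediate CSV (standard RFC 4180 quote doubling).
    static String csvField(String raw) {
        return "\"" + raw.replace("\"", "\"\"") + "\"";
    }
}
```

(As the discussion below shows, quote doubling alone is not quite enough once backslash escapes enter the picture.)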
The problem appears to be with the conversion of OWL to Neo4j when loading PDB. Both KB and the triplestore look OK:
In the triplestore:
```sparql
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix owl: <http://www.w3.org/2002/07/owl#>
prefix vfb: <http://virtualflybrain.org/reports/>
prefix n2np: <http://n2o.neo/property/>
prefix RO: <http://purl.obolibrary.org/obo/RO_>

SELECT ?edgeProp ?val
WHERE {
  [ rdf:type owl:Axiom ;
    owl:annotatedSource vfb:VFBc_00017894 ;
    owl:annotatedProperty RO:0002026 ;
    owl:annotatedTarget vfb:VFBc_00017894 ;
    ?edgeProp ?val
  ] .
}
```
=>
| EdgeProp | Val |
|---|---|
| rdf:type | owl:Axiom |
| owl:annotatedSource | http://virtualflybrain.org/reports/VFBc_00017894 |
| owl:annotatedProperty | http://purl.obolibrary.org/obo/RO_0002026 |
| owl:annotatedTarget | http://virtualflybrain.org/reports/VFBc_00017894 |
| http://n2o.neo/custom/center | "{"X":511,"Y":255,"Z":108}" |
| http://n2o.neo/custom/extent | "{"X":1023,"Y":511,"Z":217}" |
| http://n2o.neo/custom/filename | "JFRC2_template_mask130819" |
| http://n2o.neo/custom/folder | "http://www.virtualflybrain.org/data/VFB/i/0001/7894/" |
| http://n2o.neo/custom/index | "0.0"^^xsd:float |
| http://n2o.neo/custom/orientation | "LIP" |
| http://n2o.neo/custom/typ | "Related" |
| http://n2o.neo/custom/voxel | "{"X":0.62208,"Y":0.62208,"Z":0.62208}" |
But edge addition fails (silently?!) for this edge when loading PDB, presumably because escaping needs to be added to values in the CSV tables used in loading. THIS ISSUE BREAKS THE PIPELINE IN FUNDAMENTAL WAYS. FIXING IT IS HIGH PRIORITY.
A second, more minor issue is that, if at all possible, escapes present in strings in the DB should not be stripped by the pipeline. This is currently happening in the KB2KB step. I'd class it as a minor irritant.
Nico has identified the likely offending line:
Looks like this (slightly odd) quoting will do the trick:
https://neo4j.com/developer/kb/how-do-i-use-load-csv-with-data-including-quotes/
A literal `"` becomes `""` (HT Robbie).
I tried some tests. Using this Cypher:

```cypher
// assuming a preceding LOAD CSV along these lines (file name illustrative):
LOAD CSV WITH HEADERS FROM 'file:///test.csv' AS line
MATCH (t:Template) WHERE t.short_form = line.template
MERGE (t)-[r:in_register_with]-(t) SET r.voxel = line.voxel
```
All three of these CSV lines successfully added the annotated edge.
| template | voxel |
|---|---|
| test_123 | {"X":0.62208,"Y":0.62208, "Z":0.62208} |
| test_123 | {""X"":0.62208,""Y"":0.62208, ""Z"":0.62208} |
| test_123 | "{""X"":0.62208,""Y"":0.62208, ""Z"":0.62208}" |
Only this one failed:

| template | voxel |
|---|---|
| test_123 | "{"X":0.62208,"Y":0.62208, "Z":0.62208}" |
=> The API returns an error: `{'code': 'Neo.DatabaseError.Statement.ExecutionFailed', 'message': '...' }`
So - it looks like there are a few options for fixing this. Would be good to get a fix in place ASAP.
The other issue that concerns me is silent failure. This is presumably not due to a MATCH fail but due to bad Cypher, so I'd expect an exception (Statement.ExecutionFailed, as above) => pipeline fail. Is better exception handling needed?
I will think a bit about this, but unless Neo4j itself complains that it didn't add something that was actually a row in a TSV, I can't do much; I would have thought that a badly formed CSV would cause a failure (I will check that later this week). I would suggest it is better to create a more complete suite of healthchecks that systematically compares the number of edges you expect with the number of edges actually added, as a post-processing pipeline. This could be trivially achieved with a system that compares SPARQL results in the integration layer with Cypher results in KB and PDB. I think this would be a cool ticket to work on; I would be happy to do it, but maybe there are some more urgent things to do first :) But if you want, I can do it.
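Such a healthcheck could look roughly like this (a minimal sketch only: endpoint URLs, credentials and the two count queries are all illustrative, using Jena for SPARQL and the Neo4j Java driver for Cypher):

```java
import org.apache.jena.query.*;
import org.neo4j.driver.*;

public class EdgeCountHealthcheck {
    public static void main(String[] args) {
        // Count annotated edges in the triplestore (query and endpoint illustrative).
        String sparql = "SELECT (COUNT(?ax) AS ?n) WHERE { ?ax a <http://www.w3.org/2002/07/owl#Axiom> }";
        long expected;
        try (QueryExecution qe = QueryExecutionFactory.sparqlService("http://localhost:8080/sparql", sparql)) {
            expected = qe.execSelect().next().getLiteral("n").getLong();
        }

        // Count the corresponding edges in PDB.
        long actual;
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "password"));
             Session session = driver.session()) {
            actual = session.run("MATCH ()-[r:in_register_with]->() RETURN count(r) AS n")
                            .single().get("n").asLong();
        }

        // Any mismatch should fail the build rather than pass silently.
        if (expected != actual) {
            throw new IllegalStateException(
                "Edge count mismatch: expected " + expected + ", found " + actual);
        }
    }
}
```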
> but unless neo4j itself complains that it didn't add something that was actually a row in a TSV
Are you sure it isn't complaining? I'd have expected it to fail with a Cypher error (see example above). Would it be possible to see an example of how the quoting happens right now? i.e. what does `.quoteAsString` do to this: `{"X":0.62208,"Y":0.62208, "Z":0.62208}`?
This might be one for @Robbie1977 to fix - but it looks to me like query execution errors do not => build fail here:
https://jenkins.virtualflybrain.org/view/pip_pipeline2/job/pip_vfb-prod/372/console e.g.
```
ERROR: An error has occurred.. 20719.241 sec
java.util.concurrent.ExecutionException: org.neo4j.graphdb.QueryExecutionException: Property values can only be of primitive types or arrays thereof
```
Also @Robbie1977 - is it possible to recover this file: file:/relationship_in_register_with.txt ?
I'd be very surprised if the offending lines in this file do not => an error like this: Neo.DatabaseError.Statement.ExecutionFailed. But I don't see any sign of a log message reporting this. I have a quick test I can run locally, given CSV (one example line would do) + Cypher. If this is being buried, we need to unbury it and use it to trigger a build fail. Also - if these error messages are being buried, then revealing them would make debugging much easier.
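The quick local test could be as simple as the following sketch (file path and query are placeholders; it just checks that a bad LOAD CSV actually surfaces as an exception from the Java driver):

```java
import org.neo4j.driver.*;
import org.neo4j.driver.exceptions.Neo4jException;

public class QuoteFailureTest {
    public static void main(String[] args) {
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "password"));
             Session session = driver.session()) {
            // Load a CSV containing the known-bad quoting pattern.
            session.run("LOAD CSV WITH HEADERS FROM 'file:///bad_quoting.csv' AS line "
                      + "MATCH (t:Template {short_form: line.template}) "
                      + "MERGE (t)-[r:in_register_with]-(t) SET r.voxel = line.voxel")
                   .consume();
            System.out.println("No error raised - failure is being swallowed somewhere else.");
        } catch (Neo4jException e) {
            // This is what we'd hope to see: the error surfaces and can fail the build.
            System.err.println(e.code() + ": " + e.getMessage());
            System.exit(1);
        }
    }
}
```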
I am afraid I am cleaning those up (to save disk space, and to not confuse neo4j2owl when running multiple times); I could probably add a function that zips these files up so they can be moved after the run is finished.
Don't worry for now if it's too much faff to store. The main things I want fixed:
"
@matentzn With these in place we can chase up other build errors we've missed to date that are visible in the log.
@Robbie1977 - are you happy that this is an efficient approach?
> Test whether malformed Cypher errors are returned (could test with any malformed Cypher) and if not, make sure they are (assuming possible - would be surprised if not, but the API used is different from the REST API I'm familiar with)
-> I assume you mean in the context of the plugin? I am 99% sure that yes, but I will check it again. @dosumis
See also: https://neo4j.com/developer/kb/how-do-i-use-load-csv-with-data-including-quotes/
Please double check that this is correct:
"{\"X\":1}"
{"X":1}
{""X"":1}
"{\"X\":1}"
> As a general principle, I think we should avoid having to quote strings in the DB for the pipeline to run. If at all possible, the pipeline should deal gracefully with quoting as needed during loading, leaving the strings in databases and triplestores unquoted (like OWL). Anything else leads to mismatches in quoting needs between pipeline steps, making the whole system brittle.
Is the quoting required in OWL?
-> In some important serialisations, yes; most importantly ttl.
Escaping of `"` with `\` is not required in Neo. Better if we can avoid it; otherwise it needs to be stripped out before converting to JSON.
Fixing this ticket was really tricky for me in the end (don't get fooled by the solution being only two lines of code), and the solution leaves one with a grumble of dissatisfaction. But it is the best I could come up with. The problem was that for CSV import in Neo4j, both `"` and `\` can be used as escapes for `"`, i.e. `""` or `\"`. The pipeline was failing last week because some FBBT classes have quotes in definitions, which themselves get encoded as JSON; see for example:
```cypher
MATCH (n:Class {iri:'http://purl.obolibrary.org/obo/FBbt_00001954'}) RETURN n.definition
```
So escaping `"` with `""` led to this construct in the FBBT definition: `\""`, which caused the CSV importer to fail because of a misplaced quote sign. Anyway, the new solution is simply this:
val = val.replaceAll("(?<![\\\\])[\"]","\"\"");
val = val.replaceAll("[\\\\]+[\"]","\\\\\\\\\"\"");
The first regex will escape every quote that is not preceded by a `\`, like so: `""`. The second regex will find all cases where a quote is (and here comes a slight hack) preceded by one or more backslashes, and turns that into `\\""`.
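To make the effect concrete, here is a small demo of those two replacements (input strings invented for illustration):

```java
public class EscapeDemo {
    static String escape(String val) {
        val = val.replaceAll("(?<![\\\\])[\"]", "\"\"");     // bare "  ->  ""
        val = val.replaceAll("[\\\\]+[\"]", "\\\\\\\\\"\""); // \" (any run of \)  ->  \\""
        return val;
    }

    public static void main(String[] args) {
        System.out.println(escape("{\"X\":1}"));      // prints: {""X"":1}
        System.out.println(escape("say \\\"hi\\\"")); // prints: say \\""hi\\""
    }
}
```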
I think this catches all cases apart from one (which I just could not figure out):

This is an example string that for some reason has more than one \\\\\\\" followed by a quote.

This would be translated to:

This is an example string that for some reason has more than one \\"" followed by a quote.

And round-trip to:

This is an example string that for some reason has more than one \" followed by a quote.

Hope that is acceptable. The regex required to do this right would be something like "if a quote is preceded by a number of backslashes, double the number of backslashes" -> no idea how you would do that.
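For what it's worth, one way to express "double the number of backslashes before a quote" is a computed replacement via `Matcher.appendReplacement` (a sketch only, not what was committed):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SlashDoubler {
    static String doubleSlashesBeforeQuotes(String val) {
        Matcher m = Pattern.compile("(\\\\+)\"").matcher(val);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String slashes = m.group(1);
            // Double the run of backslashes, then emit the CSV-doubled quote.
            m.appendReplacement(sb, Matcher.quoteReplacement(slashes + slashes + "\"\""));
        }
        m.appendTail(sb);
        return sb.toString();
    }
}
```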
Writing all that led me to wonder why I don't just do:
val = val.replaceAll("[\"]","\"\"");
val = val.replaceAll("[\\\\]","\\\\\\\\");
separately escaping quotes and backslashes. Unit tests pass, I will try now on the whole VFB... Sigh. Not my strong suit.
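A quick check (inputs invented) that the simpler two-pass escaping also handles the run-of-backslashes case the lookbehind approach mangled:

```java
public class SimpleEscapeTest {
    static String escape(String val) {
        val = val.replaceAll("[\"]", "\"\"");        // "  ->  ""
        val = val.replaceAll("[\\\\]", "\\\\\\\\");  // \  ->  \\
        return val;
    }

    public static void main(String[] args) {
        System.out.println(escape("{\"X\":1}"));             // prints: {""X"":1}
        System.out.println(escape("more than one \\\\\""));  // prints: more than one \\\\""
    }
}
```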
Well, this was probably my dumbest 3 wasted hours of the year! The last suggestion of course works normally. It's a bit of a pity that the Neo4j Java API does not contain a class for CSV escaping, but I guess this covers it! Seems to work, try it! See for example:
```cypher
MATCH (n:Class {iri:'http://purl.obolibrary.org/obo/FBbt_00001954'}) RETURN n.definition
```
So it looks like all code that uses these now needs to de-escape the quoting of JSON keys and values. I can live with that.
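That de-escaping could be as little as this (helper name hypothetical; a naive sketch that assumes only the `\"` and `\\` escapes introduced by the pipeline occur in stored values):

```java
class JsonDeEscaper {
    // Naive de-escape for values read back from PDB before JSON parsing.
    static String deEscape(String stored) {
        return stored.replace("\\\"", "\"")   // \"  ->  "
                     .replace("\\\\", "\\");  // \\  ->  \
    }
}
```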
Quick question - why did you avoid the double double quoting strategy in loading? This leaves the JSON intact. (You may still need backslash escaping inside values, but at least this wouldn't mess with loading.)
See Slack for more discussion about this, but the valid JSON string:

```
"X":"HEllo this is \"OMEGA\""
```

currently has the escape because it's JSON; is this what you mean? This gets escaped to this for CSV loading:

```
""X"":""HEllo this is \\""OMEGA\\""""
```
Converting to a list loses any escapes that are artificially forced into KBW, so it's probably just worth adding the required quoting at the conversion step?