VirtualFlyBrain / neo4j2owl

Semantics preserving mapping of OWL 2 EL and Neo4J. Under development, do not use.
Apache License 2.0
6 stars 4 forks source link

Pipeline 2 should be able to handle double quotes (and any other special characters) in neo property (AP literal) values. #44

Open Robbie1977 opened 4 years ago

Robbie1977 commented 4 years ago

Converting to list is losing any escapes that are artificially forced into KBW so probably just worth adding required quoting at conversion step?

matentzn commented 4 years ago

Can you give me a full example of the target string that you want to see in KB and in PDB?

Robbie1977 commented 4 years ago

In PDB I'd be happy with no escaping but should probably match the other JSON snippets The parameters in question are on MATCH (n:Template)-[r]->(n) RETURN r LIMIT 25 edges

Robbie1977 commented 4 years ago

"extent": "{\"X\":1023,\"Y\":511,\"Z\":217}"

"voxel": "{\"X\":0.62208,\"Y\":0.62208,\"Z\":0.62208}",
  "center": "{\"X\":511,\"Y\":255,\"Z\":108}"
Robbie1977 commented 4 years ago

So we're going to remove the double quotes from the string just to get it loaded for now.

dosumis commented 4 years ago

As a general principle, I think we should avoid having to quote strings in the DB for the pipeline to run. If at all possible, the pipeline should deal gracefully with quoting as needed during loading, leaving the strings in databases and triplestores unquoted. Anything else leads to mistmatches in quoting needs between pipeline steps, making the whole system brittle.

The problem appears to be with conversion of owl to neo when loading pdb. Both KB and the triplestore look OK:

In the triplestore:

prefix owl: <http://www.w3.org/2002/07/owl#>
prefix vfb: <http://virtualflybrain.org/reports/>
prefix n2np: <http://n2o.neo/property/>
prefix RO: <http://purl.obolibrary.org/obo/RO_>

SELECT ?edgeProp ?val
WHERE { 
[ rdf:type owl:Axiom ;
   owl:annotatedSource vfb:VFBc_00017894 ;
   owl:annotatedProperty RO:0002026 ;
   owl:annotatedTarget vfb:VFBc_00017894 ;
   ?edgeProp ?val
 ] .
}

=>

EdgeProp Val
rdf:type owl:Axiom
owl:annotatedSource http://virtualflybrain.org/reports/VFBc_00017894
owl:annotatedProperty http://purl.obolibrary.org/obo/RO_0002026
owl:annotatedTarget http://virtualflybrain.org/reports/VFBc_00017894
http://n2o.neo/custom/center "{"X":511,"Y":255,"Z":108}"
http://n2o.neo/custom/extent "{"X":1023,"Y":511,"Z":217}"
http://n2o.neo/custom/filename "JFRC2_template_mask130819"
http://n2o.neo/custom/folder "http://www.virtualflybrain.org/data/VFB/i/0001/7894/"
http://n2o.neo/custom/index "0.0"^^xsd:float
http://n2o.neo/custom/orientation "LIP"
http://n2o.neo/custom/typ "Related"
http://n2o.neo/custom/voxel "{"X":0.62208,"Y":0.62208,"Z":0.62208}"

But edge addition fails (silently?!) for this edge when loading pdb, presumably because escaping needs to be added to values in the csv tables used in loading. THIS ISSUE BREAKS THE PIPELINE IN FUNDAMENTAL WAYS. FIXING IS HIGH PRIORITY.

A second, more minor issue, is that, if at all possible, escapes present in strings in the DB should not be stripped by the pipeline. This is currently happening in the KB2KB step. I'd class if as a minor irritant.

dosumis commented 4 years ago

Nico has identified the likely offending line:

https://github.com/VirtualFlyBrain/neo4j2owl/blob/450f14ba037f6fc77769ec59245d93d7949e0949/src/main/java/ebi/spot/neo4j2owl/importer/N2OCSVWriter.java#L158

Looks like this (slightly odd) quoting will do the trick: https://neo4j.com/developer/kb/how-do-i-use-load-csv-with-data-including-quotes/ Literal " -> "" (HT Robbie).

I tried some tests. Using this cypher:

MATCH (t:Template) WHERE t.short_form = line.template 
MERGE (t)-[r:in_register_with]-(t) SET r.voxel = line.voxel

All three of these csv lines successfully added the the annotated edge.

template voxel
test_123 {"X":0.62208,"Y":0.62208, "Z":0.62208}
test_123 {""X"":0.62208,""Y"":0.62208, ""Z"":0.62208}
test_123 "{""X"":0.62208,""Y"":0.62208, ""Z"":0.62208}"

Only this one failed

template voxel
test_123 "{"X":0.62208,"Y":0.62208, "Z":0.62208}"

=> API returning an error {'code': 'Neo.DatabaseError.Statement.ExecutionFailed', 'message': '...' }

So - looks like there are a few options for fixing. Would be good to get fix in place ASAP.

The other issue that concerns me is silent failure. This is presumably is not due to a MATCH fail but due to bad cypher, so I'd expect an exception (from Statement.ExecutionFailed as above) => pipeline fail. Is better exception handling needed?

matentzn commented 4 years ago

I will think a bit about this, but unless neo4j itself complains that it didnt add something that was actually a row in a TSV, I cant do much; I would have thought that a badly formed CSV would cause a failure (I will check that later this week). I would suggest it is better to create a more complete suite of healthchecks, where you count number of edges you expect and number of edges actually added etc systematically as a post-processing pipeline. This could be trivially achieved with a system that compares SPARQL results in the integration layer with CYPHER results in KB, and PDB. I think this would be a cool ticket to work on; I would be happy to do it, but maybe there are some more urgent things to do first :) But if you want, I can do it.

dosumis commented 4 years ago

but unless neo4j itself complains that it didn't add something that was actually a row in a TSV

Are you sure it isn't complaining? I'd have expected it to fail with a cypher error (see example above). Would it be possible to see an example of of how the quoting happens right now? i.e. what does .quoteAsString do to this: {"X":0.62208,"Y":0.62208, "Z":0.62208} ?

dosumis commented 4 years ago

This might be one for @Robbie1977 to fix - but it looks to me like query execution errors do not => build fail here:

https://jenkins.virtualflybrain.org/view/pip_pipeline2/job/pip_vfb-prod/372/console e.g.

ERROR: An error has occurred.. 20719.241 sec
java.util.concurrent.ExecutionException: org.neo4j.graphdb.QueryExecutionException: Property values can only be of primitive types or arrays thereof

Also @Robbie1977 - is it possible to recover this file. file:/relationship_in_register_with.txt ?

I'd be very surprised if the offending lines in this file do not => an error like this: Neo.DatabaseError.Statement.ExecutionFailed. But I don't see any sign of a log message reporting this. I have a quick test I can run locally, given CSV (one example line would do) + Cypher. If this is being buried, we need to unbury it and use it to trigger a build fail. Also - if these error messaged are being buried, then revealing them would make debugging much easier.

matentzn commented 4 years ago

I am afraid I am cleaning those up (to save disk space, and not confuse neo4j2owl when running multiple times); I could probably add a function that zips these files up so they can be moved after the run ist finished.

dosumis commented 4 years ago

Don't worry for now if too much faff to store . The main things I want fixed:

With these in place we can chase up other build errors we've missed to date that are visible in the log.

@Robbie1977 - are you happy that this is an efficient approach?

matentzn commented 4 years ago

Test whether malformed Cypher errors are returned (could test with any malformed cypher) and if not, make sure they are (assuming possible - would be surprised if not, but API used is different from REST API I'm familiar with) -> I assume you mean in the context of the plugin? I am 99% sure that yes, but I will check it again. @dosumis

matentzn commented 4 years ago

See also: https://neo4j.com/developer/kb/how-do-i-use-load-csv-with-data-including-quotes/

Please double check that this is correct:

dosumis commented 4 years ago

As a general principle, I think we should avoid having to quote strings in the DB for the pipeline to run. If at all possible, the pipeline should deal gracefully with quoting as needed during loading, leaving the strings in databases and triplestores unquoted. Anything else leads to mistmatches in quoting needs between pipeline steps, making the whole system brittle.

dosumis commented 4 years ago

Is the quoting required in OWL?

matentzn commented 4 years ago

In some important serialisations yes -> most importantly ttl

dosumis commented 4 years ago

Escaping of " with \ is not required in Neo. Better if we can avoid, otherwise it needs to be stripped out before converting to JSON.

matentzn commented 4 years ago

Fixing this ticket was really tricky for me in the end (don't get fooled by the solution being only two lines of code), and the solution leaves one with a grumble of dissatisfaction. But it is the best I could come up with. The problem was that for CSV import in neo4j, both " and \ can be used as escapes for ", like "" or \". The pipeline was failing last week because some FBBT classes have quotes in definitions, which themselves get encoded JSON, see for example:

MATCH (n:Class {iri:'http://purl.obolibrary.org/obo/FBbt_00001954'}) RETURN n.definition

So escaping " with "" lead to this construct in the FBBT definition \"" which caused the CSV importer to fail because of a misplaced quote sign. Anyways, the new solution is simply this:

 val = val.replaceAll("(?<![\\\\])[\"]","\"\"");
 val = val.replaceAll("[\\\\]+[\"]","\\\\\\\\\"\"");

The first regex will escape every quote that is not preceeded by a \ like so: "". The second regex will find all cases where a quote is (and here comes a slight hack) preceded by one or more slashes, and turns that into \\"".

I think this catches all cases apart from one (which I just could not figure out:

This is an example string that by some reason has more than one \\\\\\\" followed by a quote.

This would be translated to:

This is an example string that by some reason has more than one \\"" followed by a quote.

And roundtrip to:

This is an example string that by some reason has more than one \" followed by a quote.

Hope that is acceptable. The regex required to do this right would be something like "if a quote is preceded by a number of backslashes, double the number of backslashes" -> now idea how you would do that.

matentzn commented 4 years ago

Writing all that led me to wonder why I dont just do:

val = val.replaceAll("[\"]","\"\"");
val = val.replaceAll("[\\\\]","\\\\\\\\");

separately escaping quotes and backslashes.. Unit tests pass, I will try now on whole VFB... Sigh. Not my strong suite.

matentzn commented 4 years ago

Well, this was probably my dumbest 3 wasted hours of the year! The last suggestion of course works normally. Its a bit of a pity that neo4j java api does not contain a class for csv escaping, but I guess this covers it! Seems to work, try it! See for example:

MATCH (n:Class {iri:'http://purl.obolibrary.org/obo/FBbt_00001954'}) RETURN n.definition
dosumis commented 4 years ago

So looks like all code that uses these now needs to de-escape quoting of JSON keys and values. I can live with that.

Quick question - why did you avoid the double double quoting strategy in loading? This leaves the JSON intact. (You may still need backslash escaping inside values, but at least this wouldn't mess with loading.)

matentzn commented 4 years ago

See slack, for more discussion about this, but the valid JSON string:

"X":"HEllo this is \"OMEGA\""

currently has the escape because its JSON; is this what you mean? This gets escaped to this for CSV loading:

""X"":""HEllo this is \\""OMEGA\\""""