RobokopU24 / ORION

Code that parses datasets from various sources and converts them for loading into graph databases.
MIT License

Graph Validation #40

Open cbizon opened 3 years ago

cbizon commented 3 years ago

What are the elements of a graph that we can automatically validate?

Does KGX have biolink validation?

PhillipsOwen commented 3 years ago

i can confirm that the number of nodes/edges loaded into neo4j is the same as the number in the kgx files. that is reported on every kgx graph import result.

we mostly rely on the node/edge normalization processes to return valid predicates and labels. that seems to be the best place to ensure biolink compatibility

Note: deepak got back to me today informing me that he has installed a "graph summary" report in kgx that we may be able to leverage. i also have another question in to Deepak about the biolink validation questions noted above.

PhillipsOwen commented 3 years ago

Deepak indicates that there is some basic validation that may be leveraged in KGX.

https://kgx.readthedocs.io/en/latest/examples.html#validate

cbizon commented 3 years ago

Just double checking:

> i can confirm that the number of nodes/edges loaded into neo4j is the same as the number in the kgx files.

You mean that you can write code that confirms this at load time?

> we mostly rely on the node/edge normalization processes to return valid predicates and labels. that seems to be the best place to ensure biolink compatibility

This will ensure some forms of compatibility (categories and predicates), but it will not help at all with checking domains and ranges.
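
A domain and range check of the kind mentioned here might be sketched as follows. This is only an illustration, not ORION or KGX code: the `CONSTRAINTS` table and the function name are hypothetical, the single entry in the table is a placeholder rather than a statement about the Biolink Model, and the sketch assumes each node's `category` list already includes every category it should be matched against.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical constraints table: predicate -> (required subject category, required object category).
# The entry below is illustrative only; real constraints would be read from the Biolink Model.
CONSTRAINTS: Dict[str, Tuple[str, str]] = {
    "biolink:treats": ("biolink:ChemicalEntity", "biolink:Disease"),
}


def find_domain_range_violations(
    nodes: Iterable[dict],
    edges: Iterable[dict],
    constraints: Dict[str, Tuple[str, str]] = CONSTRAINTS,
) -> List[str]:
    """Return one message per edge endpoint whose category breaks a constraint."""
    # Map each node id to the set of categories assigned to it.
    categories = {node["id"]: set(node.get("category", [])) for node in nodes}
    violations = []
    for edge in edges:
        constraint = constraints.get(edge["predicate"])
        if constraint is None:
            continue  # no domain/range information for this predicate
        domain, range_ = constraint
        label = f"{edge['subject']} {edge['predicate']} {edge['object']}"
        if domain not in categories.get(edge["subject"], set()):
            violations.append(f"{label}: subject is not a {domain}")
        if range_ not in categories.get(edge["object"], set()):
            violations.append(f"{label}: object is not a {range_}")
    return violations
```

Returning a list of violations, rather than printing them, lets a loader decide whether to fail the build or only record the problems.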

PhillipsOwen commented 3 years ago

the old KGX prints out the number of nodes/edges it inserts into the graph based on the data that comes in. that does not necessarily indicate that all of what came in made it to the graph.

i plan on looking at the 2 enhancements deepak mentioned today on biolink validation and enhanced reporting. my hope is that these may have some actionable output we can use programmatically.

also note that the load manager has some metadata about the data services' raw data parse that will give us better insight into the quality of the parsing from that perspective.

cbizon commented 3 years ago

OK, but just to be as clear as possible: we want automated verification of all aspects that are amenable to such. Printing stuff out is not the same.
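
To make the difference between printing and verifying concrete, a count check could raise an error on a mismatch instead of reporting numbers. The sketch below is an assumption-laden illustration: it assumes the KGX files are JSON Lines with one record per non-empty line, and `GraphValidationError`, `count_jsonl_records` and `assert_counts_match` are hypothetical names. The loaded counts would have to be fetched separately, for example with Cypher queries such as `MATCH (n) RETURN count(n)`.

```python
from typing import Dict


class GraphValidationError(Exception):
    """Raised when an automated graph check fails (hypothetical name)."""


def count_jsonl_records(path: str) -> int:
    """Count non-empty lines in a JSON Lines file (assumed KGX file format)."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


def assert_counts_match(expected: Dict[str, int], loaded: Dict[str, int]) -> None:
    """Raise GraphValidationError if any expected count differs from the loaded count.

    `expected` holds counts taken from the source files; `loaded` holds counts
    queried from the database after the import.
    """
    mismatches = {
        kind: (expected[kind], loaded.get(kind))
        for kind in expected
        if loaded.get(kind) != expected[kind]
    }
    if mismatches:
        details = ", ".join(
            f"{kind}: expected {want}, loaded {got}"
            for kind, (want, got) in sorted(mismatches.items())
        )
        raise GraphValidationError(f"count mismatch ({details})")
```

Because the check raises, a load pipeline that calls it stops on failure without anyone having to read the log.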

PhillipsOwen commented 3 years ago

i understand. i will see about loading some smaller datasets into kgx tomorrow to see what we can pull out.

cbizon commented 2 years ago

One aspect of this is doing biolink validation a la KGX

cbizon commented 2 years ago

Another is checking provenance sources #97

cbizon commented 2 years ago

Another is checking that each edge has the appropriate validation properties #105
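
A per-edge property check along these lines could be sketched as below. The thread does not say which properties are required, so the property names are left as a parameter and the names in the usage example are hypothetical placeholders, as is the function name.

```python
from typing import Iterable, List, Sequence, Tuple


def find_edges_missing_properties(
    edges: Iterable[dict],
    required: Sequence[str],
) -> List[Tuple[int, List[str]]]:
    """Return (edge index, missing property names) for every incomplete edge.

    A property counts as missing when it is absent, None, an empty string,
    or an empty list.
    """
    problems = []
    for index, edge in enumerate(edges):
        missing = [name for name in required if edge.get(name) in (None, "", [])]
        if missing:
            problems.append((index, missing))
    return problems


# Usage with placeholder property names (not taken from the issue):
example_edges = [
    {"subject": "A", "predicate": "biolink:related_to", "object": "B", "source": "x"},
    {"subject": "A", "predicate": "biolink:related_to", "object": "C", "source": ""},
]
problems = find_edges_missing_properties(example_edges, ["predicate", "source"])
```

An empty result means every edge carries all of the listed properties; anything else can be turned into a failed validation step.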