Boeing requirements ingestion issue

AbhaMoitra commented 3 years ago

I am using the released RACK V5 version and ingest_REQUIREMENT.json from Git on the Boeing requirements data. The 2 Boeing requirements data files can be found on the RedShare in the folder TeamWorkSpace/Abha. I loaded REQUIREMENT1.csv (283 records imported) and then REQUIREMENT2.csv (724 records imported) successfully. When I then query for requirements, I get 4425 results. The results have also been uploaded to RedShare in the same folder. 2 observations:

In some rows I see values in the columns wasDerivedFrom_identifier and wasImpactedBy_identifier, which is surprising.
For requirements that had more than one value for "satisfies" in the input file, the results contain every possible combination of satisfies_identifier, wasDerivedFrom_identifier and wasImpactedBy_identifier. For example, rows 34-36 in REQUIREMENT2.csv have 1 null entry and 2 unique entries in the "satisfies" column; the corresponding results are in rows 20-27 of results.csv. So 2 values of "satisfies" resulted in 8 rows in the results file.

cuddihyge commented 3 years ago

SUMMARY: This looks to me like an ontology "error" or misunderstanding of some kind. It looks like SemTK is honoring the SADL. We may need more functionality for sub-property queries.

1) Note in the picture on the nodegroup canvas (right), that we ingest the "wasDerivedFrom" (about 9 o'clock in the picture) and "satisfies" (about 3 o'clock).

2) Note on the left that "satisfies" is a sub-property of "wasImpactedBy" which is a sub-property of "wasDerivedFrom"

What is happening is:

SemTK ingests the "satisfies" relationship as exactly that: satisfies
when you run a select query and ask for wasDerivedFrom, SemTK presumes you mean any kind of wasDerivedFrom property, which includes "satisfies".
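
To make the row count concrete, here is a minimal sketch of the behavior described above, in plain Python rather than SemTK or SPARQL. The requirement and system identifiers are hypothetical, and the three-column join is a simplification of the real ingest_REQUIREMENT nodegroup. It assumes only the hierarchy stated in this thread: "satisfies" is a sub-property of "wasImpactedBy", which is a sub-property of "wasDerivedFrom".

```python
from itertools import product

# Sub-property hierarchy as described in this thread:
# satisfies -> wasImpactedBy -> wasDerivedFrom
SUB_PROPERTY_OF = {
    "satisfies": "wasImpactedBy",
    "wasImpactedBy": "wasDerivedFrom",
}

def is_sub_property(prop, ancestor):
    """True if prop equals ancestor or is a (transitive) sub-property of it."""
    while prop is not None:
        if prop == ancestor:
            return True
        prop = SUB_PROPERTY_OF.get(prop)
    return False

# Hypothetical data: one requirement ingested with two "satisfies" links only.
TRIPLES = [
    ("REQ-A", "satisfies", "SYS-1"),
    ("REQ-A", "satisfies", "SYS-2"),
]

def values(subject, queried_prop):
    """Objects reachable from subject via queried_prop or any of its sub-properties."""
    return [o for s, p, o in TRIPLES
            if s == subject and is_sub_property(p, queried_prop)]

def select_rows(subject):
    """One result row per combination of the three queried columns."""
    columns = [values(subject, p)
               for p in ("satisfies", "wasImpactedBy", "wasDerivedFrom")]
    return list(product(*columns))

rows = select_rows("REQ-A")
print(len(rows))  # 2 * 2 * 2 = 8
```

Each of the three columns matches both "satisfies" triples, so the SELECT-style join yields 2 x 2 x 2 = 8 rows even though only two relationships were ingested.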

What we need to discuss:

is the ontology and its sub-properties the way we want them to be?
is there a way to form a better query? Perhaps ingest_REQUIREMENT query doesn't need to work correctly as SELECT, and we're done.
do we need a way to tell SemTK that a class or property in a query should be matched exactly, instead of always presuming subclasses and subproperties should be matched? This is doable but might also increase confusion.

Just for fun: run ingest_REQUIREMENT as a construct query and you'll see the underlying data is correct, and the CONSTRUCT results display it properly.

This is a fascinating case. Any suggestions on where/how to document it? SemTK needs its own StackOverflow?

AbhaMoitra commented 3 years ago

@cuddihyge : thanks for the detailed explanation and SemTK is doing what we are asking it to do. Let me think about it. @kityansiu : FYI.

AbhaMoitra commented 3 years ago

@cuddihyge @kityansiu : It was Kit's idea to try the SPARQL query that we had in V4.1. When we did that, those 2 columns (wasDerivedFrom, wasImpactedBy) were empty and we got a reasonable number of rows returned. So it would seem that the SPARQL query rewriting that was done for performance improvement has resulted in this change in output.

cuddihyge commented 3 years ago

I believe it is more likely that V4.1 didn't support sub-properties yet. It isn't an improvement in performance, it is an improvement in handling sub-properties :-) I think the current version is honoring the SADL.

If we need to be able to form queries in SPARQLgraph that query ONLY a property and class (not its sub-properties and sub-classes) then that could be added. This is a moderate-sized task, so I'd recommend we consider whether that's the best solution before embarking on the improvement.

AbhaMoitra commented 3 years ago

@kityansiu : To check that the results in SemTK match running a SPARQL query in SADL, I did the following. I have a query in SADL that asks for "wasImpactedBy" and another query for "satisfies" as shown below. Note that there are instances where the "satisfies" query returns results while the first query does not. (I spelled out 'satisfies' in the query as there are 2 different 'satisfies'.) So, to me the SemTK behavior does not match SADL behavior - let me know if this does not illustrate that. Could it be that the results depend on which reasoner is employed?

cuddihyge commented 3 years ago

Yes. In SADL you are asking for a specific relationship: wasImpactedBy.

When you draw a nodeGroup in SemTK, it presumes that all subClasses and subProperties are also matches. So it writes more complex queries.
Where your query is: select * where { ?x <wasImpactedBy> ?z } SemTK interprets the nodegroup as: select * where { ?x ?prop ?z . ?prop rdfs:subPropertyOf* <wasImpactedBy> } (I just hand-typed that so the syntax might be imperfect.)

If we've gotten to a point where it is needed, I could design and implement an override such that a SemTK class could be interpreted as "exactly this class" and a property as "exactly this property" instead of the current default behavior. Perhaps an extra check box.

This would be reasonable and consistent. It is a moderate-sized task. I don't think we're convinced this is the best solution yet, but if we become convinced, we can put it on the board and start work.
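
As a rough illustration of the two query shapes and the proposed override, the sketch below builds SPARQL text in plain Python. It is not SemTK's generator: the function names, the `exact` flag, and the `http://example.org/` IRI are all assumptions made for this example, and the real RACK property IRIs are not given in this thread.

```python
RDFS_PREFIX = "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>"

def edge_pattern(subject_var, prop_iri, object_var, exact=False):
    """Return the SPARQL graph pattern for one nodegroup edge.

    exact=False mimics the default described above (sub-properties also match);
    exact=True is the hypothetical "exactly this property" override.
    """
    if exact:
        return f"?{subject_var} <{prop_iri}> ?{object_var} ."
    prop_var = f"?{object_var}_prop"
    return (
        f"?{subject_var} {prop_var} ?{object_var} . "
        f"{prop_var} rdfs:subPropertyOf* <{prop_iri}> ."
    )

def select_query(edges):
    """Build a SELECT query from (subject_var, prop_iri, object_var, exact) tuples."""
    body = "\n  ".join(edge_pattern(*edge) for edge in edges)
    return f"{RDFS_PREFIX}\nSELECT * WHERE {{\n  {body}\n}}"

# Placeholder IRI (assumption); substitute the real ontology IRI.
WAS_IMPACTED_BY = "http://example.org/wasImpactedBy"

print(select_query([("x", WAS_IMPACTED_BY, "z", False)]))  # default: sub-properties match
print(select_query([("x", WAS_IMPACTED_BY, "z", True)]))   # override: exact property only
```

Keeping the flag on each edge, rather than on the whole query, would let one nodegroup mix exact and sub-property matching.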

AbhaMoitra commented 3 years ago

@kityansiu: are we ok with the results as is in SemTK?

As an exercise: The SADL query results above were obtained with OWL_MEM as the reasoner. I then re-ran the exact same 2 queries with the OWL_MEM_RDFS reasoner and I got results for both queries. So, what we saw was not because of querying for a specific relationship, but because of which reasoner was used. So, IF WE WANTED to alter the behavior in SemTK, is it possible to simply change the reasoner employed?

We may be ok with the results in SemTK as is - I just want to know whether we can change the reasoner engine in SemTK.
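
One plausible reading of the difference between the two reasoner runs, sketched in plain Python: it assumes the first setting performs no sub-property inference and the second applies the RDFS sub-property rule (if p is a sub-property of q, then every `s p o` also implies `s q o`). The data and names are hypothetical, and this does not model the actual SADL or Jena reasoners.

```python
# Hypothetical hierarchy and data, mirroring the thread.
SUB_PROPERTY_OF = {"satisfies": "wasImpactedBy", "wasImpactedBy": "wasDerivedFrom"}
ASSERTED = {("REQ-A", "satisfies", "SYS-1")}

def materialize(triples):
    """Apply the RDFS sub-property rule until no new triples appear."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        for s, p, o in list(inferred):
            parent = SUB_PROPERTY_OF.get(p)
            if parent and (s, parent, o) not in inferred:
                inferred.add((s, parent, o))
                changed = True
    return inferred

def exact_match(triples, prop):
    """Plain pattern match on one property, with no sub-property expansion."""
    return sorted((s, o) for s, p, o in triples if p == prop)

print(exact_match(ASSERTED, "wasImpactedBy"))               # without inference
print(exact_match(materialize(ASSERTED), "wasImpactedBy"))  # with inference
```

Under these assumptions the same exact-property query is empty on the asserted triples and non-empty once the inferred triples are added, which matches the pattern reported above.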

cuddihyge commented 3 years ago

SemTK doesn't currently use a reasoner. It generates SPARQL from the nodegroup.
If we want two options (1) normal subclass/subproperty (2) exact class / exact property then I would have to change the SPARQL generator.

I would also suspect that we wouldn't want this choice to be global, but I'm not sure. I had envisioned a checkbox on a specific edge or node signifying that it should only look for exact matches.

kityansiu commented 3 years ago

Here's my take on this. From a TA3 point of view, I think we are okay with the current SemTK SPARQL; I don't see a need to specify exact class or exact property at the moment. My explanation below concurs with Paul's bullet that "ingest_REQUIREMENT query doesn't need to work correctly as SELECT".

When I query, I might at first ask for the parent class or parent property, and then as I learn more about the data, I will alter the query to ask for the exact class or exact property. In our REQUIREMENT example, I might first ask for wasImpactedBy, get a handle on what all is returned, and then modify my query to ask for mitigates, satisfies, or governs. Further, with our union query, a user can ask for combinations of these subproperties.

REQ_wasImpactedBy

Let's not rush into implementing a solution at the moment, since we haven't seen an example where this is causing problems or a major inconvenience.

AbhaMoitra commented 3 years ago

Done

ge-high-assurance / RACK

Boeing requirements ingestion issue #319