ge-high-assurance / RACK

DARPA's Automated Rapid Certification of Software (ARCOS) project called Rapid Assurance Curation Kit (RACK)
BSD 3-Clause "New" or "Revised" License

Entity Resolution #132

Closed russell-d-e closed 9 months ago

russell-d-e commented 4 years ago

While analyzing the material provided by TA4, a recurring situation has presented itself: the same entities are referenced in multiple source documents.
As an example: you may have a Requirement Document (ReqDoc1). In this document you have a section that defines a REQUIREMENT, the associated text, and some traceability information, something like:

  [R-1] System shall do something…
    Higher Level Req-1
    Lower Level Req-1
    Lower Level Req-2

There may be a trace matrix as an appendix that shows just the tracing for items:

  Higher Level Req-1 : R-1, R-2, R-3

Additionally, there may be separate documents (ReqDoc2) that capture some of the same information, for example a Lower Level Req document:

  LLR-1 Software shall do something…
    System Spec R-1
    Code File 1

with its own trace matrix:

  System Spec R-1 : LLR-1, LLR-2

And maybe a test description document (VerDoc) that shows the traceability to verification:

  R-1
  --LLR-1
  ----Test-1
  --LLR-2
  ----Test-2

We need to address how we are going to handle this entity resolution, so that within the knowledge database we can associate the entities that refer to the same item (e.g., R-1 in ReqDoc1, System Spec R-1 in ReqDoc2, and R-1 in VerDoc).

A key component of this is that, as shown in the example above, there likely will not be exact matches between the documents for any type of unique key that is common to all documents. From what we have seen in the TA4 docs, there are cases of differing casing (Scope v. SCOPE), abbreviations (IO v. I/O), and a multitude of other subtle and not-so-subtle differences between unique identifiers in the documents.

In the TA4 ingestion meeting today, @glguy, Paul Cuddihy, and I discussed this, and we are thinking that likely the best way to handle it is to create a multi-step process: we could ingest data with “unresolved entities”, and then, after all documents are loaded, a resolution phase would go through and make the associations between the unresolved entities (automatically or with assistance from a user).

glguy commented 4 years ago

Questions I've been thinking about:

Linking entities

We've talked about two different approaches to actually linking entities across ingestion.

While defining an equivalence relation would preserve the most data, I think it would probably make writing any query on the data next to impossible.

If we post-process the data, replacing links to temporary unresolved instances with links to canonical references, we'll want to be careful to record that action somehow, both to assist with traceability of the action and to help with future entity resolution of related data.

We might want a property to mark instances that we know are temporary. We might want another property listing all the unique identifiers that have been merged into a canonical instance.

We probably want to pick a design that doesn't create a shadow copy of the whole ontology's subtypes specific to unresolved references.

Tool development

Whatever we do, we'll probably want to build some tools and queries that assist in finding unresolved links between data. This will probably involve a large number of heuristics.
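
For illustration only, here is a rough sketch of one such heuristic query: pair up entities whose identifiers agree after simple normalization and that are not yet linked. The `rack:` names are placeholders, not the actual ontology, and the normalization shown (lowercasing, stripping non-alphanumerics) is just one of the many heuristics we would likely need.

```sparql
# Hypothetical sketch; property names are placeholders.
PREFIX rack: <http://example/rack#>

SELECT DISTINCT ?a ?b ?idA ?idB
WHERE {
  ?a rack:identifier ?idA .
  ?b rack:identifier ?idB .
  FILTER (?a != ?b)
  # normalize so that "Scope" matches "SCOPE" and "I/O" matches "IO"
  FILTER (REPLACE(LCASE(?idA), "[^a-z0-9]", "") =
          REPLACE(LCASE(?idB), "[^a-z0-9]", ""))
  # only report pairs that have not been linked yet
  FILTER NOT EXISTS { ?a rack:sameAs ?b }
  FILTER NOT EXISTS { ?b rack:sameAs ?a }
}
```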

russell-d-e commented 4 years ago

I have been thinking about how we might perform this entity resolution, and really it seems like a three-step, high-level process is all it would take.

  1. Extract Data from Individual Sources

Each document is processed individually to create subgraphs representing the data in the document. These are separate subgraphs, and all entities are distinct.

  2. Entity Association

In the second step, relationships are added to associate entities from different sources together (sameAs). The direction should be from “prime” to “reference”, prime being the original version of the entity.

Associations could be auto-created, or manually defined in the event they cannot be found automatically.

Potentially, SADL rules could be created to drive the reasoning that creates these relations.
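
Whatever form the rule engine takes, the shape of such a rule might look roughly like the following SPARQL update. This is only a sketch, with placeholder graph and property names, and it assumes the prime entity lives in the graph of the document that defines it.

```sparql
# Hypothetical sketch: auto-create sameAs links from primes to references.
# Graph and property names are placeholders.
PREFIX rack: <http://example/rack#>

INSERT { ?prime rack:sameAs ?ref }
WHERE {
  GRAPH <http://example/ReqDoc1> { ?prime rack:identifier ?idP . }
  GRAPH ?otherDoc                { ?ref   rack:identifier ?idR . }
  FILTER (?otherDoc != <http://example/ReqDoc1>)
  # a deliberately simple matching heuristic; real rules would be richer
  FILTER (LCASE(?idP) = LCASE(?idR))
}
```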

Even better would be to use project-specific subclasses for requirement types.

  3. Model Simplification

The final step would be to produce a “simplified” model that collapses the sameAs relationships into a single ENTITY. The “prime” entity would be the content that remains, while the other entities' relationships would be removed or redirected to the prime. This model would then be the one used for queries.

glguy commented 4 years ago

Could you say more about the mechanism by which we implement step 3?

glguy commented 4 years ago

Is this the correct implementation of sameAs?

sameAs describes THING with values of type THING.

edit: fixed THING

russell-d-e commented 4 years ago

@glguy Yes, your definition of sameAs is exactly what I was thinking. Although I think with the latest ontology it would be THING, not DATA:

sameAs describes THING with values of type THING

For step 3, that is really where the effort would lie, but off the top of my head we could have a second model for the simplified version. I am not 100% sure, but I think you would be able to write a set of SPARQL queries that would do the conversion, so it would just be a matter of running a set of queries (sketched after the list). The basic steps for this would be:

  1. Empty the Simplified Model
  2. Copy the entire Input Model to the Simplified Model
  3. Insert a derived THING for each sameAs relationship, and copy the relations for each sameAs relationship to the derived THING
  4. Delete the original THINGs from the Simplified Model
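
Purely as a sketch of what that query sequence might look like: the graph names and `rack:` properties below are placeholders, sameAs is assumed to point directly from each prime to every secondary, and for simplicity the sketch collapses each secondary into its prime in place rather than minting a separate derived THING.

```sparql
# Hypothetical sketch of the simplification pass; all URIs are placeholders.
PREFIX rack: <http://example/rack#>

# 1) Empty the Simplified Model
DROP SILENT GRAPH <http://example/simplified> ;

# 2) Copy the entire Input Model into the Simplified Model
INSERT { GRAPH <http://example/simplified> { ?s ?p ?o } }
WHERE  { GRAPH <http://example/source>     { ?s ?p ?o } } ;

# 3) Redirect links that point at a secondary entity so they point at
#    its prime instead
DELETE { GRAPH <http://example/simplified> { ?s ?p ?sec } }
INSERT { GRAPH <http://example/simplified> { ?s ?p ?prime } }
WHERE  {
  GRAPH <http://example/simplified> {
    ?prime rack:sameAs ?sec .
    ?s ?p ?sec .
    FILTER (?p != rack:sameAs)
  }
} ;

# 4) Delete the secondary entities' own triples, then the sameAs links
DELETE { GRAPH <http://example/simplified> { ?sec ?p ?o } }
WHERE  {
  GRAPH <http://example/simplified> {
    ?prime rack:sameAs ?sec .
    ?sec ?p ?o .
  }
} ;
DELETE WHERE { GRAPH <http://example/simplified> { ?prime rack:sameAs ?sec } }
```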

To me, the benefit of this approach is that you always have the original source material unchanged, so changes to the source material would just require regenerating the simplified model.

Obviously this simplistic approach could run into performance problems with a large model, as you would be re-running the simplification logic on the same elements each time. This may not be an issue if it takes 5 minutes; it would be an issue if it takes 5 weeks, or even 5 hours.

This approach would run into some complications if TA3 wanted to feed data back into the model based on the Simplified Model. I don't think it would be insurmountable, but you would then have to copy/expand the data added to the Simplified Model back into the original model. Again, I haven't actually done this, but it conceptually seems possible with a series of queries (partially sketched after the list):

  1. Identify all THINGs in the simplified model that are not derived THINGs and are not in the source model
  2. Add all THINGs identified in 1 to the source model
  3. Add relationships for all the THINGs created in 2 to the corresponding original THINGs
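
A partial sketch of steps 1 and 2, under the same placeholder names (step 3, re-expanding relationships onto the corresponding original THINGs, would need the recorded sameAs links and is omitted here):

```sparql
# Hypothetical sketch: copy THINGs that exist only in the Simplified
# Model back into the source model. rack:DerivedThing is an assumed
# marker class; all names are placeholders.
PREFIX rack: <http://example/rack#>

INSERT { GRAPH <http://example/source> { ?s ?p ?o } }
WHERE {
  GRAPH <http://example/simplified> { ?s ?p ?o }
  # step 1: only subjects not already present in the source model...
  FILTER NOT EXISTS { GRAPH <http://example/source> { ?s ?anyP ?anyO } }
  # ...and not derived THINGs created by the simplification pass
  FILTER NOT EXISTS { ?s a rack:DerivedThing }
}
```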

davearcher commented 4 years ago

Just joining this party a little late. The ingest-connect-collapse idea seems good, but I think it might be good to do entity resolution before ingest. Several reasons for the "resolve early and often" approach:

1. If not careful, in step 2 we could create a boatload (yes, that's a technical term) of extra sameAs relationship instances, and it's not quite clear how you'd terminate the association loop.
2. If we resolve after ingest, then for some period of time the database is in an unclean state where queries can't be run. Think commits in your usual RDBMS, but more evil, because regular queries would miss associations that are important.
3. If we resolve pre-ingest, we get the benefit that whoever is ingesting is standing right there and can help with any uncertainties in the entity resolution process.

Another thing to say is that in step 2, the reasoning revolves around "if unique_identifier == unique_identifier". But if it were that easy, we'd be done already, I think. We'd have to define a fuzzy matching algorithm for each entity class, probably based on things such as edit distances of various attribute pairs. That's an interesting research problem that we should probably look at. Or, I might be missing something here...

russell-d-e commented 4 years ago

> Just joining this party a little late. The ingest-connect-collapse idea seems good, but I think it might be good to do entity resolution before ingest. Several reasons for the "resolve early and often" approach:

In my thinking, this whole process would be the ingestion. So while I separated them into unique steps, I was thinking of them as really being part of the same ingestion process, not separate actions that a user would have to take. So I think this would meet the 'resolve early and often' approach that you suggested.

> 1. If not careful, in step 2 we could create a boatload (yes, that's a technical term) of extra sameAs relationship instances, and it's not quite clear how you'd terminate the association loop.

Agreed, but to me this really falls under the GIGO philosophy. I would expect us to report an error if we ran into something that we could not handle, like a sameAs loop (A --sameAs-> B --sameAs-> C --sameAs-> A). We would handle that by raising an error informing the user that they need to fix their garbage.

> 2. If we resolve after ingest, then for some period of time the database is in an unclean state where queries can't be run. Think commits in your usual RDBMS, but more evil, because regular queries would miss associations that are important.
> 3. If we resolve pre-ingest, we get the benefit that whoever is ingesting is standing right there and can help with any uncertainties in the entity resolution process.

Again, this is why I was thinking that the resolution happens as part of the ingestion, not at some later date. My biggest concern with this type of approach is how quickly the entity resolution can be performed. Nothing would be worse than an ingestion process that repeats 10 seconds of action followed by 10 minutes of waiting. In many ways I see this as analogous to a SW build process: building a SW application involves multiple unique tasks (pre-processing, compiling, linking), but typically the human only kicks off the build and only has to be involved after that if a problem is encountered.

> Another thing to say is that in step 2, the reasoning revolves around "if unique_identifier == unique_identifier". But if it were that easy, we'd be done already, I think. We'd have to define a fuzzy matching algorithm for each entity class, probably based on things such as edit distances of various attribute pairs. That's an interesting research problem that we should probably look at. Or, I might be missing something here...

This was just included as a simple example; in practice I would expect there to be a bit more to the rule. However, there may not be as much as you would think. Typically we are going to be extracting this information from very rigidly formatted documents. The differences between sources' unique identifiers are going to be things like added prefixes, and the source materials are going to have clearly defined relations between items that we should be able to exploit. The situations we would most likely run into that are not very formulaic are things like typos or slight differences in spelling (I/O v. IO), and I would expect these to be rather few and far between. These may be best served by ingestion rules that report out at the end, saying an entity relation could not be resolved, and then allowing the user to feed back information to correct it.
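
As a toy example of such an ingestion rule (placeholder names again, and "SYS-R-1" vs. "R-1" is an invented added-prefix case), a suffix test catches prefix-added identifiers; whatever it leaves unmatched is what gets reported to the user.

```sparql
# Hypothetical sketch: find identifier pairs that differ only by an
# added prefix, e.g. "SYS-R-1" vs. "R-1". Names are placeholders.
PREFIX rack: <http://example/rack#>

SELECT ?ref ?def ?idRef ?idDef
WHERE {
  ?ref rack:identifier ?idRef .
  ?def rack:identifier ?idDef .
  FILTER (?ref != ?def)
  # deliberately loose: this would also pair "LLR-1" with "R-1", so the
  # results are candidates for review, not automatic links
  FILTER (STRENDS(LCASE(?idRef), LCASE(?idDef)))
}
```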

davearcher commented 4 years ago

Perfect! I could see rolling out entity resolution criteria for diverse entity classes in the data model over time (but not over TOO long a time), to get buy-in from TA1s and then to release them into RACK tools as part of the ingest process...

glguy commented 4 years ago

(more commentary)

I think it would be good if we could distinguish external references from definitions.

See this in action with C distinguishing between `int global;` and `extern int global;`. It's important to know whether you're defining something or just trying to refer to it. If we're going to have a bunch of different ingestion tools, it's likely that they won't be good at guessing the URIs (or whatever unique identifier) we land on. We certainly don't want to guess these identifiers only to accidentally create matches that shouldn't exist.

We might be doing ingestion with the aid of RACK. This might enable us to resolve entities during ingestion. When we're doing this we're going to want to use the definitive URI in our data ingestion. Knowing which are the definitions and which are the references will help with this.

I think it would be good if we could distinguish unresolved external references from resolved external references.

I'd like our tooling to be able to find references that still need to be matched up to some ingested entity.
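
As a sketch of what that tooling query could look like (the `ExternalReference` class and `resolvedTo` property are invented names for the idea, not existing ontology terms):

```sparql
# Hypothetical sketch: list external references not yet matched to a
# definition. Class and property names are assumptions.
PREFIX rack: <http://example/rack#>

SELECT ?ref ?id
WHERE {
  ?ref a rack:ExternalReference ;
       rack:identifier ?id .
  FILTER NOT EXISTS { ?ref rack:resolvedTo ?def }
}
```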

Separating external references helps us avoid having to delete data during resolution.

If we're setting properties on external references to help us find the matching definition, we can leave those properties set even after we find the match. Later we can use them to answer the question of why we believed this reference resolved to this definition.

If we have a very strong, symmetric, transitive relationship for sameAs, then we'll need to remove any properties from our external references when we link them to the definitions, to keep those properties from applying to the definition as well.

If we leave the external references around we can use them to speed up entity resolution when we rerun an ingestion tool. The external references might actually resolve to the same unique identifier.

davearcher commented 4 years ago

This idea about separating external references from definitions might be a good use of the PROV construct alternateOf(e1, e2).

However, that approach doesn't solve the unresolved external references problem. We'd still need to know when a thing was a reference, even if the true thing was missing. I suppose we could, if we knew X was an extern, use an abomination such as alternateOf(X,X) to do that job...?
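
For what it's worth, both uses can be written down directly; prov:alternateOf is the real PROV-O property, while the `ex:` individuals here are invented for illustration.

```sparql
PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX ex:   <http://example/>

INSERT DATA {
  # a resolved external reference: two entities presenting the same thing
  ex:ReqDoc2_R1 prov:alternateOf ex:ReqDoc1_R1 .

  # the "abomination": a self-loop marking X as an unresolved extern
  ex:X prov:alternateOf ex:X .
}
```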

russell-d-e commented 4 years ago

I created a Wiki page to capture the design and information related to this issue.