CategoricalData / CQL

Categorical Query Language IDE
http://categoricaldata.net
299 stars 23 forks source link

Null foreign keys during import #51

Closed o1lo01ol1o closed 4 years ago

o1lo01ol1o commented 4 years ago

Is there a recommended way of handling nulls in an import where I've specified a value should be a foreign key in a schema?

wisnesky commented 4 years ago

Nullable foreign keys are indeed annoying to clean up to represent in category theory, but the general approach isn’t too bad: if the the foreign keys are usually null, then those columns should probably be represented as CQL attributes. If they are not usually null, then the thing to do is import onto a schema with only attributes, and the clean the data to establish foreign keys. There’s a few different ways to do this, but the conceptually simplest would be to land onto a “combined” schema S+S’, where S’ has the foreign keys you are interested in and S does not. Then you can write a “constraints” expression C to express 1) that S’ is a copy of S, and 2) how the foreign keys in S relate the the attributes copied over from S. Then the CQL command “chase C” will create the minimal number of fresh IDs needed to make the foreign keys non-null.

There’s also a way to do this with CQL schema mappings and the ‘pi’ operation that has a more cascade_delete kind of semantics (and cascade_delete of constraints is also available). The algorithm below came out of a math paper, so take it with a grain of salt. Here C is the schema you want, and SQL(C) is the non-FK version, and C0 encodes primary keys.

Define a PK-schema to be a CQL schema mapping i: C0-->C with the properties that C0 has discrete entity side, finitely many attributes, and no equations, and i is identity on objects. For an object c in C, let c_0 be the unique object with i(c_0)=c. Note that for c in C, we think of the attributes of c_0 as the primary key columns of c. Each table in C has a unique primary key, namely the set of columns in c_0.

Given a PK-schema i: C0-->C, we form a new schema S=SQL(i) as follows: for each entity c of C, put an entity of the same name, c, in S; for each attribute att: c--> tau in C, put an attribute of the same name and type in S, i.e., att: c-->tau; for each foreign key f: c-->c' in C, and each pk column att: c'_0 --> tau on c', put an attribute f_att: c--> tau on c in S. We have effectively replaced a PK-schema C with its relational version. Now there is an induced functor q_C: SQL(C)-->C. Send each entity to itself, each attribute of the form att:c-->tau to the attribute of the same name and type in C, and each attribute of the form f_att: c-->tau to the observable (att o f): c-->c'-->tau in C. Now Pi along q_C will be what you want.

Regardless of which approach you choose, I’d happy to chat further.

On Jul 18, 2020, at 3:54 PM, Tim Pierson notifications@github.com wrote:

Is there a recommended way of handling nulls in an import where I've specified a value should be a foreign key in a schema?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/CategoricalData/CQL/issues/51, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA2QKN6B77477EQJZFIMXCTR4IRY3ANCNFSM4PAPI4OA.

o1lo01ol1o commented 4 years ago

Thanks for the reply!

In this particular case, the former approach seems to be better suited since I basically need to fill-out the universe of foreign entities (concretely, I have several mailing address attribute sets, each of which has a column for states but not all address have states.)

In this particular case, since I'm not already landing the import into an attribute-only schema and there are a couple hundred attributes to map, I might just catch the nulls in the import_jdbc statement and map them to 'non-state-address' value. This is not a scalable solution in general, but maybe it will suffice.

It's becoming clear to me that in general the way to go with these import-sanitized-ETL processes is to land everything to attribute-only schemata unless I know there are no nulls in the foreign keys.

In general, are there good methods to catch data integrity errors without waiting for the instance transforms to throw errors during runtime? In the above case, I have a transformation that seems to need to run for a dozen hours, so I've been working on subsets of the tables to get the instance imports working. However, this leads to problems when a subset of rows might not contain all the necessary entities for a foreign key, or there may be a null or an un-castable string value lurking in the unseen rows.

wisnesky commented 4 years ago

My general advice would be to try to structure the CQL program so that referential integrity violations are not caught at runtime during import, but established by CQL queries or mappings. In addition, evaluating CQL queries tend to be faster than evaluating sigma or chasing constraints, and sigma is faster than the chase. A primary focus of the commercial tool is the scalable execution of both sigma and the chase by non proving methods; the open source tool operates by construction of decision procedures, which doesn’t scale well. Right now the practical limit seems to be about 10GB workloads; in our chemistry case study (https://github.com/kris-brown/cql_data_integration https://github.com/kris-brown/cql_data_integration), using eval, sigma, and cascade_delete we were able to import SQL data with about 20% missing foreign keys in a few minutes.

On Jul 19, 2020, at 2:37 AM, Tim Pierson notifications@github.com wrote:

Thanks for the reply!

In this particular case, the former approach seems to be better suited since I basically need to fill-out the universe of foreign entities (concretely, I have several mailing address attribute sets, each of which has a column for states but not all address have states.)

In this particular case, since I'm not already landing the import into an attribute-only schema and there are a couple hundred attributes to map, I might just catch the nulls in the import_jdbc statement and map them to 'non-state-address' value. This is not a scalable solution in general, but maybe it will suffice.

It's becoming clear to me that in general the way to go with these import-sanitized-ETL processes is to land everything to attribute-only schemata unless I know there are no nulls in the foreign keys.

In general, are there good methods to catch data integrity errors without waiting for the instance transforms to throw errors during runtime? In the above case, I have a transformation that seems to need to run for a dozen hours, so I've been working on subsets of the tables to get the instance imports working. However, this leads to problems when a subset of rows might not contain all the necessary entities for a foreign key, or there may be a null or an un-castable string value lurking in the unseen rows.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/CategoricalData/CQL/issues/51#issuecomment-660616797, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA2QKN5RWAMTNCNQTZ2LPYTR4K5F5ANCNFSM4PAPI4OA.

o1lo01ol1o commented 4 years ago

but established by CQL queries or mappings.

That makes sense generally though I suppose when working with messy data there's just going to have to be some back and forth until things work.

the open source tool operates by construction of decision procedures, which doesn’t scale well

I've just pared down the workload of this task considerably and I'm running into some surprisingly poor performance now. I'll follow up in #46 since it's due to the workflow mentioned there.

chemistry case study (https://github.com/kris-brown/cql_data_integration https://github.com/kris-brown/cql_data_integration)

Ah, that repo has cleaned up nicely since I saw it last; it's a good point of reference!