CategoricalData / CQL

Categorical Query Language IDE
http://categoricaldata.net
299 stars 23 forks source link

How to constrain the id of a generator given an existing instance? #46

Closed o1lo01ol1o closed 4 years ago

o1lo01ol1o commented 4 years ago

Sorry if this is not the place for these questions, but I couldn't figure this out from the examples.

I have a dataset from postgres that has a number of columns defined as non-sequential, integer-valued enumerations. Elsewhere, there is documentation on each of element of these enumerations. How can I import the data from jdbc and ensure that the appropriate generators use the correct id values for the correct multi equations?

For example, say I had the following:


schema S = literal : sql {
    entities
        Foo 
        Bar
    foreign_keys
        bar_type : Foo -> Bar
    attributes
        bar_type_description : Bar -> Varchar 
 }

instance J = import_jdbc "jdbc:postgresql://localhost/foo"  : S {
    Foo -> 
        "
        SELECT row_number() OVER () as id
          , cast(bar_type as integer) as bar_type
          from foo
        "
        Bar -> "
           SELECT cast(bar_type as integer) as id,
           null as bar_type_description 
           from foo
           "
}

instance BarTypes = literal : S {
    imports 
        J
    generators 
        b1 b3 b1111 : Bar
    multi_equations 
    bar_type_description -> {b1 Bur, b3 Bhar, b1111 Bargh}

}

And in the data bar_type is only ever one of 1, 3, 1111; I'd like to make sure that the multi equations provide the correct string for the correct id.

wisnesky commented 4 years ago

Ah, good question: the answer is, you can’t. CQL will not allow you to use generator names as data, as that would break the principle that no CQL query can distinguish isomorphic databases.

Instead, anything that you want to be treated as data must be imported as an attribute. Think of the “id” column being imported as a “meaningless autogenerated row identifier” used only to connect foreign keys. If you want to get access to the value, you’ll need an attribute, which you can then populate using the id column. As your data migrates, these generators will change, but the value held in the attribute will not.

It looks like the description is a derived attribute. There’s a number of ways to approach your problem, but probably the simplest way, based on your description of only having a few input types and having conditional logic, is to define a new typeside, say sql’, that includes a function lookup : Varchar -> Varchar (or similar). Its body would say something like “if input[0] == 1 then Alice else if input[0] == 2 then bob else throw error” or similar. Anywhere that CQL accepts the sql typeside it should also accept sql’; if not, please report.

Then, you import your data as you are doing now, not onto S, but S on the sql’ typeside. Then you create a new schema S’, importing S, with an observation equation saying “forall x. x.derived = lookup(x.id_as_att)”, and perform a sigma of the imported data along the inclusion mapping from S to S’. That’s a general technique for populating derived attributes, although there are others (for example, we could write CQL ‘constraints’ instead of a typeside function). In fact, the attributes need not be derived; you can use this technique to e.g. solve simultaneous systems of equations (although this will stress the automated theorem prover, of course).

On Jul 10, 2020, at 9:45 AM, Tim Pierson notifications@github.com wrote:

Sorry if this is not the place for these questions, but I couldn't figure this out from the examples.

I have a dataset from postgres that has a number of columns defined as non-sequential, integer-valued enumerations. Elsewhere, there is documentation on each of element of these enumerations. How can I import the data from jdbc and ensure that the appropriate generators use the correct id values for the correct multi equations?

For example, say I had the following:

schema S = literal : sql { entities Foo Bar foreign_keys bar_type : Foo -> Bar attributes bar_type_description : Bar -> Varchar }

instance J = import_jdbc "jdbc:postgresql://localhost/timpierson?setSchema=govt_usa_fed_cfpb_hmda" : S { Foo -> " SELECT row_number() OVER () as id , cast(bar_type as integer) as bar_type from foo " Bar -> " SELECT cast(bar_type as integer) as id, null as bar_type_description from foo " }

instance BarTypes = literal : S { imports J generators b1 b3 b1111 : Bar multi_equations bar_type_description -> {b1 Bur, b3 Bhar, b1111 Bargh}

} And in the data bar_type is only ever one of 1, 3, 1111; I'd like to make sure that the multi equations provide the correct string for the correct id.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/CategoricalData/CQL/issues/46, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA2QKNZ5TNFYKU4OR4JXTSLR25ARBANCNFSM4OWYD5EA.

wisnesky commented 4 years ago

Follow-up: if the lookup function is encoded in the input database, you could instead import onto the schema S’ containing the observation equation, and cql will check that the imported data satisfies the equation. The key point may be more about the typeside than the sigma migration, or vice versa, depending on your ultimate goal and input data.

On Jul 10, 2020, at 9:45 AM, Tim Pierson notifications@github.com wrote:

Sorry if this is not the place for these questions, but I couldn't figure this out from the examples.

I have a dataset from postgres that has a number of columns defined as non-sequential, integer-valued enumerations. Elsewhere, there is documentation on each of element of these enumerations. How can I import the data from jdbc and ensure that the appropriate generators use the correct id values for the correct multi equations?

For example, say I had the following:

schema S = literal : sql { entities Foo Bar foreign_keys bar_type : Foo -> Bar attributes bar_type_description : Bar -> Varchar }

instance J = import_jdbc "jdbc:postgresql://localhost/timpierson?setSchema=govt_usa_fed_cfpb_hmda" : S { Foo -> " SELECT row_number() OVER () as id , cast(bar_type as integer) as bar_type from foo " Bar -> " SELECT cast(bar_type as integer) as id, null as bar_type_description from foo " }

instance BarTypes = literal : S { imports J generators b1 b3 b1111 : Bar multi_equations bar_type_description -> {b1 Bur, b3 Bhar, b1111 Bargh}

} And in the data bar_type is only ever one of 1, 3, 1111; I'd like to make sure that the multi equations provide the correct string for the correct id.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/CategoricalData/CQL/issues/46, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA2QKNZ5TNFYKU4OR4JXTSLR25ARBANCNFSM4OWYD5EA.

o1lo01ol1o commented 4 years ago

Cool, the sigma approach seems to have worked.

Then you create a new schema S’, importing S, with an observation equation saying “forall x. x.derived = lookup(x.id_as_att)”, and perform a sigma of the imported data along the inclusion mapping from S to S’.

In my mapping S -> S', I seem to be required to map the attribute id_as_att -> id_as_att. Is there a way to forget this attribute in the mapping? Since I already have x.derived with the looked-up value, there's no need to keep the Integer around any more.

wisnesky commented 4 years ago

Sure, you could delta along the inclusion mapping S’-without-id_as_att -> S’ to forget id_as_att. Constructing S’-without-id_as_att might be a little tricky since there is no CQL schema section that “de imports”, but you might be able to create a common schema X that is common to all the schemas involved, including S’-without-id_as_att. In this particular case, it may also be possible to go from S to S’-without-id_as_att directly using a CQL query.

On Jul 10, 2020, at 3:25 PM, Tim Pierson notifications@github.com wrote:

Cool, the sigma approach seems to have worked.

Then you create a new schema S’, importing S, with an observation equation saying “forall x. x.derived = lookup(x.id_as_att)”, and perform a sigma of the imported data along the inclusion mapping from S to S’.

In my mapping S -> S', I seem to be required to map the attribute id_as_att -> id_as_att. Is there a way to forget this attribute in the mapping? Since I already have x.derived with the looked-up value, there's no need to keep the Integer around any more.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/CategoricalData/CQL/issues/46#issuecomment-656920508, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA2QKN5LYHUL2MASO5CWGF3R26IMBANCNFSM4OWYD5EA.

o1lo01ol1o commented 4 years ago

Excellent, thanks so much for the help!

o1lo01ol1o commented 4 years ago

I followed the sigma approach but am encountering quite poor performance. Here's an abridged gist

The above was run on an abridged version of the 2018-2019 dataset: 10k rows of all varchar columns. ~4mb for the whole table. The CQL IDE allocates ~ 2.7gb of heap and has been running 20 minutes on the final sigma instance on a 2019 Macbook pro. The actual dataset I'm working on is 75mm rows of the same data. The goal is simply to unify tables from 2014-2019, provided explicit ref tables for the numeric codes, and create new entities that can use to integrate other datasets (ie, US census tracts, state codes, etc.). Does the linked implementation do anything it should not?

wisnesky commented 4 years ago

You might try sigma_chase, although I’m not sure of the current state of it in open-source CQL. Sigma_chase is a version of sigma that does not construct a decision procedure and for that reason has better performance. Open-source CQL is due for a refresh of sigma_chase, but the update is only 90% complete (it’s not quite ready yet).

On Jul 20, 2020, at 7:17 AM, Tim Pierson notifications@github.com wrote:

I followed the sigma approach but am encountering quite poor performance. Here's an abridged gist https://gist.github.com/o1lo01ol1o/05c68446e1a8c4a89827383c7e6d736e The above was run on an abridged version of the 2018-2019 dataset: 10k rows of all varchar columns. ~4mb for the whole table. The CQL IDE allocates ~ 2.7gb of heap and has been running 20 minutes on the final sigma instance on a 2019 Macbook pro. The actual dataset I'm working on is 75mm rows of the same data. The goal is simply to unify tables from 2014-2019, provided explicit ref tables for the numeric codes, and create new entities that can use to integrate other datasets (ie, US census tracts, state codes, etc.). Does the linked implementation do anything it should not?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/CategoricalData/CQL/issues/46#issuecomment-661068131, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA2QKN6VS26WS4WV4BMGVBLR4RGYLANCNFSM4OWYD5EA.

o1lo01ol1o commented 4 years ago

Ah, good: instance computation total time: 3.1s. (Whereas sigma did not complete at all.). I'll now try on real workloads and let you know how it goes.

On Mon, Jul 20, 2020 at 7:16 PM Ryan Wisnesky notifications@github.com wrote:

You might try sigma_chase, although I’m not sure of the current state of it in open-source CQL. Sigma_chase is a version of sigma that does not construct a decision procedure and for that reason has better performance. Open-source CQL is due for a refresh of sigma_chase, but the update is only 90% complete (it’s not quite ready yet).

On Jul 20, 2020, at 7:17 AM, Tim Pierson notifications@github.com wrote:

I followed the sigma approach but am encountering quite poor performance. Here's an abridged gist < https://gist.github.com/o1lo01ol1o/05c68446e1a8c4a89827383c7e6d736e> The above was run on an abridged version of the 2018-2019 dataset: 10k rows of all varchar columns. ~4mb for the whole table. The CQL IDE allocates ~ 2.7gb of heap and has been running 20 minutes on the final sigma instance on a 2019 Macbook pro. The actual dataset I'm working on is 75mm rows of the same data. The goal is simply to unify tables from 2014-2019, provided explicit ref tables for the numeric codes, and create new entities that can use to integrate other datasets (ie, US census tracts, state codes, etc.). Does the linked implementation do anything it should not?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub < https://github.com/CategoricalData/CQL/issues/46#issuecomment-661068131>, or unsubscribe < https://github.com/notifications/unsubscribe-auth/AA2QKN6VS26WS4WV4BMGVBLR4RGYLANCNFSM4OWYD5EA .

— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/CategoricalData/CQL/issues/46#issuecomment-661252606, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB5DC5KCRS5WROZ6ZNLBZN3R4SCXLANCNFSM4OWYD5EA .

-- 328 Plymouth St Brooklyn, NY 11201

617 990 7876