sigma_chase performance

CategoricalData / CQL

Categorical Query Language IDE

http://categoricaldata.net

299 stars 22 forks source link

sigma_chase performance #52

Closed o1lo01ol1o closed 4 years ago

o1lo01ol1o commented 4 years ago

A follow up from #46 :

I updated the linked gist to include the coproduct of the two tables to which I'm trying to migrate.

Approx 60k rows between the two for this trial subset;
The sigma_chase version ran for 14 hours using a maximum of 30gb (!) of jvm heap before I terminated it ;
This is on: MBP 2019 i7 with 16gb RAM ; Mojave; JDK 12.

After deleting 54k rows from the above and running the program again, the process has been running for ~3 hours with up to 25gb of heap usage. It's currently processing the last sigma_chase on the forward mapping prior to the coproduct.

I did a JIT "warm-up" of a half dozen runs of the Constraints demo before starting both of the jobs and I restarted the CQL IDE in between.

Is there something wrong here or is performance on the commercial version better by 2-3 orders of magnitude?

wisnesky commented 4 years ago

Could the result of the chase actually be infinite? For example, that can happen if there is a cycle where entity X causes entity X’ to exist causes entity X’’ to exist causes… and unfortunately, this criteria is undecidable, although there are conservative approximations that are on the todo list.

That being said, I think the culprit may actually be coproduct. It functions by creating a decision procedure, like how sigma is working. The reason is that in general, the coproduct of two instances will not have a type algebra that is the coproduct of the input type algebras (e.g., when coproducting SQL DBs, you will not end up with two copies of NULL. However, coproduct will still be fast when the input instances have free type algebras, but maybe yours don’t due to the non-trivial typeside equations; I’m not sure). You might try sigma_chase along a mapping that implements a coproduct of tables instead, or even a quotient_query.

The chemistry example does use a coproduct on a 10gb data set, so I’m not really sure what is going on. You might try the version of the jarfile the author archived; maybe there’s been a regression.

On Jul 21, 2020, at 4:24 AM, Tim Pierson notifications@github.com wrote:

A follow up from #46 https://github.com/CategoricalData/CQL/issues/46 :

I updated the linked gist to include the coproduct of the two tables to which I'm trying to migrate.

Approx 60k rows between the two for this trial subset; The sigma_chase version ran for 14 hours using a maximum of 30gb (!) of jvm heap before I terminated it ; This is on: MBP 2019 i7 with 16gb RAM ; Mojave; JDK 12. After deleting 54k rows from the above and running the program again, the process has been running for ~3 hours with up to 25gb of heap usage. It's currently processing the last sigma_chase on the forward mapping prior to the coproduct.

I did a JIT "warm-up" of a half dozen runs of the Constraints demo before starting both of the jobs and I restarted the CQL IDE in between.

Is there something wrong here or is performance on the commercial version better by 2-3 orders of magnitude?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/CategoricalData/CQL/issues/52, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA2QKNYMWBG6TIRJBPGBOWTR4V3FLANCNFSM4PDPKVXA.

o1lo01ol1o commented 4 years ago

I've commented out the coproduct and the modify renaming schema and the 6k rows didn't terminate in the 2+ hours I let them run. The archived CQL.jar fared no better.

Could the result of the chase actually be infinite?

It would have to be the chase result of the mapping defined by this colimit schema, but I'm not sure how it could be infinite as I don't see any referential cycles.

Really all we're doing is pulling some attributes out into entities and saying some of those entities are equal between the two schemata. The only mapping that seems slightly non trivial is when we need to make a new entity Foo from repeated attribute columns: foo_1, foo_2, foo_3. But this should just create a linking table between the main records and the universe of Foos that occur.

wisnesky commented 4 years ago

A 6k row instance non-cyclic chase should be doable even with the naive algorithm; the crazy memory usage suggests that something is spinning out of control, since CQL should only keep one copy of every term in creates (so to generate eg 30gb requires lots of new “stuff”). Can you hook CQL up to the profiler at https://visualvm.github.io https://visualvm.github.io/ and try to figure out if e.g. CQL is entering a loop and/or where the memory usage is going? Alternatively I’m happy to do it myself if you can make available a test case.

On Jul 21, 2020, at 8:36 AM, Tim Pierson notifications@github.com wrote:

I've commented out the coproduct and the modify renaming schema and the 6k rows didn't terminate in the 2+ hours I let them run. The archived CQL.jar fared no better.

Could the result of the chase actually be infinite?

It would have to be the chase result of the mapping defined by this colimit schema https://gist.github.com/o1lo01ol1o/4c32806da119d97c01ee91278de41348, but I'm not sure how it could be infinite as I don't see any referential cycles.

Really all we're doing is pulling some attributes out into entities and saying some of those entities are equal between the two schemata. The only mapping that seems slightly non trivial is when we need to make a new entity Foo from repeated attribute columns: foo_1, foo_2, foo_3. But this should just create a linking table between the main records and the universe of Foos that occur.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/CategoricalData/CQL/issues/52#issuecomment-661934922, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA2QKN23SDXLHSEVZASHV6DR4WYXFANCNFSM4PDPKVXA.

o1lo01ol1o commented 4 years ago

the crazy memory usage suggests that something is spinning out of control

It might be the typesides. Just simply trying to import the stringly-typed source and map it to through the observation equations to an instance uses 110gb of heap for a 6.5gb table.

Here's ~6k of rows of and the coproduct cql discussed above.

https://www.dropbox.com/s/pw1o4upc7cdb9d5/TestCase.zip?dl=0

wisnesky commented 4 years ago

The typeside seems to be fine. I’ll try to run this through the profiler tomorrow (Weds). Does any subset of the input data complete? If not the error is probably not about scalability per se.

On Jul 22, 2020, at 2:13 AM, Tim Pierson notifications@github.com wrote:

the crazy memory usage suggests that something is spinning out of control

It might be the typesides. Just simply trying to import the stringly-typed source and map it to through the observation equations to an instance uses 110gb of heap for a 6.5gb table.

Here's ~6k of rows of and the coproduct cql discussed above.

https://www.dropbox.com/s/pw1o4upc7cdb9d5/TestCase.zip?dl=0 https://www.dropbox.com/s/pw1o4upc7cdb9d5/TestCase.zip?dl=0 — You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/CategoricalData/CQL/issues/52#issuecomment-662340862, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA2QKNYRU6FTR3D7QIH3IBLR42UVDANCNFSM4PDPKVXA.

o1lo01ol1o commented 4 years ago

instance J_2018_2019_Fwd = sigma_chase (getMapping SanitizedColimit_2014_2019 S_2018_2019_lar_with_ref) J_2018_2019_lar_with_ref

Will complete (though this is only 1k lines in the abridged data). The other Fwd chase does not complete (in reasonable time, anyway).

o1lo01ol1o commented 4 years ago

Digging a little deeper, perhaps there's been some user error on my part?

If I change the typeside function derivedRace to output a constant String:

derivedRace : Integer -> Varchar = " return 'test';"

and call it in the observation equations of my transformed schema (as I understood the #46 to describe):

observation_equations 
...
forall x. raceDerived(x) = derivedRace(raceAsAtt(x))
...

The resulting schema (in csv / postgres) looks correct, however, in the refRace table there are fks to the main table where there should yet both raceAsAtt and raceDerived are null for all values:

"id","raceAsAtt","raceDerived"
"3981",,
"3982",,
"3983",,
"3984",,

When I run the query for the refRace entity against the database, I do indeed get the universe of entities as Integers. Could the cql program as written be mangling something?

o1lo01ol1o commented 4 years ago

For the sake of clarity, the above comment refers to the instance created by the first mapping S to S’ as discussed in #46

o1lo01ol1o commented 4 years ago

Sorry for the deluge of info: changing to sigma on the above mapping throws some typeside conversion errors on asDouble; fixing them then generates ref tables with the correct values.

However, this is fairly slow as discussed and changing back to sigma_chase results in null valued tables again.

o1lo01ol1o commented 4 years ago

(And, again for clarity, the last 3 comments refer to this cql program, which is a subset of the test case I linked. https://gist.github.com/o1lo01ol1o/259373e9564d520928ad0671561f0745)

wisnesky commented 4 years ago

It’s certainly possible that eg a cascade of internal typeside errors is leading to the performance degradation as CQL has to try-catch each row on import, but regardless my instinct is that something is going wrong internally even in that case, and that this isn’t user error - there just shouldn’t been so many GB worth of stuff that can even be created, period, so I’m curious to see what it is. Updates to follow.

On Jul 22, 2020, at 4:44 AM, Tim Pierson notifications@github.com wrote:

(And, again for clarity, the last https://github.com/CategoricalData/CQL/issues/52#issuecomment-662393995 3 https://github.com/CategoricalData/CQL/issues/52#issuecomment-662396022 comments https://github.com/CategoricalData/CQL/issues/52#issuecomment-662402102 refer to this cql program, which is a subset of the test case I linked. https://gist.github.com/o1lo01ol1o/259373e9564d520928ad0671561f0745 https://gist.github.com/o1lo01ol1o/259373e9564d520928ad0671561f0745)

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/CategoricalData/CQL/issues/52#issuecomment-662406587, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA2QKN4VCM72Q5OUEGXIK63R43GK7ANCNFSM4PDPKVXA.

wisnesky commented 4 years ago

Follow-up: do you get any console output? When I run your file sans data on my development version, I see:

java.lang.RuntimeException: In javascript execution of _x (of class java.lang.String) cannot convert to Integer error: For input string: "_x" Possible fix: check the java_constants of the typeside for type conversion errors. at catdata.aql.AqlJs.parse(AqlJs.java:147) at catdata.aql.RawTerm.infer_good(RawTerm.java:302) at catdata.aql.RawTerm.infer_good(RawTerm.java:153) at catdata.aql.RawTerm.infer1x(RawTerm.java:353) at catdata.aql.exp.MapExpRaw.eval0(MapExpRaw.java:369)

Which is unexpected. But at the same time, it seems to run when pressing the run button. So I’m wondering if maybe there’s a mapping error induced by the inferencer and not caught by the type checker, that manifests as a bunch of silently caught errors inside the typeside, slowing everything down..

On Jul 22, 2020, at 10:08 AM, Ryan Wisnesky ryan@conexus.com wrote:

It’s certainly possible that eg a cascade of internal typeside errors is leading to the performance degradation as CQL has to try-catch each row on import, but regardless my instinct is that something is going wrong internally even in that case, and that this isn’t user error - there just shouldn’t been so many GB worth of stuff that can even be created, period, so I’m curious to see what it is. Updates to follow.

On Jul 22, 2020, at 4:44 AM, Tim Pierson <notifications@github.com mailto:notifications@github.com> wrote:

(And, again for clarity, the last https://github.com/CategoricalData/CQL/issues/52#issuecomment-662393995 3 https://github.com/CategoricalData/CQL/issues/52#issuecomment-662396022 comments https://github.com/CategoricalData/CQL/issues/52#issuecomment-662402102 refer to this cql program, which is a subset of the test case I linked. https://gist.github.com/o1lo01ol1o/259373e9564d520928ad0671561f0745 https://gist.github.com/o1lo01ol1o/259373e9564d520928ad0671561f0745)

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/CategoricalData/CQL/issues/52#issuecomment-662406587, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA2QKN4VCM72Q5OUEGXIK63R43GK7ANCNFSM4PDPKVXA.

wisnesky commented 4 years ago

Follow-up: seems to be the mapping for (LoanApplicationRegistrar, derivedSex) in F_2018_2019_lar that it’s struggling with.

On Jul 22, 2020, at 2:05 PM, Ryan Wisnesky ryan@conexus.com wrote:

Follow-up: do you get any console output? When I run your file sans data on my development version, I see:

java.lang.RuntimeException: In javascript execution of _x (of class java.lang.String) cannot convert to Integer error: For input string: "_x" Possible fix: check the java_constants of the typeside for type conversion errors. at catdata.aql.AqlJs.parse(AqlJs.java:147) at catdata.aql.RawTerm.infer_good(RawTerm.java:302) at catdata.aql.RawTerm.infer_good(RawTerm.java:153) at catdata.aql.RawTerm.infer1x(RawTerm.java:353) at catdata.aql.exp.MapExpRaw.eval0(MapExpRaw.java:369)

Which is unexpected. But at the same time, it seems to run when pressing the run button. So I’m wondering if maybe there’s a mapping error induced by the inferencer and not caught by the type checker, that manifests as a bunch of silently caught errors inside the typeside, slowing everything down..

On Jul 22, 2020, at 10:08 AM, Ryan Wisnesky <ryan@conexus.com mailto:ryan@conexus.com> wrote:

It’s certainly possible that eg a cascade of internal typeside errors is leading to the performance degradation as CQL has to try-catch each row on import, but regardless my instinct is that something is going wrong internally even in that case, and that this isn’t user error - there just shouldn’t been so many GB worth of stuff that can even be created, period, so I’m curious to see what it is. Updates to follow.

On Jul 22, 2020, at 4:44 AM, Tim Pierson <notifications@github.com mailto:notifications@github.com> wrote:

(And, again for clarity, the last https://github.com/CategoricalData/CQL/issues/52#issuecomment-662393995 3 https://github.com/CategoricalData/CQL/issues/52#issuecomment-662396022 comments https://github.com/CategoricalData/CQL/issues/52#issuecomment-662402102 refer to this cql program, which is a subset of the test case I linked. https://gist.github.com/o1lo01ol1o/259373e9564d520928ad0671561f0745 https://gist.github.com/o1lo01ol1o/259373e9564d520928ad0671561f0745)

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/CategoricalData/CQL/issues/52#issuecomment-662406587, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA2QKN4VCM72Q5OUEGXIK63R43GK7ANCNFSM4PDPKVXA.

wisnesky commented 4 years ago

Follow-up2: derivedSex appears as both a typeside symbol and attribute name. It’s possible the type inferencer is choosing the wrong one.

On Jul 22, 2020, at 2:11 PM, Ryan Wisnesky ryan@conexus.com wrote:

Follow-up: seems to be the mapping for (LoanApplicationRegistrar, derivedSex) in F_2018_2019_lar that it’s struggling with.

On Jul 22, 2020, at 2:05 PM, Ryan Wisnesky <ryan@conexus.com mailto:ryan@conexus.com> wrote:

Follow-up: do you get any console output? When I run your file sans data on my development version, I see:

java.lang.RuntimeException: In javascript execution of _x (of class java.lang.String) cannot convert to Integer error: For input string: "_x" Possible fix: check the java_constants of the typeside for type conversion errors. at catdata.aql.AqlJs.parse(AqlJs.java:147) at catdata.aql.RawTerm.infer_good(RawTerm.java:302) at catdata.aql.RawTerm.infer_good(RawTerm.java:153) at catdata.aql.RawTerm.infer1x(RawTerm.java:353) at catdata.aql.exp.MapExpRaw.eval0(MapExpRaw.java:369)

Which is unexpected. But at the same time, it seems to run when pressing the run button. So I’m wondering if maybe there’s a mapping error induced by the inferencer and not caught by the type checker, that manifests as a bunch of silently caught errors inside the typeside, slowing everything down..

On Jul 22, 2020, at 10:08 AM, Ryan Wisnesky <ryan@conexus.com mailto:ryan@conexus.com> wrote:

It’s certainly possible that eg a cascade of internal typeside errors is leading to the performance degradation as CQL has to try-catch each row on import, but regardless my instinct is that something is going wrong internally even in that case, and that this isn’t user error - there just shouldn’t been so many GB worth of stuff that can even be created, period, so I’m curious to see what it is. Updates to follow.

On Jul 22, 2020, at 4:44 AM, Tim Pierson <notifications@github.com mailto:notifications@github.com> wrote:

(And, again for clarity, the last https://github.com/CategoricalData/CQL/issues/52#issuecomment-662393995 3 https://github.com/CategoricalData/CQL/issues/52#issuecomment-662396022 comments https://github.com/CategoricalData/CQL/issues/52#issuecomment-662402102 refer to this cql program, which is a subset of the test case I linked. https://gist.github.com/o1lo01ol1o/259373e9564d520928ad0671561f0745 https://gist.github.com/o1lo01ol1o/259373e9564d520928ad0671561f0745)

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/CategoricalData/CQL/issues/52#issuecomment-662406587, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA2QKN4VCM72Q5OUEGXIK63R43GK7ANCNFSM4PDPKVXA.

o1lo01ol1o commented 4 years ago

do you get any console output

Yes, I also saw that but wasn't sure to make of it since no errors were thrown.

Follow-up: seems to be the mapping for (LoanApplicationRegistrar, derivedSex) in F_2018_2019_lar that it’s struggling with. derivedSex appears as both a typeside symbol and attribute name.

Intersting. I wasn't too clear on how names were resolved so I trusted that everything was fine since it compiled.

wisnesky commented 4 years ago

Confirmed: the CQL type inferencer was being overzealous, but the error appears to be benign because ultimately the correct type assignment is still found. I’ve fixed that error and will continue examining resource usage.

On Jul 22, 2020, at 2:25 PM, Tim Pierson notifications@github.com wrote:

do you get any console output

Yes, I also saw that but wasn't sure to make of it since no errors were thrown.

Follow-up: seems to be the mapping for (LoanApplicationRegistrar, derivedSex) in F_2018_2019_lar that it’s struggling with. derivedSex appears as both a typeside symbol and attribute name.

Intersting. I wasn't too clear on how names were resolved so I trusted that everything was fine since it compiled.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/CategoricalData/CQL/issues/52#issuecomment-662705133, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA2QKN4N323M7NM3EMWMEDDR45KMVANCNFSM4PDPKVXA.

wisnesky commented 4 years ago

I've got the sample data loaded, but something is still not quite right, as I can't seem to access the data over JDBC (see attached). Any thoughts on what's going on?

o1lo01ol1o commented 4 years ago

huh. I don't know why that command isn't selecting. Are you able to, e.g., create a temp table from cql?

FWIW, the queries in the instances select from a table called lar20182019 in public and 4 tables in another schema whose name escapes me presently. You'll have to put those in the right place or alter the from clauses in each instance query

On Thu, Jul 23, 2020 at 9:49 PM Ryan Wisnesky notifications@github.com wrote:

[image: Screen Shot 2020-07-23 at 1 44 30 PM] https://user-images.githubusercontent.com/3474743/88337235-37934280-cceb-11ea-9b03-327ed75481f8.png I've got the sample data loaded, but something is still not quite right, as I can't seem to access the data over JDBC (see attached). Any thoughts on what's going on?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/CategoricalData/CQL/issues/52#issuecomment-663225280, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB5DC5KMJ2D54DJO65EYJW3R5CO7FANCNFSM4PDPKVXA .

-- 328 Plymouth St Brooklyn, NY 11201

617 990 7876

o1lo01ol1o commented 4 years ago

I'm using JDK 12, postgresql 12, and postgresql-jdbc 42.2.5, if that helps.

On Thu, Jul 23, 2020 at 10:20 PM tim pierson tim.pierson@gmail.com wrote:

huh. I don't know why that command isn't selecting. Are you able to, e.g., create a temp table from cql?

FWIW, the queries in the instances select from a table called lar20182019 in public and 4 tables in another schema whose name escapes me presently. You'll have to put those in the right place or alter the from clauses in each instance query

On Thu, Jul 23, 2020 at 9:49 PM Ryan Wisnesky notifications@github.com wrote:

[image: Screen Shot 2020-07-23 at 1 44 30 PM] https://user-images.githubusercontent.com/3474743/88337235-37934280-cceb-11ea-9b03-327ed75481f8.png I've got the sample data loaded, but something is still not quite right, as I can't seem to access the data over JDBC (see attached). Any thoughts on what's going on?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/CategoricalData/CQL/issues/52#issuecomment-663225280, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB5DC5KMJ2D54DJO65EYJW3R5CO7FANCNFSM4PDPKVXA .

-- 328 Plymouth St Brooklyn, NY 11201

617 990 7876

-- 328 Plymouth St Brooklyn, NY 11201

617 990 7876

o1lo01ol1o commented 4 years ago

@wisnesky Any luck on this?

wisnesky commented 4 years ago

I got a bit delayed looking into the Postgres over JDBC connection issue in advance of an upcoming product launch but I will try to return to this today.

On Jul 29, 2020, at 1:56 AM, Tim Pierson notifications@github.com wrote:

@wisnesky https://github.com/wisnesky Any luck on this?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/CategoricalData/CQL/issues/52#issuecomment-665533453, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA2QKN5AHPW6SBKD73GYV63R57P4XANCNFSM4PDPKVXA.

o1lo01ol1o commented 4 years ago

Np; let me know if I can be helpful.

Sent from my iPhone

On Jul 29, 2020, at 8:32 PM, Ryan Wisnesky notifications@github.com wrote:

I got a bit delayed looking into the Postgres over JDBC connection issue in advance of an upcoming product launch but I will try to return to this today.

On Jul 29, 2020, at 1:56 AM, Tim Pierson notifications@github.com wrote:

@wisnesky https://github.com/wisnesky Any luck on this?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/CategoricalData/CQL/issues/52#issuecomment-665533453, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA2QKN5AHPW6SBKD73GYV63R57P4XANCNFSM4PDPKVXA.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe.

wisnesky commented 4 years ago

Running

SELECT * FROM pg_catalog.pg_tables WHERE schemaname != 'pg_catalog' AND schemaname != 'information_schema';

From within the IDE returns an empty result, so it’s looking like jdbc driver must be pointing to the wrong spot. Hopefully it won’t be long now...

On Jul 29, 2020, at 12:37 PM, Tim Pierson notifications@github.com wrote:

Np; let me know if I can be helpful.

Sent from my iPhone

On Jul 29, 2020, at 8:32 PM, Ryan Wisnesky notifications@github.com wrote:

I got a bit delayed looking into the Postgres over JDBC connection issue in advance of an upcoming product launch but I will try to return to this today.

On Jul 29, 2020, at 1:56 AM, Tim Pierson notifications@github.com wrote:

@wisnesky https://github.com/wisnesky Any luck on this?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/CategoricalData/CQL/issues/52#issuecomment-665533453, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA2QKN5AHPW6SBKD73GYV63R57P4XANCNFSM4PDPKVXA.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/CategoricalData/CQL/issues/52#issuecomment-665861861, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA2QKN3KWBS5K5KKRH4ZTDTR6B3BHANCNFSM4PDPKVXA.

wisnesky commented 4 years ago

After running CQL through the profiler, the bottleneck is clear: it is construction of the type algebra for the sigma’d instance. In general, this will be a bottleneck, but in this particular case I’m fairly certain one of four things will work:

map the other direction and use delta (runs in constant time)
map the same direction but use a query. I tried to convert the mapping to a query directly using CQL’s toCoQuery command, but the resulting query was not obviously domain independent, nor did its “SQL unfolding” converge, so this may need to be done by hand.
- our upcoming new sigma chase algorithm should be significantly faster than the current sigma chase algorithm on this case, as it will also bypass type algebra construction
I can implement a special case of sigma for when the schema mapping is a ‘discrete op fibration’. Sigmas along such mappings do not alter type algebras and can be computed with unions quickly; unfortunately, in general, this condition is undecidable (although perhaps quite easy to establish for your mapping)

On Aug 3, 2020, at 8:22 PM, Ryan Wisnesky ryan@conexus.com wrote:

Running

SELECT * FROM pg_catalog.pg_tables WHERE schemaname != 'pg_catalog' AND schemaname != 'information_schema';

From within the IDE returns an empty result, so it’s looking like jdbc driver must be pointing to the wrong spot. Hopefully it won’t be long now...

On Jul 29, 2020, at 12:37 PM, Tim Pierson <notifications@github.com mailto:notifications@github.com> wrote:

Np; let me know if I can be helpful.

Sent from my iPhone

On Jul 29, 2020, at 8:32 PM, Ryan Wisnesky <notifications@github.com mailto:notifications@github.com> wrote:

I got a bit delayed looking into the Postgres over JDBC connection issue in advance of an upcoming product launch but I will try to return to this today.

On Jul 29, 2020, at 1:56 AM, Tim Pierson <notifications@github.com mailto:notifications@github.com> wrote:

@wisnesky <https://github.com/wisnesky https://github.com/wisnesky> Any luck on this?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <https://github.com/CategoricalData/CQL/issues/52#issuecomment-665533453 https://github.com/CategoricalData/CQL/issues/52#issuecomment-665533453>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AA2QKN5AHPW6SBKD73GYV63R57P4XANCNFSM4PDPKVXA https://github.com/notifications/unsubscribe-auth/AA2QKN5AHPW6SBKD73GYV63R57P4XANCNFSM4PDPKVXA>.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/CategoricalData/CQL/issues/52#issuecomment-665861861, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA2QKN3KWBS5K5KKRH4ZTDTR6B3BHANCNFSM4PDPKVXA.

o1lo01ol1o commented 4 years ago

Excellent, thank you. I've swapped the arrows and used delta and the 6k subset completes in trivial time. The full 15mm rows on the other hand still seem to require 60gb of heap, but I suppose that's to be expected given the size. I'll let you know how long it takes to terminate. (Hopefully.)

On Tue, Aug 4, 2020 at 8:43 AM Ryan Wisnesky notifications@github.com wrote:

After running CQL through the profiler, the bottleneck is clear: it is construction of the type algebra for the sigma’d instance. In general, this will be a bottleneck, but in this particular case I’m fairly certain one of four things will work:

map the other direction and use delta (runs in constant time)

map the same direction but use a query. I tried to convert the mapping to a query directly using CQL’s toCoQuery command, but the resulting query was not obviously domain independent, nor did its “SQL unfolding” converge, so this may need to be done by hand.

our upcoming new sigma chase algorithm should be significantly faster than the current sigma chase algorithm on this case, as it will also bypass type algebra construction

I can implement a special case of sigma for when the schema mapping is a ‘discrete op fibration’. Sigmas along such mappings do not alter type algebras and can be computed with unions quickly; unfortunately, in general, this condition is undecidable (although perhaps quite easy to establish for your mapping)

On Aug 3, 2020, at 8:22 PM, Ryan Wisnesky ryan@conexus.com wrote:

Running

SELECT * FROM pg_catalog.pg_tables WHERE schemaname != 'pg_catalog' AND schemaname != 'information_schema';

From within the IDE returns an empty result, so it’s looking like jdbc driver must be pointing to the wrong spot. Hopefully it won’t be long now...

On Jul 29, 2020, at 12:37 PM, Tim Pierson <notifications@github.com mailto:notifications@github.com> wrote:

Np; let me know if I can be helpful.

Sent from my iPhone

On Jul 29, 2020, at 8:32 PM, Ryan Wisnesky <notifications@github.com mailto:notifications@github.com> wrote:

I got a bit delayed looking into the Postgres over JDBC connection issue in advance of an upcoming product launch but I will try to return to this today.

On Jul 29, 2020, at 1:56 AM, Tim Pierson <notifications@github.com mailto:notifications@github.com> wrote:

@wisnesky <https://github.com/wisnesky https://github.com/wisnesky> Any luck on this?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub < https://github.com/CategoricalData/CQL/issues/52#issuecomment-665533453 < https://github.com/CategoricalData/CQL/issues/52#issuecomment-665533453>>, or unsubscribe < https://github.com/notifications/unsubscribe-auth/AA2QKN5AHPW6SBKD73GYV63R57P4XANCNFSM4PDPKVXA < https://github.com/notifications/unsubscribe-auth/AA2QKN5AHPW6SBKD73GYV63R57P4XANCNFSM4PDPKVXA .

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or unsubscribe. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub < https://github.com/CategoricalData/CQL/issues/52#issuecomment-665861861>, or unsubscribe < https://github.com/notifications/unsubscribe-auth/AA2QKN3KWBS5K5KKRH4ZTDTR6B3BHANCNFSM4PDPKVXA .

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/CategoricalData/CQL/issues/52#issuecomment-668437437, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB5DC5JFEHAKHEMJNZLST6TR663ZRANCNFSM4PDPKVXA .

-- 328 Plymouth St Brooklyn, NY 11201

617 990 7876

o1lo01ol1o commented 4 years ago

This ended up requiring more work to get going: the command line interface would crash after a few hours (nixos / jdk 12) with an error that looked like a segfault (not OOM). I moved the process to a windows 10 machine with 128gb of ram and, after several unsuccessful attempts with the IDE via RDP (OOM; pagefile woes with the system-managed files), I was able to get the process to complete using 6 threads. This appeared to take approx. 72 hours and required >368gb of swap space in addition to the 128gb of ram to successfully export. Unfortunately, I was unable to RDP back in after about 60 hours, so I couldn't log the output of the process, only that the expected number of rows were in the database on reboot.

Given that there was only the single delta in this CQL program, in the future what can I do to keep the memory footprint manageable? I'm guessing from your previous analysis that all the overhead comes from the numerous observed variables and their types / type algebras?

(I'm retrying the nixos machine with lots of swapspace incase the previous errors were just OOM in disguise.)

wisnesky commented 4 years ago

The performance issues that you’re seeing have been observed before; there’s a new version of CQL containing a year’s worth of performance improvements (such as constant time delta) stuck on my laptop because of errors in one particular component (the sigma_chase primitive). Needing 368gb of swap space is of course not needed algorithmically in general, but happens because CQL is literally running exponential algorithms, so it’s quite cool to see that everything does actually work if given enough firepower :-). What’s happening is Knuth-Bendix completion on a database that has been converted into a set of equations.

Anyway, I think the answer to your question is that I should push the new version of CQL even without the updated sigma_chase, so that you can see if it helps, then I should vow to be more disciplined about releases in the future, and then I should add the special-case sigma operation we talked about before as well. I was able to catch up somewhat over the weekend, so I can probably get to that over the next few days.

On Aug 10, 2020, at 3:08 PM, Tim Pierson notifications@github.com wrote:

This ended up requiring more work to get going: the command line interface would crash after a few hours (nixos / jdk 12) with an error that looked like a segfault (not OOM). I moved the process to a windows 10 machine with 128gb of ram and, after several unsuccessful attempts with the IDE via RDP (OOM; pagefile woes with the system-managed files), I was able to get the process to complete using 6 threads. This appeared to take approx. 72 hours and required >368gb of swap space in addition to the 128gb of ram to successfully export. Unfortunately, I was unable to RDP back in after about 60 hours, so I couldn't log the output of the process, only that the expected number of rows were in the database on reboot.

Given that there was only the single delta in this CQL program, in the future what can I do to keep the memory footprint manageable? I'm guessing from your previous analysis that all the overhead comes from the numerous observed variables and their types / type algebras?

(I'm retrying the nixos machine with lots of swapspace incase the previous errors were just OOM in disguise.)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/CategoricalData/CQL/issues/52#issuecomment-671614298, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA2QKN6P72MJEHNB3S75NHDSABVVJANCNFSM4PDPKVXA.

wisnesky commented 4 years ago

I've placed a new jarfile here: http://categoricaldata.net/cql_experimental.jar . It requires java 14, its chase function is broken, sigma will always be slow and I haven't added any of the discussed new functionality to it. However, it has a lot of latent changes that should improve performance in general, as well as some fixes to eg jdbc import.

o1lo01ol1o commented 4 years ago

Ok, thanks, I'll use it for the next delta mapping I need to process. I think this issue has been sufficiently explored now. :)