Closed o1lo01ol1o closed 4 years ago
Could the result of the chase actually be infinite? For example, that can happen if there is a cycle where entity X causes entity X’ to exist, which causes entity X’’ to exist, which causes… and unfortunately, this criterion is undecidable, although there are conservative approximations that are on the todo list.
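To illustrate the kind of cycle meant here, a minimal Python sketch (not CQL's actual implementation; the constraint and names are invented): a naive chase on an existential constraint of the form "every entity has a child" never reaches a fixpoint, so each round invents a fresh labelled null forever unless the number of rounds is capped.

```python
# Naive chase on a single cyclic existential constraint:
# "forall x in Entity, there exists y in Entity with parent(y) = x".
# Each round fires the constraint on every entity that lacks a child,
# creating a fresh labelled null -- so the instance grows without bound.

def chase_round(entities, parent):
    new = []
    for e in entities:
        if e not in parent.values():        # e has no child yet
            fresh = f"null_{len(entities) + len(new)}"
            new.append(fresh)
            parent[fresh] = e               # fresh child points back at e
    return new

entities = ["x0"]
parent = {}
for round_no in range(5):                   # cap rounds; a fixpoint never arrives
    created = chase_round(entities, parent)
    if not created:
        break
    entities.extend(created)

print(len(entities))                        # 6: one new labelled null per round
```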
That being said, I think the culprit may actually be coproduct. It functions by creating a decision procedure, much as sigma does. The reason is that in general, the coproduct of two instances will not have a type algebra that is the coproduct of the input type algebras (e.g., when coproducting SQL DBs, you will not end up with two copies of NULL). However, coproduct will still be fast when the input instances have free type algebras, but maybe yours don’t due to the non-trivial typeside equations; I’m not sure. You might try sigma_chase along a mapping that implements a coproduct of tables instead, or even a quotient_query.
The chemistry example does use a coproduct on a 10gb data set, so I’m not really sure what is going on. You might try the version of the jarfile the author archived; maybe there’s been a regression.
On Jul 21, 2020, at 4:24 AM, Tim Pierson notifications@github.com wrote:
A follow-up from #46 (https://github.com/CategoricalData/CQL/issues/46):
I updated the linked gist to include the coproduct of the two tables to which I'm trying to migrate.
Approx 60k rows between the two for this trial subset:

- The `sigma_chase` version ran for 14 hours using a maximum of 30gb (!) of jvm heap before I terminated it.
- This is on: MBP 2019 i7 with 16gb RAM; Mojave; JDK 12.
- After deleting 54k rows from the above and running the program again, the process has been running for ~3 hours with up to 25gb of heap usage. It's currently processing the last `sigma_chase` on the forward mapping prior to the coproduct.
I did a JIT "warm-up" of a half dozen runs of the Constraints demo before starting both of the jobs and I restarted the CQL IDE in between.
Is there something wrong here or is performance on the commercial version better by 2-3 orders of magnitude?
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/CategoricalData/CQL/issues/52, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA2QKNYMWBG6TIRJBPGBOWTR4V3FLANCNFSM4PDPKVXA.
I've commented out the coproduct and the modify renaming schema, and the 6k rows didn't terminate in the 2+ hours I let them run. The archived CQL.jar fared no better.

> Could the result of the chase actually be infinite?

It would have to be the chase result of the mapping defined by this colimit schema (https://gist.github.com/o1lo01ol1o/4c32806da119d97c01ee91278de41348), but I'm not sure how it could be infinite, as I don't see any referential cycles. Really all we're doing is pulling some attributes out into entities and saying some of those entities are equal between the two schemata. The only mapping that seems slightly non-trivial is when we need to make a new entity `Foo` from repeated attribute columns: `foo_1`, `foo_2`, `foo_3`. But this should just create a linking table between the main records and the universe of `Foo`s that occur.
A 6k-row instance non-cyclic chase should be doable even with the naive algorithm; the crazy memory usage suggests that something is spinning out of control, since CQL should only keep one copy of every term it creates (so generating e.g. 30gb requires lots of new “stuff”). Can you hook CQL up to the profiler at https://visualvm.github.io/ and try to figure out if e.g. CQL is entering a loop and/or where the memory usage is going? Alternatively, I’m happy to do it myself if you can make a test case available.
> the crazy memory usage suggests that something is spinning out of control

It might be the typesides. Just simply trying to import the stringly-typed source and map it through the observation equations to an instance uses 110gb of heap for a 6.5gb table. Here's ~6k rows and the coproduct cql discussed above: https://www.dropbox.com/s/pw1o4upc7cdb9d5/TestCase.zip?dl=0

The typeside seems to be fine. I’ll try to run this through the profiler tomorrow (Weds). Does any subset of the input data complete? If not, the error is probably not about scalability per se.
```
instance J_2018_2019_Fwd = sigma_chase (getMapping SanitizedColimit_2014_2019 S_2018_2019_lar_with_ref) J_2018_2019_lar_with_ref
```

will complete (though this is only 1k lines in the abridged data). The other `Fwd` chase does not complete (in reasonable time, anyway).
Digging a little deeper, perhaps there's been some user error on my part? If I change the typeside function `derivedRace` to output a constant String:

```
derivedRace : Integer -> Varchar = "return 'test';"
```

and call it in the observation equations of my transformed schema (as I understood #46 to describe):

```
observation_equations
  ...
  forall x. raceDerived(x) = derivedRace(raceAsAtt(x))
  ...
```

The resulting schema (in csv / postgres) looks correct; however, in the `refRace` table there are fks to the main table as there should be, yet both `raceAsAtt` and `raceDerived` are null for all values:

```
"id","raceAsAtt","raceDerived"
"3981",,
"3982",,
"3983",,
"3984",,
```

When I run the query for the `refRace` entity against the database, I do indeed get the universe of entities as `Integer`s. Could the cql program as written be mangling something?
For the sake of clarity, the above comment refers to the instance created by the first mapping S to S’ as discussed in #46.
Sorry for the deluge of info: changing to `sigma` on the above mapping throws some typeside conversion errors on `asDouble`; fixing them then generates `ref` tables with the correct values. However, this is fairly slow as discussed, and changing back to `sigma_chase` results in null-valued tables again.
(And, again for clarity, the last 3 comments refer to this cql program, which is a subset of the test case I linked. https://gist.github.com/o1lo01ol1o/259373e9564d520928ad0671561f0745)
It’s certainly possible that e.g. a cascade of internal typeside errors is leading to the performance degradation, as CQL has to try-catch each row on import, but regardless my instinct is that something is going wrong internally even in that case, and that this isn’t user error - there just shouldn’t be so many GB worth of stuff that can even be created, period, so I’m curious to see what it is. Updates to follow.
Follow-up: do you get any console output? When I run your file sans data on my development version, I see:
```
java.lang.RuntimeException: In javascript execution of _x (of class java.lang.String) cannot convert to Integer error: For input string: "_x"
Possible fix: check the java_constants of the typeside for type conversion errors.
	at catdata.aql.AqlJs.parse(AqlJs.java:147)
	at catdata.aql.RawTerm.infer_good(RawTerm.java:302)
	at catdata.aql.RawTerm.infer_good(RawTerm.java:153)
	at catdata.aql.RawTerm.infer1x(RawTerm.java:353)
	at catdata.aql.exp.MapExpRaw.eval0(MapExpRaw.java:369)
```

Which is unexpected. But at the same time, it seems to run when pressing the run button. So I’m wondering if maybe there’s a mapping error induced by the inferencer and not caught by the type checker, that manifests as a bunch of silently caught errors inside the typeside, slowing everything down.
Follow-up: seems to be the mapping for (LoanApplicationRegistrar, derivedSex) in F_2018_2019_lar that it’s struggling with.
Follow-up2: derivedSex appears as both a typeside symbol and attribute name. It’s possible the type inferencer is choosing the wrong one.
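The kind of ambiguity described, sketched in Python (illustrative only; the namespaces and lookup order are invented, not CQL's actual resolution rules): when one name lives in two namespaces, a resolver that consults them in a fixed order can silently pick the wrong one while the program still type-checks.

```python
# Name resolution when a symbol exists both as a typeside function and
# as an attribute: a lookup that consults namespaces in a fixed order
# can silently pick the wrong one.

typeside_functions = {"derivedSex": "Integer -> Varchar"}
attributes = {"derivedSex": "LoanApplicationRegistrar -> Varchar"}

def resolve(name, prefer="typeside"):
    order = ([typeside_functions, attributes] if prefer == "typeside"
             else [attributes, typeside_functions])
    for ns in order:
        if name in ns:
            return ns[name]
    raise KeyError(name)

print(resolve("derivedSex"))                      # Integer -> Varchar
print(resolve("derivedSex", prefer="attribute"))  # LoanApplicationRegistrar -> Varchar
```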
> do you get any console output

Yes, I also saw that but wasn't sure what to make of it since no errors were thrown.

> Follow-up: seems to be the mapping for (LoanApplicationRegistrar, derivedSex) in F_2018_2019_lar that it’s struggling with. derivedSex appears as both a typeside symbol and attribute name.

Interesting. I wasn't too clear on how names were resolved, so I trusted that everything was fine since it compiled.
Confirmed: the CQL type inferencer was being overzealous, but the error appears to be benign because ultimately the correct type assignment is still found. I’ve fixed that error and will continue examining resource usage.
I've got the sample data loaded, but something is still not quite right, as I can't seem to access the data over JDBC (see attached screenshot: https://user-images.githubusercontent.com/3474743/88337235-37934280-cceb-11ea-9b03-327ed75481f8.png). Any thoughts on what's going on?
huh. I don't know why that command isn't selecting. Are you able to, e.g., create a temp table from `cql`? FWIW, the queries in the instances select from a table called `lar20182019` in `public` and 4 tables in another schema whose name escapes me presently. You'll have to put those in the right place or alter the `from` clauses in each instance query.
I'm using JDK 12, postgresql 12, and postgresql-jdbc 42.2.5, if that helps.
@wisnesky Any luck on this?
I got a bit delayed looking into the Postgres over JDBC connection issue in advance of an upcoming product launch but I will try to return to this today.
Np; let me know if I can be helpful.
Running

```sql
SELECT * FROM pg_catalog.pg_tables WHERE schemaname != 'pg_catalog' AND schemaname != 'information_schema';
```

from within the IDE returns an empty result, so it’s looking like the jdbc driver must be pointing to the wrong spot. Hopefully it won’t be long now...
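If the tables do exist and the empty result is a search-path problem, two things worth checking (hypothetical connection string; substitute your own host and database): the `currentSchema` parameter of the PostgreSQL JDBC driver, and the session `search_path`:

```
jdbc:postgresql://localhost:5432/mydb?currentSchema=public
```

```sql
SET search_path TO public, my_other_schema;
```

Either one controls which schemas unqualified table names resolve against over the connection.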
After running CQL through the profiler, the bottleneck is clear: it is construction of the type algebra for the sigma’d instance. In general, this will be a bottleneck, but in this particular case I’m fairly certain one of four things will work:

- map the other direction and use delta (runs in constant time)
- map the same direction but use a query. I tried to convert the mapping to a query directly using CQL’s toCoQuery command, but the resulting query was not obviously domain independent, nor did its “SQL unfolding” converge, so this may need to be done by hand.
- our upcoming new sigma chase algorithm should be significantly faster than the current sigma chase algorithm on this case, as it will also bypass type algebra construction
- I can implement a special case of sigma for when the schema mapping is a ‘discrete op fibration’. Sigmas along such mappings do not alter type algebras and can be computed quickly with unions; unfortunately, in general, this condition is undecidable (although it is perhaps quite easy to establish for your mapping).

Excellent, thank you. I've swapped the arrows and used `delta`, and the 6k subset completes in trivial time. The full 15mm rows, on the other hand, still seem to require 60gb of heap, but I suppose that's to be expected given the size. I'll let you know how long it takes to terminate. (Hopefully.)
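Why swapping the arrows and using delta helps, in a toy Python sketch (the schema, mapping, and data are invented; this shows the general shape, not CQL's implementation): delta along a mapping F just reads each target row back through F, creating no new elements, whereas sigma must invent fresh elements and then reason about their equality, which is where the type-algebra construction cost comes from.

```python
# delta along a schema mapping: every source-entity row is obtained by
# *renaming* columns of an existing target row. No fresh elements, no
# equational reasoning -- just a linear pass over the data.

# mapping F : Source -> Target, expressed as column renamings
F = {"person": {"name": "full_name", "dept": "department"}}

target = {"person": [
    {"full_name": "Ada", "department": "Math"},
    {"full_name": "Alan", "department": "CS"},
]}

def delta(mapping, inst):
    out = {}
    for entity, cols in mapping.items():
        out[entity] = [
            {src_col: row[tgt_col] for src_col, tgt_col in cols.items()}
            for row in inst[entity]
        ]
    return out

print(delta(F, target))
```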
This ended up requiring more work to get going: the command line interface would crash after a few hours (nixos / jdk 12) with an error that looked like a segfault (not OOM). I moved the process to a windows 10 machine with 128gb of ram and, after several unsuccessful attempts with the IDE via RDP (OOM; pagefile woes with the system-managed files), I was able to get the process to complete using 6 threads. This appeared to take approx. 72 hours and required >368gb of swap space in addition to the 128gb of ram to successfully export. Unfortunately, I was unable to RDP back in after about 60 hours, so I couldn't log the output of the process, only that the expected number of rows were in the database on reboot.
Given that there was only the single `delta` in this CQL program, in the future what can I do to keep the memory footprint manageable? I'm guessing from your previous analysis that all the overhead comes from the numerous observed variables and their types / type algebras?

(I'm retrying the nixos machine with lots of swap space in case the previous errors were just OOM in disguise.)
The performance issues that you’re seeing have been observed before; there’s a new version of CQL containing a year’s worth of performance improvements (such as constant-time delta) stuck on my laptop because of errors in one particular component (the sigma_chase primitive). 368gb of swap space is of course not required algorithmically in general, but happens because CQL is literally running exponential algorithms, so it’s quite cool to see that everything does actually work if given enough firepower :-). What’s happening is Knuth-Bendix completion on a database that has been converted into a set of equations.
Anyway, I think the answer to your question is that I should push the new version of CQL even without the updated sigma_chase, so that you can see if it helps, then I should vow to be more disciplined about releases in the future, and then I should add the special-case sigma operation we talked about before as well. I was able to catch up somewhat over the weekend, so I can probably get to that over the next few days.
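For intuition on "a database converted into a set of equations", a toy Python sketch (illustrative only; the table and terms are invented): each cell of a table becomes a ground equation like `name(r1) = 'Ada'`, and a union-find computes the equality closure those equations induce. Knuth-Bendix completion generalizes this to non-ground equations, where the rule set, and hence memory, can grow exponentially.

```python
# Ground equational reasoning over a tiny "database as equations".
# Each fact att(row) = value is an equation; a union-find answers
# equality queries over the closure of those equations.

parent = {}

def find(t):
    parent.setdefault(t, t)
    while parent[t] != t:
        parent[t] = parent[parent[t]]   # path halving
        t = parent[t]
    return t

def union(a, b):
    parent[find(a)] = find(b)

# "rows" of a table, encoded as equations between terms
equations = [
    ("name(r1)", "'Ada'"),
    ("name(r2)", "'Ada'"),     # r1 and r2 agree on name
    ("dept(r1)", "dept(r2)"),  # stated equal directly
]
for lhs, rhs in equations:
    union(lhs, rhs)

print(find("name(r1)") == find("name(r2)"))   # True
print(find("dept(r1)") == find("dept(r2)"))   # True
```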
I've placed a new jarfile here: http://categoricaldata.net/cql_experimental.jar . It requires java 14, its chase function is broken, sigma will always be slow and I haven't added any of the discussed new functionality to it. However, it has a lot of latent changes that should improve performance in general, as well as some fixes to eg jdbc import.
Ok, thanks, I'll use it for the next `delta` mapping I need to process. I think this issue has been sufficiently explored now. :)