container updates aren't propagating to FLAT

dustymc commented 5 years ago

See email from @DerekSikes "some with and some without labels" 2019-02-21 - changes to the container stack (new container type) are not propagating to flat.partdetail

dustymc commented 5 years ago

I can't find a realistic way to do this. We're over-caching for performance reasons and there's nothing that can link a container type change (for example) to a part without adding a great deal of expense/complexity. Suggest we periodically auto-rebuild flat, find the processors to directly address containers, or ???????

campmlc commented 5 years ago

How do we get more processors? Is this a discussion with TACC? A grant?

On Mon, Feb 25, 2019 at 8:15 AM dustymc notifications@github.com wrote:

I can't find a realistic way to do this. We're over-caching for performance reasons and there's nothing that can link a container type change (for example) to a part without adding a great deal of expense/complexity. Suggest we periodically auto-rebuild flat, find the processors to directly address containers, or ???????

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/ArctosDB/arctos/issues/1945#issuecomment-467049193, or mute the thread https://github.com/notifications/unsubscribe-auth/AOH0hNfzeWFVyGfD7xKW496s_5LOWizTks5vQ_4CgaJpZM4bIZZC .

dustymc commented 5 years ago

This problem occurs at the many:many intersection of...

UAM@ARCTOS>  select count(*) from specimen_part;

  COUNT(*)
----------
   5026559

and...

UAM@ARCTOS> select count(*) from container;

  COUNT(*)
----------
   9224840

two large tables that have cruddy index stats because they're updated constantly, and both are optimized in favor up update rather than query.

Container relates to itself, so the numbers of rows is a simpler picture than the number of things the optimizer actually has to sort through as well.

TACC is into "big data" and might be interested in working on a solution to things like this, but I certainly wouldn't know what to put in a request or grant. I'd guess something like striping the data across many solid-state drives would be the most likely to pay off, but that's best addressed by an Oracle performance expert. Maybe paying someone for a system tune-up is a sellable proposal. I don't know if that would fix this or not, but it should improve overall performance.

This seems to be a relatively rare problem - it only happens when an upstream container type changes, and we don't seem to do that very much. There are probably other things that should trigger FLAT to refresh and don't - that's sort of expected when trying to squish a complex system into an overgrown spreadsheet. It's not exactly a fix, but just refreshing FLAT when we have spare processors would probably make this unnoticeable, and would fix anything else that should have been recached but wasn't. TACC is moving us to Oracle 12 now, and it has some additional priority settings that we can probably use to get at "when we have spare processors."

If that's not sufficient, I can take a closer look at the "great deal of expense/complexity" option. Oracle 12 should make that easier as well - it "speaks" the language in which we store the cached data, which minimally should make it easier to tell when something's out of sync.

campmlc commented 5 years ago

OK - you had mentioned needing more processors before, so I'm happy to start that conversation with TACC if we have a clear need and direction. And we should add it to our list of things to budget for if there will be an associated cost.

On Mon, Feb 25, 2019 at 11:02 AM dustymc notifications@github.com wrote:

This problem occurs at the many:many intersection of...

UAM@ARCTOS> select count(*) from specimen_part;

COUNT(*)

5026559

and...

UAM@ARCTOS> select count(*) from container;

COUNT(*)

9224840

two large tables that have cruddy index stats because they're updated constantly, and both are optimized in favor up update rather than query.

Container relates to itself, so the numbers of rows is a simpler picture than the number of things the optimizer actually has to sort through as well.

TACC is into "big data" and might be interested in working on a solution to things like this, but I certainly wouldn't know what to put in a request or grant. I'd guess something like striping the data across many solid-state drives would be the most likely to pay off, but that's best addressed by an Oracle performance expert. Maybe paying someone for a system tune-up is a sellable proposal. I don't know if that would fix this or not, but it should improve overall performance.

This seems to be a relatively rare problem - it only happens when an upstream container type changes, and we don't seem to do that very much. There are probably other things that should trigger FLAT to refresh and don't - that's sort of expected when trying to squish a complex system into an overgrown spreadsheet. It's not exactly a fix, but just refreshing FLAT when we have spare processors would probably make this unnoticeable, and would fix anything else that should have been recached but wasn't. TACC is moving us to Oracle 12 now, and it has some additional priority settings that we can probably use to get at "when we have spare processors."

If that's not sufficient, I can take a closer look at the "great deal of expense/complexity" option. Oracle 12 should make that easier as well - it "speaks" the language in which we store the cached data, which minimally should make it easier to tell when something's out of sync.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/ArctosDB/arctos/issues/1945#issuecomment-467113805, or mute the thread https://github.com/notifications/unsubscribe-auth/AOH0hL4-cZEqUXDU1mLEvSy0NJ7EjYdZks5vRCUXgaJpZM4bIZZC .

dustymc commented 5 years ago

more processors

Yea, I'm probably too sloppy with that terminology. Throwing hardware at things is almost always the cheapest and easiest option, and sometimes it works. The last time we paid for an Oracle consultation, and that was a LONG time ago, it turned out what we needed was more, smaller disks that could be read simultaneously (and adding them was like flipping the "go fast" switch). I suppose my perspective flops around depending on what's causing problems at the moment, but if we were going to spend $$ on one performance-related thing, an Oracle consultant is probably the single best place to do so.

Jegelewicz commented 2 years ago

@dustymc is this still a problem?

Jegelewicz commented 2 years ago

@dustymc can we close this?

dustymc commented 2 years ago

No, I still don't have a solution.

Jegelewicz commented 2 years ago

Can I get a better description of the problem?

dustymc commented 2 years ago

Calling this next release - I don't have a real solution, but Arctos should now be using extra CPU to process the oldest records in flat, which will (hopefully) result in things like this being picked up before they are noticed.

ArctosDB / arctos

container updates aren't propagating to FLAT #1945

COUNT(*)

COUNT(*)