HakaiInstitute / hakai-datasets

Hakai Datasets that are going into https://catalogue.hakai.org/erddap/
0 stars 0 forks source link

replace erddap schemas tables by views #115

Closed JessyBarrette closed 1 year ago

JessyBarrette commented 1 year ago

The original configuration replaced yesterday was generating TABLES.

This present will generate VIEWS of the similar result.

Questions related:

  1. Do we need to drop the already existing Tables as available on hakai db?
  2. Any idea how this will affect the response time of a query to those tables?
n-a-t-e commented 1 year ago

I'm not where you are trying this code, but as far as I can see Postgres doesn't have syntax like CREATE OR REPLACE for tables

n-a-t-e commented 1 year ago

Also, these are tables and not views for a reason, mostly because:

fostermh commented 1 year ago

@n-a-t-e has some good points.

there is a

DROP TABLE foo;
CREATE TABLE IF NOT EXISTS foo;

but no REPLACE TABLE option in Postgres, unfortunately.

Views do not have indexing of their own and rely on the indexing of the underlying tables. Likewise, all the joins are processed at the time of query which in the case of CTD data can take a significant amount of time. Obviously, I have not run any benchmarks to see how ERDDAP performs querying a view rather than a table but my initial thought is that ERDDAP will perform much better if querying a table and scale better under load.

One option here would be a MATERIALIZED VIEW which is a view that is saved to disk and looks a lot like a table. You would however be coupled to the EIMS schema as Nate pointed out. Materialized views also still have to be rebuilt periodically so there is no real advantage of an MV over a table in this case. You would still have to reprocess the data nightly or whatever the current schedule for recreating the tables is. Downtime is the only advantage I see to a MV. When rebuilding an MV the current view exists until the new one is built and then swapped for the existing MV. If downtime is a consideration then an MV might make sense.

My last thought on this is around partitioning. For very large tables, it is possible to partition the underlying table so that sections of the table are stored in different places on disk. Splitting the table reduces index size and potentially makes reads much faster. For example, if we noticed that most people look at the last year of data then we could partition the ctd table by year. Then when a query comes in the db pulls the full partition from disk which is faster than searching the index, pulling multiple pages and discarding some of the data within. Given that ERDDAP caches data and requests I expect this level of optimization is unnecessary however.

In short, my advice is to stick with tables and use indexing where appropriate. :-)

n-a-t-e commented 1 year ago

We had used materialized views in the past for this, it doesn't work with the eims rebuilds which is why this is using tables. The tables are working perfectly, and they aren't big enough to justify partitioning. No performance issues, and there is no specific column to index that I can see, as there is no 'typical query' to these datasets.