Views get dropped when harvesters are re-deployed #52

Open xhoenner opened 10 years ago

xhoenner commented 10 years ago

Issue

Project officers occasionally have to modify existing harvesters to incorporate new features (e.g. new Talend components) or to standardise the data processing workflow (e.g. timestamp with time zone). If changes are made to the Liquibase component used in a Talend harvester, the corresponding database schema in rc and prod has to be dropped before the harvester is re-deployed. Dropping the schema and its entire contents means that 'external views' (views outside the harvester's schema) querying tables and views inside the harvester's schema get dropped in the process too.
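For illustration, a minimal sketch of the cascade in PostgreSQL (the schema, table and view names below are made up for the example, not the actual harvester objects):

```sql
-- Hypothetical harvester schema with one table.
CREATE SCHEMA aatams_sattag_nrt;
CREATE TABLE aatams_sattag_nrt.deployments (id integer, deployment_name text);

-- 'External view' living in a separate reporting schema but querying the harvester's table.
CREATE SCHEMA reporting;
CREATE VIEW reporting.aatams_sattag_nrt_deployments AS
    SELECT id, deployment_name FROM aatams_sattag_nrt.deployments;

-- Re-deploying the harvester starts by dropping its schema; the CASCADE also
-- removes the dependent view in the reporting schema.
DROP SCHEMA aatams_sattag_nrt CASCADE;
-- reporting.aatams_sattag_nrt_deployments no longer exists.
```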

Why should we care?

This drop cascade particularly affects (1) the reporting views that Jacqui uses to create data reports for the IMOS office, and (2) the _metadata_summary views used for the CSV metadata header. While those views can be recreated easily, I am not always aware of when a harvester will be re-deployed, which means that a particular view might be unavailable for quite some time.

Suggested solutions

1/ We could create reporting and metadata_summary tables instead of views, but these tables would need to be updated regularly, which is not practical.

2/ We could incorporate all these views in the corresponding harvester (e.g. the aatams_sattag_nrt reporting and metadata_summary views could be created in the aatams_sattag_nrt harvester) so that they are created or replaced every time the harvester runs.

3/ Alternatively, we could have a single, separate harvester running daily with a script that creates all the reporting and metadata_summary views. I suspect that might be the best option, as I've had to turn some of the reporting views into tables (e.g. for ACORN and Argo) because the queries were taking too long to run for Jacqui. The disadvantage here is that such a harvester would need to be maintained, and any change to it would require a PR and then deployment on rc and prod, i.e. quite a lengthy process.

4/ Materialised views: we haven't explored that possibility in the past, so I'm not really sure whether it would work here.
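On option 4/, a rough sketch of what a materialised view could look like (hypothetical names only; note that a materialised view built on the harvester's tables would still be dropped by the schema drop cascade, so it mainly addresses the slow-query problem mentioned for ACORN and Argo):

```sql
-- Hypothetical materialised view caching an expensive reporting query
-- (requires PostgreSQL 9.3 or later).
CREATE MATERIALIZED VIEW reporting.monthly_summary AS
    SELECT date_trunc('month', "time") AS month, count(*) AS n_obs
    FROM some_harvester_schema.measurements
    GROUP BY 1;

-- Refresh on demand, e.g. just before the monthly reports are produced.
REFRESH MATERIALIZED VIEW reporting.monthly_summary;
```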

julian1 commented 10 years ago

How about using chef to deploy the reporting views (already versioned on github) + the metadata views, in the same fashion as the imos extensions are already deployed?

Since dropping the schema is normally followed by a chef provision to re-create it with the appropriate permissions etc., the views would also be updated in the correct sequence with any new harvester deployment.

Aligning the management of imos extension sql with other sources of database sql code is a useful goal generally - since it eliminates duplicate/redundant sql code management and improves automation.

julian1 commented 10 years ago

In fact using chef to deploy view code would not be a complete fix.

This is because the underlying schema is populated when the harvester runs its liquibase update, not at harvester deployment and schema creation time.

The steps are reasonably proximate in time, though. So if the chef db provisioning were scheduled to run every morning, it would reduce the period during which the reporting/download metadata views are unavailable.

Another approach might be to hook the harvester liquibase scripts so they could be run during provisioning.


mhidas commented 10 years ago

To me it would make most sense to keep everything relating to a data collection together in one place. By this I mean that views for reporting and metadata summaries relating to the collection should be in the same schema as the metadata and data, and they should be defined in the liquibase script run by the harvester. This way, when things change, you only have to change them in one place, and everything gets updated when the harvester runs.
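A sketch of what that could look like as a Liquibase formatted-SQL changeset that (re)creates a reporting view alongside the data (the view definition, names and changelog layout are illustrative, not taken from an actual harvester):

```sql
--liquibase formatted sql

--changeset harvester:reporting-view runOnChange:true
-- Hypothetical reporting view kept in the same schema as the data it summarises.
CREATE OR REPLACE VIEW reporting_summary AS
    SELECT deployment_name,
           min("time") AS first_observation,
           max("time") AS last_observation,
           count(*)    AS n_observations
    FROM measurements
    GROUP BY deployment_name;
--rollback DROP VIEW reporting_summary;
```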

(Of course the current situation for both ANMN and ABOS is different, because we have dedicated generic metadata harvesters used for reporting, independent of the data harvesters used for creating portal data collections. I think there are good reasons for keeping these as they are, but that's a separate discussion...)

xhoenner commented 10 years ago

@mhidas Ideally yes, I agree that having the reporting and metadata_summary views within the data collection schemas would be neater; however, that would mean pointing @jachope's iReport thing and the Geoserver csv header plugin to separate schemas, which is quite impractical.

@julian1 'if instead of dropping the underlying schema only the contained db objects were dropped.' --> the problem is not so much dropping the schema per se, but more dropping the tables and views used by my reporting queries.

To me it looks like the only way around that drop cascade issue is to have a harvester running on a daily basis to create or replace views and update materialised tables. The issue with that option is that any change to a SQL query would mean getting someone to review and re-deploy the harvester, which commonly takes several days if not more, and reporting is quite often an urgent matter. Would it be possible to run the harvester using Chef daily, and in case of urgency/errors I would run the harvester from my machine to populate dbprod?

julian1 commented 10 years ago

@xhoenner The view code is already in github and you have a free hand to commit changes there. To my mind it's chef that should deploy it, rather than a harvester. That's the model already chosen to deploy stand-alone sql code such as postgis support and the imos extension code.

Whatever solution is adopted, we need to ensure it's easy to continue prototyping reporting code without having to wait around for scheduled updates.

Also there's a need to ensure that the materialized views can be easily created. I'd like to see a function that can be run by hand, either by you or Jac, just before compiling reports. It ought to be as easy as running select create reporting.matview(); and then running iReports.
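Something along those lines could be a small plpgsql helper that refreshes every materialised view in the reporting schema in one call (a sketch only; the function name and schema are assumptions, not existing code):

```sql
-- Hypothetical helper: refresh all materialised views in the 'reporting' schema.
CREATE OR REPLACE FUNCTION reporting.refresh_matviews() RETURNS void AS $$
DECLARE
    mv record;
BEGIN
    FOR mv IN
        SELECT schemaname, matviewname
        FROM pg_matviews
        WHERE schemaname = 'reporting'
    LOOP
        EXECUTE format('REFRESH MATERIALIZED VIEW %I.%I', mv.schemaname, mv.matviewname);
    END LOOP;
END;
$$ LANGUAGE plpgsql;

-- Just before compiling reports:
SELECT reporting.refresh_matviews();
```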

xhoenner commented 10 years ago

That sounds like a good option to me @julian1, what do we need to do to get that implemented?

mhidas commented 9 years ago

Have we got any further in this since last year?

These days we don't drop entire schemas very often, but things do still change sometimes and even dropping a single view can lead to the same issue discussed above.

It looks to me like the metadata_summary views are not really an issue, as they don't directly reference any tables/views in other schemas. So this issue is mainly about the reporting views. They're only really used once a month. Do we need to have them there all the time? Could we just create them, run the reports, then drop them for the rest of the month?

(Or, if we want to make sure we catch any problems due to things that have changed, create them every morning, just to check that they still all work, then drop them straight away unless we're doing reports.)

xhoenner commented 9 years ago

It looks to me like the metadata_summary views are not really an issue, as they don't directly reference any tables/views in other schemas.

Some of the AATAMS metadata_summary views reference tables/views in other schemas to provide metadata on each individual deployment. See parameters_mapping.aatams_sattag_nrt_metadata_summary for instance.

Do we need to have the reporting views there all the time?

No, we could drop the reporting views once @jachope is done producing the reports.