aodn / harvesters


AODN_WAVE harvester design #607

Closed: bpasquer closed this issue 6 years ago

bpasquer commented 6 years ago

Harvest of QLD delayed-mode wave data using the CKAN data API: the QLD government database is accessed through the CKAN API, which allows searching and downloading datasets and (some) metadata by querying the data with SQL, Python or JSONP. The JSONP solution has been chosen for implementation in a TALEND harvester.
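For illustration, the same data can also be pulled with a plain-JSON request rather than JSONP; a minimal Python sketch (not the Talend implementation) against the CKAN datastore_search action, using the resource id quoted later in this issue, might look like:

    import requests

    CKAN_API = "https://data.qld.gov.au/api/3/action"

    def fetch_records(resource_id, limit=100):
        """Download wave records for one CKAN resource via datastore_search."""
        response = requests.get(
            f"{CKAN_API}/datastore_search",
            params={"resource_id": resource_id, "limit": limit},
        )
        response.raise_for_status()
        payload = response.json()
        if not payload["success"]:
            raise RuntimeError("CKAN query failed")
        return payload["result"]["records"]

    # resource id taken from the resource_show example further down
    records = fetch_records("9385fa18-eaf3-41cb-bf80-5fc2822fd786")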

In the catalog, collections can be identified either by their names or by their 'package_id'. A package consists of all data and some metadata from one site. The data is subdivided into resources, usually corresponding to one year of data. Resources are identified by their resource_id. I have listed the target datasets in a 'metadata' csv file read by the harvester (QLD_buoys_metadata.csv). The information in the file is the same as for the near-real-time metadata csv file, with the addition of package_name, longitude, latitude and package_id.

Ex: Cairns, 55028, Department of Environment and Science, QLD, directional waverider buoy, 30minutes, coastal-data-system-waves-cairns, 145.7, -16.73, a34ae7af-1561-443c-ad58-c19766d8354c
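A sketch of how the harvester side could read this metadata file; the column names below are assumptions inferred from the fields described above, not necessarily the actual header of QLD_buoys_metadata.csv:

    import csv

    # column names inferred from the description above; the real file may differ
    FIELDNAMES = [
        "site_name", "site_code", "owner", "state", "instrument",
        "interval", "package_name", "longitude", "latitude", "package_id",
    ]

    def read_buoy_metadata(path="QLD_buoys_metadata.csv"):
        """Return one dict per target dataset listed in the metadata file."""
        with open(path, newline="") as f:
            return list(csv.DictReader(f, fieldnames=FIELDNAMES))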

In order to reduce the level of maintenance of the data collection, we need to design the harvester so that: 1) the existing data is updated if needed, and 2) new data is automatically added to the data collection.

The metadata provided when querying a dataset gives us information we can use to design the update. For example, a query to the API like the following returns these parameters (amongst a long list of others):

    https://data.qld.gov.au/api/3/action/package_search?q=coastal-data-system-waves-cairns

    id: faff952f-b91d-4ffe-811f-998e04f9e576
    name: "Wave data - 2018"
    description: "2018 wave data from the Cairns site (January 1 - March 31)"
    last_modified: "2018-04-18T01:04:45.521941"
    revision_id: 14aca928-abec-4909-a2b7-aaad1f97c4ab

Information is also provided for resources, for example:

    https://data.qld.gov.au/api/3/action/resource_show?id=9385fa18-eaf3-41cb-bf80-5fc2822fd786

    package_id: "b2282253-e761-4d75-89ff-8f77cf43d501"
    last_modified: "2018-04-18T01:06:54.791621"
    name: "Wave data - 2018"
    revision_id: "cf438bea-35b0-4fcc-8cf7-1dc62a8801fe"
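These lookups are straightforward to reproduce in Python against the same action API; a sketch only, with field names as in the responses quoted above:

    import requests

    CKAN_API = "https://data.qld.gov.au/api/3/action"

    def package_resources(package_name):
        """List the resources (id, name, last_modified, ...) of a package."""
        r = requests.get(f"{CKAN_API}/package_search", params={"q": package_name})
        r.raise_for_status()
        results = r.json()["result"]["results"]
        # assumes the first hit is the package we searched for
        return results[0]["resources"]

    def resource_metadata(resource_id):
        """Return the full metadata of a single resource."""
        r = requests.get(f"{CKAN_API}/resource_show", params={"id": resource_id})
        r.raise_for_status()
        return r.json()["result"]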

Updates: in the example above we can see that the 2018 dataset comprises data up to March 31, suggesting that datasets are updated regularly.

What kind of approach to dataset maintenance/updates do we want to adopt regarding:

- Frequency of harvest: monthly, every 3 months, every 6 months?

- Harvesting based on specific criteria, or regular re-harvest of the whole dataset (which implies deleting the previous data)?

Data from previous years (historical data) could be considered static and not be re-harvested. Current-year data could be regularly re-harvested in full, or re-harvested only if it has been updated. In the latter case, which parameter is the best diagnostic for whether a dataset has been updated: checking the "last_modified" parameter, or checking for a change in "revision_id" (assuming it changes at each update, a question I should actually ask)?
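One possible shape for the "only if updated" check, assuming we keep the last_modified value seen at the previous harvest (where that value is stored is left open here):

    from datetime import datetime

    LAST_MODIFIED_FMT = "%Y-%m-%dT%H:%M:%S.%f"

    def needs_reharvest(resource, previously_seen):
        """Decide whether a resource should be re-harvested.

        `resource` is a dict as returned by resource_show; `previously_seen`
        is the last_modified string recorded at the previous harvest, or None
        if the resource has never been harvested."""
        if previously_seen is None:
            return True
        current = datetime.strptime(resource["last_modified"], LAST_MODIFIED_FMT)
        previous = datetime.strptime(previously_seen, LAST_MODIFIED_FMT)
        return current > previous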

The following solution has been implemented so far:

Issues with datasets:

The harvester is not finalized. The outstanding issues are:

jonescc commented 6 years ago

Some alternate approaches

We should probably be using the pipeline approach these days, for consistency with other collections.

Comments on your current approach

harvest_csv_metadata

harvest_metadata

    TalendDate.parseDate(inputDate.replaceFirst("(\\d{3})\\d*$", "$1+1000"), "yyyy-MM-dd'T'HH:mm:ss.SSSz")

       doesn't allow for daylight savings adjustments but probably OK for Queensland, or,

    // apply the Queensland timezone to the parser, otherwise the JVM default timezone is used
    TimeZone qldTZ = TimeZone.getTimeZone("Australia/Queensland");
    SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
    format.setTimeZone(qldTZ);
    output_row.last_modified = format.parse(input_row.last_modified);

       which could be pulled out into a Talend code routine so you could just use it in a tMap.

Data_Harvest

lbesnard commented 6 years ago

Current issues with the Queensland Wave dataset

Without taking into consideration which way is best, here is a list of the various issues I came across with the Queensland wave dataset (minus the ones already found by Bene). There might be more.

issues in letting Talend do everything

Talend harvesters quickly get overly complicated when dealing with external web-services. The idea of a Talend harvester, running as a cron job, that downloads the data, cleans it, updates it... assumes that the external web-service and its data are perfect.

This is unfortunately/fortunately not the case.

I'm of the opinion that, for ALL external web-services we retrieve data from and host, a Talend harvester is not the suitable tool to handle the whole process. It should only be used to process physical files.

All our systems and tools are based around physical files:

As POs, we don't have other tools, or even the credentials, to clean any data in the database (which is great in my opinion). If we used Talend for this, we would be pretty much locked in.

Using Talend to do everything means POs actually can't do anything with the data. It becomes almost impossible to update or remove any data, and any full reprocessing of the data becomes really complicated without physical files to re-harvest. We sometimes deal with NRT web-services which remove their old on-going data that we have decided to keep; in that case, a data reprocess would be challenging. When dealing with physical files, everything else becomes easier.

Finally, it is also much harder to review a Talend harvester than a Python script.

recommended design for all external web-service data retrieval

harvest the data from an external web-service -> PYTHON

Python has all the toolboxes (pandas, numpy, ...) needed to quickly write some code to download and read data from various sources and in various formats.

Many web-services fail when they are triggered too many times too quickly. This is easily handled in Python by adding a retry decorator to a download function, so that the download is retried a defined number of times.
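A minimal version of such a retry decorator (the retry counts and delays here are arbitrary):

    import time
    from functools import wraps

    import requests

    def retry(times=3, delay=10):
        """Retry the wrapped function `times` times, sleeping `delay`
        seconds between attempts, before giving up."""
        def decorator(func):
            @wraps(func)
            def wrapper(*args, **kwargs):
                last_error = None
                for _ in range(times):
                    try:
                        return func(*args, **kwargs)
                    except requests.RequestException as err:
                        last_error = err
                        time.sleep(delay)
                raise last_error
            return wrapper
        return decorator

    @retry(times=5, delay=30)
    def download(url):
        response = requests.get(url, timeout=60)
        response.raise_for_status()
        return response.content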

cleaning the dataset -> PYTHON

The data we collect from external contributors is never perfect. It is full of inconsistencies and this will always be the case.

Writing any logic in Java to handle the various cases stated above becomes complicated and extremely time-consuming for us POs; it's a matter of days versus 10 minutes. It will also, most likely, be poorly written. Debugging is also rather complicated in Talend.
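As a rough illustration of the "10 minutes" side of that ratio, a pandas sketch of typical clean-up steps (the column names and fill value below are hypothetical):

    import pandas as pd

    def clean_wave_data(df):
        """Typical clean-up of a downloaded wave dataset; the column names
        (DateTime, Hsig) and the -99.9 fill value are only examples."""
        # coerce timestamps and numbers, turning bad values into NaT/NaN
        df["DateTime"] = pd.to_datetime(df["DateTime"], errors="coerce")
        df["Hsig"] = pd.to_numeric(df["Hsig"], errors="coerce")
        # drop rows without a usable timestamp, and duplicated timestamps
        df = df.dropna(subset=["DateTime"]).drop_duplicates(subset=["DateTime"])
        # replace the contributor's fill value with NaN
        df = df.replace(-99.9, float("nan"))
        return df.sort_values("DateTime")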

creating physical files -> PYTHON

We are currently in the process of writing a common NetCDF generator from JSON files. The process will become even easier.
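A generic xarray sketch of writing a cleaned DataFrame out as a NetCDF file; this is not the common JSON-based generator mentioned above, and the variable names and attributes are placeholders:

    import xarray as xr

    def write_netcdf(df, output_path):
        """Write a cleaned wave DataFrame to a NetCDF file."""
        ds = xr.Dataset.from_dataframe(df.set_index("DateTime"))
        ds.attrs["title"] = "QLD delayed mode wave data"  # placeholder attribute
        ds.to_netcdf(output_path)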

Harvesting the data to our database -> Pipeline v2 -> Talend

Another benefit of using the pipeline is that we can use the IOOS checker to find more issues with the data rather than blindly committing it to the database. We have also talked in the past about creating a checker on the actual data values, as well as an indexing database. If we want to reduce the number of harvesters, this is also the better way to go.

bpasquer commented 6 years ago

The solution proposed by Laurent has been implemented.