aodn / harvesters


AODN_WAVE harvester design #607

Closed: bpasquer closed this issue 6 years ago

bpasquer commented 6 years ago

Harvest of QLD delayed-mode wave data using the CKAN data API: the QLD government database is accessed through the CKAN API, which allows searching and downloading datasets and (some) metadata by querying the data with SQL, Python or JSONP. The JSONP solution has been chosen for implementation in a TALEND harvester.
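For illustration, the same data can also be pulled with a plain-JSON request rather than JSONP; a minimal Python sketch (not the Talend implementation) against the CKAN datastore_search action, using the resource id quoted later in this issue, might look like:

    import requests

    CKAN_API = "https://data.qld.gov.au/api/3/action"

    def fetch_records(resource_id, limit=100):
        """Download wave records for one CKAN resource via datastore_search."""
        response = requests.get(
            f"{CKAN_API}/datastore_search",
            params={"resource_id": resource_id, "limit": limit},
        )
        response.raise_for_status()
        payload = response.json()
        if not payload["success"]:
            raise RuntimeError("CKAN query failed")
        return payload["result"]["records"]

    # resource id taken from the resource_show example further down
    records = fetch_records("9385fa18-eaf3-41cb-bf80-5fc2822fd786")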

In the catalog, collections can be identified either by their names or by their 'package_id'. A package consists of all data and some metadata from one site. The data is subdivided into resources, usually corresponding to one year of data. Resources are identified by their resource_id. I have listed the target datasets in a 'metadata' csv file read by the harvester (QLD_buoys_metadata.csv). The information in the file is the same as for the near-real-time metadata csv file, with the addition of package_name, longitude, latitude and package_id.

Ex: Cairns, 55028, Department of Environment and Science, QLD, directional waverider buoy, 30minutes, coastal-data-system-waves-cairns, 145.7, -16.73, a34ae7af-1561-443c-ad58-c19766d8354c
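A sketch of how the harvester side could read this metadata file; the column names below are assumptions inferred from the fields described above, not necessarily the actual header of QLD_buoys_metadata.csv:

    import csv

    # column names inferred from the description above; the real file may differ
    FIELDNAMES = [
        "site_name", "site_code", "owner", "state", "instrument",
        "interval", "package_name", "longitude", "latitude", "package_id",
    ]

    def read_buoy_metadata(path="QLD_buoys_metadata.csv"):
        """Return one dict per target dataset listed in the metadata file."""
        with open(path, newline="") as f:
            return list(csv.DictReader(f, fieldnames=FIELDNAMES))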

In order to reduce the level of maintenance of the data collection, we need to design the harvester so that: 1) the existing data is updated if needed, and 2) new data is automatically added to the data collection.

The metadata provided when querying a dataset gives us information we can use to design the update. For example, a query to the API like the following returns these parameters (amongst a long list of others):

    https://data.qld.gov.au/api/3/action/package_search?q=coastal-data-system-waves-cairns

    id: faff952f-b91d-4ffe-811f-998e04f9e576
    name: "Wave data - 2018"
    description: "2018 wave data from the Cairns site (January 1 - March 31)"
    last_modified: "2018-04-18T01:04:45.521941"
    revision_id: 14aca928-abec-4909-a2b7-aaad1f97c4ab

Information is also provided for resources, for example:

    https://data.qld.gov.au/api/3/action/resource_show?id=9385fa18-eaf3-41cb-bf80-5fc2822fd786

    package_id: "b2282253-e761-4d75-89ff-8f77cf43d501"
    last_modified: "2018-04-18T01:06:54.791621"
    name: "Wave data - 2018"
    revision_id: "cf438bea-35b0-4fcc-8cf7-1dc62a8801fe"
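These lookups are straightforward to reproduce in Python against the same action API; a sketch only, with field names as in the responses quoted above:

    import requests

    CKAN_API = "https://data.qld.gov.au/api/3/action"

    def package_resources(package_name):
        """List the resources (id, name, last_modified, ...) of a package."""
        r = requests.get(f"{CKAN_API}/package_search", params={"q": package_name})
        r.raise_for_status()
        results = r.json()["result"]["results"]
        # assumes the first hit is the package we searched for
        return results[0]["resources"]

    def resource_metadata(resource_id):
        """Return the full metadata of a single resource."""
        r = requests.get(f"{CKAN_API}/resource_show", params={"id": resource_id})
        r.raise_for_status()
        return r.json()["result"]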

Updates: in the example above we can see that the 2018 dataset comprises data up to March 31, suggesting that datasets are updated regularly.

What kind of approach to dataset maintenance/updates do we want to adopt regarding:

- Frequency of harvest: monthly, every 3 months, every 6 months?

- Harvesting based on specific criteria, or regular re-harvest of the whole dataset (which implies deleting the previous data)?

Data from previous years (historical data) could be considered static and not be re-harvested. Current-year data could be regularly re-harvested in full, or re-harvested only if it has been updated. In the latter case, which parameter is the best diagnostic for whether a dataset has been updated: checking the "last_modified" parameter, or checking for a change in "revision_id" (assuming it changes at each update, a question I should actually ask)?
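One possible shape for the "only if updated" check, assuming we keep the last_modified value seen at the previous harvest (where that value is stored is left open here):

    from datetime import datetime

    LAST_MODIFIED_FMT = "%Y-%m-%dT%H:%M:%S.%f"

    def needs_reharvest(resource, previously_seen):
        """Decide whether a resource should be re-harvested.

        `resource` is a dict as returned by resource_show; `previously_seen`
        is the last_modified string recorded at the previous harvest, or None
        if the resource has never been harvested."""
        if previously_seen is None:
            return True
        current = datetime.strptime(resource["last_modified"], LAST_MODIFIED_FMT)
        previous = datetime.strptime(previously_seen, LAST_MODIFIED_FMT)
        return current > previous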

The following solution has been implemented so far:

Issues with datasets:

The harvester is not finalized. The outstanding issues are:

jonescc commented 6 years ago

Some alternate approaches

We should probably be using the pipeline approach these days, for consistency with other collections.

Comments on your current approach

harvest_csv_metadata

harvest_metadata

    TalendDate.parseDate(inputDate.replaceFirst("(\\d{3})\\d*$", "$1+1000"), "yyyy-MM-dd'T'HH:mm:ss.SSSz")

       doesn't allow for daylight savings adjustments but probably OK for Queensland, or,

    // apply the Queensland timezone to the parser, otherwise the JVM default timezone is used
    TimeZone qldTZ = TimeZone.getTimeZone("Australia/Queensland");
    SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
    format.setTimeZone(qldTZ);
    output_row.last_modified = format.parse(input_row.last_modified);

       which could be pulled out into a Talend code routine so you could just use it in a tMap.

Data_Harvest

lbesnard commented 6 years ago

Current issues with the Queensland Wave dataset

Without taking into consideration which way is best, here is a list of the various issues I came across with the Queensland wave dataset (minus the ones already found by Bene). There might be more.

issues in letting Talend do everything

Talend harvesters quickly get overly complicated when dealing with external web-services. The idea of a Talend harvester, running as a cron job, that downloads the data, cleans it, updates it... assumes that the external web-service and its data are perfect.

This is unfortunately/fortunately not the case.

I'm of the opinion that, for ALL external web-services we retrieve data from and host, a Talend harvester is not the suitable tool to handle the whole process. It should only be used to process physical files.

All our systems and tools are based around physical files:

As POs, we don't have other tools, or even the credentials, to clean any data in the database (which is great in my opinion). If we used Talend for this, we would be pretty much locked in.

Using Talend to do everything means POs actually can't do anything with the data. It becomes almost impossible to update or remove any data, and any full reprocessing of the data becomes really complicated without physical files to re-harvest. We sometimes deal with NRT web-services which remove their old on-going data that we have decided to keep; in that case, a data reprocess would be challenging. When dealing with physical files, everything else becomes easier.

Finally, it is also much harder to review a Talend harvester than a Python script.

recommended design for all external web-service data retrieval

harvest the data from an external web-service -> PYTHON

Python has all the toolboxes (pandas, numpy, ...) needed to quickly write some code to download and read data from various sources and in various formats.

Many web-services fail when they are triggered too many times too quickly. This is easily handled in Python by adding a retry decorator to a download function, so that the download is retried a defined number of times.
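A minimal version of such a retry decorator (the retry counts and delays here are arbitrary):

    import time
    from functools import wraps

    import requests

    def retry(times=3, delay=10):
        """Retry the wrapped function `times` times, sleeping `delay`
        seconds between attempts, before giving up."""
        def decorator(func):
            @wraps(func)
            def wrapper(*args, **kwargs):
                last_error = None
                for _ in range(times):
                    try:
                        return func(*args, **kwargs)
                    except requests.RequestException as err:
                        last_error = err
                        time.sleep(delay)
                raise last_error
            return wrapper
        return decorator

    @retry(times=5, delay=30)
    def download(url):
        response = requests.get(url, timeout=60)
        response.raise_for_status()
        return response.content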

cleaning the dataset -> PYTHON

The data we collect from external contributors is never perfect. It is full of inconsistencies and this will always be the case.

Writing any logic in Java to handle the various cases stated above becomes complicated and extremely time-consuming for us POs; it's a matter of days versus 10 minutes. It will also, most likely, be poorly written. Debugging is also rather complicated in Talend.
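As a rough illustration of the "10 minutes" side of that ratio, a pandas sketch of typical clean-up steps (the column names and fill value below are hypothetical):

    import pandas as pd

    def clean_wave_data(df):
        """Typical clean-up of a downloaded wave dataset; the column names
        (DateTime, Hsig) and the -99.9 fill value are only examples."""
        # coerce timestamps and numbers, turning bad values into NaT/NaN
        df["DateTime"] = pd.to_datetime(df["DateTime"], errors="coerce")
        df["Hsig"] = pd.to_numeric(df["Hsig"], errors="coerce")
        # drop rows without a usable timestamp, and duplicated timestamps
        df = df.dropna(subset=["DateTime"]).drop_duplicates(subset=["DateTime"])
        # replace the contributor's fill value with NaN
        df = df.replace(-99.9, float("nan"))
        return df.sort_values("DateTime")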

creating physical files -> PYTHON

We are currently in the process of writing a common NetCDF generator from JSON files. The process will become even easier.
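A generic xarray sketch of writing a cleaned DataFrame out as a NetCDF file; this is not the common JSON-based generator mentioned above, and the variable names and attributes are placeholders:

    import xarray as xr

    def write_netcdf(df, output_path):
        """Write a cleaned wave DataFrame to a NetCDF file."""
        ds = xr.Dataset.from_dataframe(df.set_index("DateTime"))
        ds.attrs["title"] = "QLD delayed mode wave data"  # placeholder attribute
        ds.to_netcdf(output_path)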

Harvesting the data to our database -> Pipeline v2 -> Talend

Another benefit of using the pipeline is that we can use the IOOS checker to find more issues with the data rather than blindly committing it to the database. We have also talked in the past about creating a checker on the actual data values, as well as an indexing database. If we want to reduce the number of harvesters, this is also the better way to go.

bpasquer commented 6 years ago

The solution proposed by Laurent has been implemented.