GeoSensorWebLab / data-transloader

Download and convert weather data for SensorThings API
GNU General Public License v3.0

Observations from Data Garrison fail to upload #1

Closed: openfirmware closed this issue 5 years ago

openfirmware commented 5 years ago

The ETL using Airflow is properly downloading Observations from Data Garrison, as the readings are cached in the filesystem. However, no Observations have been uploaded since early October.

This seems to be caused by the ETL running an "upload" task every hour that only uploads observations from the last 60 minutes, while the Data Garrison station only publishes every 120 minutes (each publish containing observations at 15-minute intervals). By default, DAGs in Apache Airflow that run every 60 minutes will only pass a 60-minute time interval to the "upload" ETL task.

Here is an example where Data Garrison publishes every 2 hours: 12:50, 14:50, 16:50, and so on. If the DAG runs at 13:01 UTC, then the "download" task may download observations for 12:00, 12:15, 12:30, and 12:45. (The download task does not use interval downloading, so it does not miss any observations.) At 13:15 UTC [1], Airflow runs the "upload" task in the DAG using the interval 12:01 to 13:01, which uploads only 3 of the 4 observations.

At 14:01 UTC, no new observations are downloaded as the weather station has not pushed anything to Data Garrison. The "download" task finishes successfully and doesn't add any new observations to the filesystem cache. As that task succeeded, the "upload" task runs for the interval of 13:01 to 14:01 and uploads nothing, as nothing new is in the filesystem cache for that time interval.

At 15:01 UTC, more observations are available for download. The "download" task retrieves observations for 13:00, 13:15, 13:30, 13:45, 14:00, 14:15, 14:30, and 14:45 for a total of 8 new observations. The upload task then runs and uploads observations for 14:01 to 15:01, only uploading 3 of the 8 observations.
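The mismatch above can be sketched as a small simulation. The timeline and counts are taken from this issue; the data structures and function names are illustrative and not the real ETL:

```python
from datetime import datetime, timedelta

# Data Garrison publishes a 2-hour batch of 15-minute observations, but the
# hourly "upload" task only considers a 60-minute window.
batches = {
    # published 12:50: observations 12:00, 12:15, 12:30, 12:45
    datetime(2019, 10, 1, 12, 50):
        [datetime(2019, 10, 1, 12, 0) + timedelta(minutes=15 * i) for i in range(4)],
    # published 14:50: observations 13:00 through 14:45
    datetime(2019, 10, 1, 14, 50):
        [datetime(2019, 10, 1, 13, 0) + timedelta(minutes=15 * i) for i in range(8)],
}

def run_upload(window_end, lookback=timedelta(hours=1)):
    """Return observations that are both downloaded AND inside the upload window."""
    window_start = window_end - lookback
    available = [o for pub, group in batches.items() if pub <= window_end for o in group]
    return [o for o in available if window_start < o <= window_end]

for hour in (13, 14, 15):
    end = datetime(2019, 10, 1, hour, 1)
    print(end.strftime("%H:%M"), len(run_upload(end)))  # 13:01 → 3, 14:01 → 0, 15:01 → 3
```

Only 6 of the 12 cached observations are ever uploaded; the rest fall between the hourly windows.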

As this continues, the uploaded observations are intermittent and show gaps in the dashboard chart display. However, the filesystem cache is completely correct, and manually running the ETL on the server with a wide upload interval does temporarily fix the issue.

The real solution is updating the DAG to upload observations over a longer time interval, perhaps up to 6 hours? This should not affect the schema of the STA server, as the ETL client will check for existing Observations on each upload and do a merge/insert if necessary.
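Re-running the same hypothetical timeline with a 6-hour window shows why this works. The overlapping windows would re-select old observations, but the existing-Observation check (modelled here as a simple `seen` set) merges rather than duplicates them:

```python
from datetime import datetime, timedelta

# Same illustrative batches as the timeline above.
batches = {
    datetime(2019, 10, 1, 12, 50):
        [datetime(2019, 10, 1, 12, 0) + timedelta(minutes=15 * i) for i in range(4)],
    datetime(2019, 10, 1, 14, 50):
        [datetime(2019, 10, 1, 13, 0) + timedelta(minutes=15 * i) for i in range(8)],
}

seen = set()  # stands in for the STA merge/insert check
for hour in (13, 14, 15):
    end = datetime(2019, 10, 1, hour, 1)
    start = end - timedelta(hours=6)  # widened upload window
    available = [o for pub, group in batches.items() if pub <= end for o in group]
    fresh = [o for o in available if start < o <= end and o not in seen]
    seen.update(fresh)
    print(end.strftime("%H:%M"), len(fresh))  # 13:01 → 4, 14:01 → 0, 15:01 → 8
```

With the wider window, all 12 observations are uploaded exactly once.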

To debug issues like this, the "download" and "upload" tools should print to the console the number of Observations that were downloaded or uploaded after running. The uploader could even report the number of merges done. I may look into Munin integration so there is a chart showing the number of downloads/uploads over time; the count should be steady over a longer interval, and any deviation could trigger an alert in Munin instead.
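A minimal sketch of that counting, assuming an injected existence check (the function and log format here are hypothetical, not the real tool's output):

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("transloader")

def upload(observations, exists_in_sta):
    """Hypothetical upload wrapper: `exists_in_sta` is an injected predicate
    standing in for the real SensorThings API existence check."""
    inserted = merged = 0
    for obs in observations:
        if exists_in_sta(obs):
            merged += 1    # Observation already on the server: merge/update
        else:
            inserted += 1  # new Observation: insert
    log.info("Uploaded %d observations (%d inserted, %d merged)",
             inserted + merged, inserted, merged)
    return inserted, merged

# e.g. 8 observations where 3 already exist on the server:
counts = upload(range(8), lambda o: o < 3)  # → (5, 3)
```

Emitting the merge count separately makes the "steady rate" check in Munin meaningful, since a wide upload window will legitimately produce many merges.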


[1]: As there are over a hundred DAGs, Airflow can take a few minutes between running the "download" and "upload" tasks for a single DAG. This may become a bigger issue later as the data becomes less and less "near real-time". I may need to provision more cloud resources to handle this.

openfirmware commented 5 years ago

The logger for the ETL will now output the count of Observations downloaded and uploaded when LOG_LEVEL is INFO. Upload counts are also broken down into:

This logging information must be manually reviewed to find any issues. The logs can be read in Airflow DAG run logs, or from the directory where logs are stored on the production server (/srv/logs, in our case).

openfirmware commented 5 years ago

I updated the DAG template for Apache Airflow to always upload 24 hours' worth of data for Data Garrison stations. This should ensure observations are always uploaded. The downside is slightly longer processing time to verify which Observation entities already exist in SensorThings API, but for a single station this only takes about a minute (with an externally run STA).
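For reference, the per-Observation existence check could look something like this sketch. The URL pattern follows the OGC SensorThings API conventions; `fetch` is injected so the logic can be exercised without a live server, and the names differ from the real client:

```python
import urllib.parse

def observation_exists(fetch, sta_url, datastream_id, phenomenon_time):
    """Ask the STA server whether an Observation with this phenomenonTime
    already exists in the Datastream. `fetch(url)` returns parsed JSON."""
    query = urllib.parse.urlencode({
        "$filter": f"phenomenonTime eq {phenomenon_time}",
        "$select": "@iot.id",
    })
    url = f"{sta_url}/Datastreams({datastream_id})/Observations?{query}"
    return len(fetch(url).get("value", [])) > 0

# Offline example with a stubbed fetch:
hit = observation_exists(lambda url: {"value": [{"@iot.id": 42}]},
                         "https://example.org/v1.0", 1, "2019-10-01T12:00:00Z")   # True
miss = observation_exists(lambda url: {"value": []},
                          "https://example.org/v1.0", 1, "2019-10-01T12:15:00Z")  # False
```

One such query per cached observation is what makes the 24-hour window cost roughly a minute per station rather than being free, but it keeps the upload idempotent.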