JohnPaton / airbase

🌬 An easy downloader for the AirBase air quality data.
https://airbase.readthedocs.io

New download service and format #52

Open avaldebe opened 4 weeks ago

avaldebe commented 4 weeks ago

https://discomap.eea.europa.eu/map/fme/AirQualityExport.htm says

Update 02.05.2024: This service is to be discontinued at the end of 2024 and is replaced by https://eeadmz1-downloads-webapp.azurewebsites.net/

The new service provides observations as parquet files via a new API. The new API provides entry points for, among others, single parquet files (ParquetFile), lists of file URLs (ParquetFile/urls), and request summaries (DownloadSummary).

Each parquet file contains all available observations for a station/pollutant/dataset (Airbase, E1a, E2a) combination.

I'm not sure if they can be downloaded directly, or if they can only be retrieved from the metadata page.

The new service and metadata page seem easy to use for most casual users, as long as they request a single parquet file.

For CLI use, the parameters to the new API are given in the body of the request, so a specially crafted curl invocation is needed. The request is complex enough that this package is not obsolete for casual CLI usage.

Here is an example from the documentation:

import requests

apiUrl = "https://eeadmz1-downloads-api-appservice.azurewebsites.net/"
endpoint = "ParquetFile"
downloadPath = "localPath\\"
fileName = "download_data.zip"

# Request body
request_body = {
    "countries": ["ES"],
    "cities": ["Madrid"],
    "properties": [],
    "datasets": [1, 2],
    "source": "Api",
}

# POST request to the API
downloadFile = requests.post(apiUrl + endpoint, json=request_body).content

# Store in local path
with open(downloadPath + fileName, "wb") as output:
    output.write(downloadFile)

The equivalent curl would be:

$ curl -X 'POST' \
  'https://eeadmz1-downloads-api-appservice.azurewebsites.net/ParquetFile' \
  -H 'accept: text/plain' \
  -H 'Content-Type: application/json' \
  -d '{
  "countries": ["ES"],
  "cities": ["Madrid"],
  "properties": [],
  "datasets": [1,2],
  "source": "Api"
}' \
  --output download_data.zip

According to the documentation:

datasets: value of the datasets separated by commas:

1. Unverified data transmitted continuously (Up-To-Date/UTD/E2a) from the beginning of 2023.
2. Verified data (E1a) from 2013 to 2022, reported by countries by 30 September each year for the previous year.
3. Historical Airbase data delivered between 2002 and 2012, before Air Quality Directive 2008/50/EC entered into force.
avaldebe commented 4 weeks ago

@JohnPaton what do you think? Do you want to support the new service?

avaldebe commented 3 weeks ago

Large requests are likely to fail:

$ curl -X 'POST' \
  'https://eeadmz1-downloads-api-appservice.azurewebsites.net/ParquetFile' \
  -H 'accept: */*' \
  -H 'Content-Type: application/json' \
  -d '{
  "countries": [],
  "cities": [],
  "properties": ["SO2", "PM10", "O3", "NO2", "CO", "NO", "PM2.5"],
  "datasets": [1],
  "source": "API"
}' \
  -o download_data.zip
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   213    0    71  100   142      2      4  0:00:35  0:00:30  0:00:05    20
$ cat download_data.zip 
An exception has been raised that is likely due to a transient failure.

Splitting this request by pollutant and country fails with the same error. Requesting the list of file URLs and retrieving each file independently seems to be the only reliable way to handle large requests.
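A minimal sketch of that two-step approach, assuming the ParquetFile/urls entry point accepts the same request body and returns one download URL per line (a header line or blanks, if any, are skipped):

from pathlib import Path

import requests

API = "https://eeadmz1-downloads-api-appservice.azurewebsites.net/"
request_body = {
    "countries": ["ES"],
    "cities": ["Madrid"],
    "properties": [],
    "datasets": [1, 2],
    "source": "Api",
}

# Ask for the list of parquet file URLs instead of a single (size-limited) zip
urls = requests.post(API + "ParquetFile/urls", json=request_body).text.splitlines()

outdir = Path("downloads")
outdir.mkdir(exist_ok=True)
for url in urls:
    if not url.startswith("http"):  # skip a header line or blanks, if any
        continue
    path = outdir / url.rsplit("/", 1)[-1]
    path.write_bytes(requests.get(url).content)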

JohnPaton commented 3 weeks ago

Data extracts are limited to 300MB. If more is needed, please use the "List of URLs" checkbox to download the data afterwards.

Since we're mostly targeting bulk downloads, it indeed seems that providing a wrapper that retrieves the list of URLs and grabs the files will be the way to go. I'd suggest we release a new version that supports the new API and emits warnings for usage of the old one, with links to the new service and new downloader.

Luckily the list-of-URLs is a similar setup so it shouldn't be too much work to convert.
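The warning for the old API could be a thin shim over the existing downloader; a sketch, with a hypothetical function name:

import warnings

def download(*args, **kwargs):
    # Hypothetical wrapper; name and signature are illustrative only
    warnings.warn(
        "The AirQualityExport service is discontinued at the end of 2024; "
        "use the new Parquet download API instead.",
        FutureWarning,
        stacklevel=2,
    )
    ...  # delegate to the existing implementation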

JohnPaton commented 3 weeks ago

Do you know if the new data is equivalent (aside from the file format)?

JohnPaton commented 3 weeks ago

I'm actually not able to make the new service work at all; all I'm getting is 500s for any combination of parameters I try.

avaldebe commented 3 weeks ago

Do you know if the new data is equivalent (aside from the file format)?

$ pqi schema SP_28092005_7_8.parquet
Samplingpoint: string
Pollutant: int32
Start: timestamp[ns] not null
End: timestamp[ns] not null
Value: decimal128(38, 18)
Unit: string
AggType: string
Validity: int32 not null
Verification: int32 not null
ResultTime: timestamp[ns] not null
DataCapture: decimal128(38, 18)
FkObservationLog: string

Chapter 3 of the docs describes the parquet schema in more detail. However, this very important detail is only mentioned on the service page:

The measurement start and end times indicated in the Parquet files for hourly data and variable (var) measurements are converted to the UTC+1 time zone; daily values are instead delivered in the time zone reported by the countries.

Also note that the Value column does not support null/NaN values, so one needs to filter out invalid observations using the Validity column.
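
A minimal sketch covering both caveats, assuming pandas with a parquet engine, and assuming that positive Validity codes mark valid observations in the EEA vocabulary:

import pandas as pd

df = pd.read_parquet("SP_28092005_7_8.parquet")

# Drop invalid observations, since Value itself carries no null/NaN marker
# (assumption: positive Validity codes mean "valid")
df = df[df["Validity"] > 0]

# Hourly Start/End timestamps are delivered in UTC+1 (see the note above);
# Etc/GMT-1 is UTC+1 under the inverted POSIX sign convention
df["Start"] = df["Start"].dt.tz_localize("Etc/GMT-1")
df["End"] = df["End"].dt.tz_localize("Etc/GMT-1")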

avaldebe commented 3 weeks ago

I'm actually not able to make the new service work at all; all I'm getting is 500s for any combination of parameters I try.

It goes down now and then. It seems that the back-end does not limit the size of requests, and breaks down or times out after a large one.

It is possible to check the size of a request using the DownloadSummary entry point, like this:

$ curl -X 'POST' \
  'https://eeadmz1-downloads-api-appservice.azurewebsites.net/DownloadSummary' \
  -H 'accept: text/plain' \
  -H 'Content-Type: application/json' \
  -d '{"countries": ["NO"], "cities": ["Oslo"], "properties": ["NO2"], "datasets": [1], "source": "API"}'
{"numberFiles":68,"size":25}

But it will time out for a large enough request:

$ curl -X 'POST' \
  'https://eeadmz1-downloads-api-appservice.azurewebsites.net/DownloadSummary' \
  -H 'accept: text/plain' \
  -H 'Content-Type: application/json' \
  -d '{ "countries": [], "cities": [], "properties": [], "datasets": [1, 2, 3], "source": "API"}'
<html><head><title>500 - The request timed out.</title></head><body>  <font color ="#aa0000">         <h2>500 - The request timed out.</h2></font>  The web server failed to respond within the specified time.</body></html>

I had similar experiences with the ParquetFile/urls entry point, with error messages like

An exception has been raised that is likely due to a transient failure.

and

An error occurred while writing to logger(s). (Exception of type 'System.OutOfMemoryException' was thrown.)
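
Any wrapper will probably need retry logic around these transient failures; a minimal sketch, assuming a simple exponential backoff on 5xx responses is an acceptable policy:

import time

import requests

API = "https://eeadmz1-downloads-api-appservice.azurewebsites.net/"

def post_with_retries(endpoint: str, request_body: dict, attempts: int = 5) -> bytes:
    delay = 1.0
    for attempt in range(attempts):
        response = requests.post(API + endpoint, json=request_body, timeout=300)
        if response.status_code < 500:
            return response.content
        if attempt < attempts - 1:
            time.sleep(delay)  # back off before retrying a 5xx
            delay *= 2
    response.raise_for_status()  # all attempts failed; surface the last 5xx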
avaldebe commented 3 weeks ago

Data extracts are limited to 300MB. If more is needed, please use the "List of URLs" checkbox to download the data afterwards.

Since we're mostly targeting bulk downloads, it indeed seems that providing a wrapper that retrieves the list of URLs and grabs the files will be the way to go. I'd suggest we release a new version that supports the new API and emits warnings for usage of the old one, with links to the new service and new downloader.

Luckily the list-of-URLs is a similar setup so it shouldn't be too much work to convert.

I'll prepare a PR this week. I would like to remove the deprecated CLI subcommands and offer the 3 distinct datasets as separate subcommands (e.g. airbase, E1a and E2a, or historical, verified and unverified), and maybe rename the CLI to aq-download or something similar.

  • Historical Airbase data delivered between 2002 and 2012, before Air Quality Directive 2008/50/EC entered into force.
  • Verified data (E1a) from 2013 to 2022, reported by countries by 30 September each year for the previous year.
  • Unverified data transmitted continuously (Up To Date/UTD/E2a) from the beginning of 2023.
JohnPaton commented 3 weeks ago

Great. If the data is equivalent, we could maybe do the following:

avaldebe commented 1 week ago

With a bit of experimentation I managed to get the station metadata with the following command:

wget 'https://discomap.eea.europa.eu/App/AQViewer/download?fqn=Airquality_Dissem.b2g.measurements&f=csv' \
    -O DataExtract.csv.zip

The metadata page issues a more complicated POST request, but an empty POST request works:

curl -X 'POST' \
  'https://discomap.eea.europa.eu/App/AQViewer/download?fqn=Airquality_Dissem.b2g.measurements&f=csv' \
  -d '{}' \
  -o DataExtract.csv.zip

and a plain GET request also works

curl -X 'GET' \
  'https://discomap.eea.europa.eu/App/AQViewer/download?fqn=Airquality_Dissem.b2g.measurements&f=csv' \
  -o DataExtract.csv.zip
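
For completeness, the same metadata download as a Python sketch, reusing the fqn and f parameters from the requests above:

import requests

URL = "https://discomap.eea.europa.eu/App/AQViewer/download"
params = {"fqn": "Airquality_Dissem.b2g.measurements", "f": "csv"}

# A plain GET works, as noted above; the response is a zipped CSV
with open("DataExtract.csv.zip", "wb") as file:
    file.write(requests.get(URL, params=params).content)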