abossard / api-to-parquet

Time Series Data Processor

This is a Go application that processes time series data and saves it to a Parquet file. The application uses a cache to store previously processed data and avoid duplicate writes to the Parquet file.

Important: For data ingestion, the file path must contain the data source, YEAR, MONTH, DAY and HOUR, in that order. The filename itself can be anything, but it must end in .parquet. Valid path: <data source>/2023/10/26/19/2023-10-26-19-123e4567-e89b-12d3-a456-426614174000.parquet. This layout keeps lookups efficient.
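The path layout can be checked on the client side before a file name is sent to the API. The following is a minimal sketch, not the application's own validation: the function name validIngestPath is hypothetical, and it assumes that the data source is a single path segment and that month, day and hour are zero-padded.

```go
package main

import (
	"fmt"
	"regexp"
)

// ingestPathPattern encodes the layout described above:
// <data source>/YEAR/MONTH/DAY/HOUR/<anything>.parquet
// Assumption: the data source is one path segment and the date parts are zero-padded.
var ingestPathPattern = regexp.MustCompile(
	`^[^/]+/\d{4}/(0[1-9]|1[0-2])/(0[1-9]|[12]\d|3[01])/([01]\d|2[0-3])/[^/]+\.parquet$`,
)

// validIngestPath reports whether path follows the expected layout.
func validIngestPath(path string) bool {
	return ingestPathPattern.MatchString(path)
}

func main() {
	// Full layout with an arbitrary file name: accepted.
	fmt.Println(validIngestPath("factory-1/2023/10/26/19/2023-10-26-19-123e4567-e89b-12d3-a456-426614174000.parquet"))
	// HOUR segment missing: rejected.
	fmt.Println(validIngestPath("factory-1/2023/10/26/data.parquet"))
}
```

The pattern only checks the shape of the path. It does not verify that the date exists (for example, it would accept 2023/02/31).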

Roadmap

Deployment instructions

Deploy To Azure

Visualize

After the deployment has finished, the deployment output shows the public URL with the API key, as well as the static IP of the container app environment.

For more detailed instructions and other scenarios check: setup.azcli

Shared Access Signature (SAS) Token

By default, the deployment adds an API key, or shared access signature, to protect the API. After the deployment, the key is available in the apiKey output.

To use the key in a request, add it to the URL as the query parameter key, e.g.:

GET https://api.mydomain.com/?key=abcdef0123456789 HTTP/1.1

GET: How to get the lastTimeGenerated and maxTimestamp

GET https://api.mydomain.com/?key=abcdef0123456789 HTTP/1.1

The response contains the lastTimeGenerated property as well as maxTimestamp, which is the highest timestamp that has ever been sent to the API.
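For callers written in Go, the same request could look roughly like the sketch below. It is an illustration under assumptions: the JSON types of lastTimeGenerated and maxTimestamp are not stated here, so they are kept as raw JSON, and the helper names withKey and fetchState are hypothetical. A local test server with made-up values stands in for the real API so the example runs without a deployment.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/url"
)

// state mirrors the two properties named above. Their JSON types are not
// documented, so they are kept as raw JSON (assumption).
type state struct {
	LastTimeGenerated json.RawMessage `json:"lastTimeGenerated"`
	MaxTimestamp      json.RawMessage `json:"maxTimestamp"`
}

// withKey returns base with the API key set as the "key" query parameter.
func withKey(base, key string) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	q := u.Query()
	q.Set("key", key)
	u.RawQuery = q.Encode()
	return u.String(), nil
}

// fetchState issues the GET request and decodes the response body.
func fetchState(client *http.Client, endpoint string) (state, error) {
	var s state
	resp, err := client.Get(endpoint)
	if err != nil {
		return s, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return s, fmt.Errorf("unexpected status %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return s, err
	}
	err = json.Unmarshal(body, &s)
	return s, err
}

func main() {
	// Local stand-in for the API, returning made-up sample values.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, `{"lastTimeGenerated": 1698346800, "maxTimestamp": 1698346799}`)
	}))
	defer srv.Close()

	endpoint, err := withKey(srv.URL+"/", "abcdef0123456789")
	if err != nil {
		panic(err)
	}
	s, err := fetchState(srv.Client(), endpoint)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(s.LastTimeGenerated), string(s.MaxTimestamp))
}
```

Against a real deployment, replace srv.URL with the public URL from the deployment output and pass the key from the apiKey output.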

POST: How to upload new data

(The POST body below uses template variables and is meant to be executed with an HTTP client, e.g. the REST Client extension for VS Code.)

POST https://api.mydomain.com/?key=abcdef0123456789 HTTP/1.1
content-type: application/json

{
    "content": [
        {
            "timestamp": {{$timestamp}},
            "value": {{$randomInt 1 43}},
            "timeOffsetHours": {{$randomInt 1 43}},
            "pointId": "{{$guid}}",
            "sequence": {{$randomInt 1 43}},
            "project": "{{$guid}}",
            "res": "{{$guid}}",
            "quality": {{$randomInt 1 43}}
        },
        {
            "timestamp": {{$timestamp}},
            "value": {{$randomInt 1 43}},
            "timeOffsetHours": {{$randomInt 1 43}},
            "pointId": "{{$guid}}",
            "sequence": {{$randomInt 1 43}},
            "project": "{{$guid}}",
            "res": "{{$guid}}",
            "quality": {{$randomInt 1 43}}
        },
        {
            "timestamp": {{$timestamp}},
            "value": {{$randomInt 1 43}},
            "timeOffsetHours": {{$randomInt 1 43}},
            "pointId": "{{$guid}}",
            "sequence": {{$randomInt 1 43}},
            "project": "{{$guid}}",
            "res": "{{$guid}}",
            "quality": {{$randomInt 1 43}}
        }
    ],
    "file": "<data source>/2023/10/26/19/{{$timestamp}}-{{$guid}}.parquet",
    "timeGenerated": {{$timestamp}},
    "id": "{{$guid}}"
}
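The same body can be built from Go. The sketch below infers the field types from the template above (Unix timestamps in seconds, integers for the numeric fields, strings for the GUIDs); the application's own structs may differ, and the type and function names are hypothetical. Deriving the file path from the timeGenerated value is also an assumption made for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// point is one entry of the "content" array (types inferred from the template).
type point struct {
	Timestamp       int64  `json:"timestamp"`
	Value           int    `json:"value"`
	TimeOffsetHours int    `json:"timeOffsetHours"`
	PointID         string `json:"pointId"`
	Sequence        int    `json:"sequence"`
	Project         string `json:"project"`
	Res             string `json:"res"`
	Quality         int    `json:"quality"`
}

// upload is the full POST body.
type upload struct {
	Content       []point `json:"content"`
	File          string  `json:"file"`
	TimeGenerated int64   `json:"timeGenerated"`
	ID            string  `json:"id"`
}

// filePath builds <data source>/YEAR/MONTH/DAY/HOUR/<timestamp>-<id>.parquet
// from a Unix timestamp in seconds, using UTC.
func filePath(source string, unixSeconds int64, id string) string {
	hour := time.Unix(unixSeconds, 0).UTC().Format("2006/01/02/15")
	return fmt.Sprintf("%s/%s/%d-%s.parquet", source, hour, unixSeconds, id)
}

// encode serializes the upload to the JSON sent in the request body.
func encode(u upload) (string, error) {
	b, err := json.Marshal(u)
	return string(b), err
}

func main() {
	const now = int64(1698346800) // 2023-10-26 19:00:00 UTC
	id := "123e4567-e89b-12d3-a456-426614174000"
	u := upload{
		Content: []point{{
			Timestamp: now, Value: 21, TimeOffsetHours: 2, PointID: id,
			Sequence: 1, Project: id, Res: id, Quality: 1,
		}},
		File:          filePath("factory-1", now, id),
		TimeGenerated: now,
		ID:            id,
	}
	body, err := encode(u)
	if err != nil {
		panic(err)
	}
	fmt.Println(body)
}
```

The resulting string can be sent with http.Post and the content type application/json, with the key appended to the URL as shown earlier.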

Example Synapse Query

SELECT TOP 100 *
FROM
    OPENROWSET(
        BULK 'https://ACCOUNTNAME.blob.core.windows.net/CONTAINERNAME/factory-1/2023/10/26/19/*.parquet',
        FORMAT='PARQUET'
    ) AS data

File Structure

The file structure of the project is as follows:

.
├── cache.go
├── go.mod
├── go.sum
├── main.go
├── README.md
└── time_series_data_processor_test.go

Usage

To use the application, you need to set the following environment variables:

Once you have set the environment variables, you can run the application using the following command:

cd src
go run .

The application will process the time series data and save it to a Parquet file. The cache will be used to avoid duplicate writes to the Parquet file.
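The idea of a cache that prevents duplicate writes can be illustrated with a small in-memory set. This is only a sketch of the general technique, not the contents of cache.go: the type seenCache, its methods, and the choice of a string key (for example the id of an upload) are assumptions.

```go
package main

import (
	"fmt"
	"sync"
)

// seenCache remembers which keys have already been written.
// The mutex makes it safe to use from concurrent request handlers.
type seenCache struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func newSeenCache() *seenCache {
	return &seenCache{seen: make(map[string]struct{})}
}

// markIfNew records key and reports true only the first time it is seen.
func (c *seenCache) markIfNew(key string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, ok := c.seen[key]; ok {
		return false
	}
	c.seen[key] = struct{}{}
	return true
}

func main() {
	cache := newSeenCache()
	for _, id := range []string{"a", "b", "a"} {
		if cache.markIfNew(id) {
			fmt.Println("write", id)
		} else {
			fmt.Println("skip duplicate", id)
		}
	}
}
```

A set like this grows without bound and is lost on restart, so a long-running service would typically add eviction or persistence.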

API Security

The API can be configured to require a secret token, which callers provide as the key query parameter in the URL. To enable this, set the environment variable REQUIRE_API_KEY to the secret token. If this environment variable is not set, the API does not require a token.
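A check of this kind is commonly written as HTTP middleware. The sketch below shows one possible shape and is not taken from main.go: the names requireKey and statusFor are hypothetical, and responding with 401 Unauthorized is an assumption about the behavior.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// requireKey wraps next and rejects requests whose "key" query parameter
// does not match secret. An empty secret disables the check.
func requireKey(secret string, next http.Handler) http.Handler {
	if secret == "" {
		return next
	}
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		provided := r.URL.Query().Get("key")
		// Constant-time comparison avoids leaking the secret through timing.
		if subtle.ConstantTimeCompare([]byte(provided), []byte(secret)) != 1 {
			http.Error(w, "invalid or missing key", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// statusFor runs one GET request through the middleware without a network
// listener and returns the resulting status code.
func statusFor(secret, target string) int {
	ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, target, nil)
	requireKey(secret, ok).ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	// In a real service the secret would come from os.Getenv("REQUIRE_API_KEY").
	fmt.Println(statusFor("s3cret", "/?key=s3cret")) // matching key
	fmt.Println(statusFor("s3cret", "/?key=wrong"))  // wrong key
	fmt.Println(statusFor("", "/"))                  // check disabled
}
```

Keys in query strings tend to end up in access logs and browser history, which is one reason to combine this check with the platform-level authentication mentioned below.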

In addition, you can enable OIDC at the Azure Container App level.

Authentication to Azure Storage

The application uses the Azure SDK for Go to authenticate with Azure Storage. The SDK authenticates through its default credential chain (DefaultAzureCredential). For example, when running locally it can use the Azure CLI login, and when running in Azure it uses Managed Identity. See the Azure SDK for Go documentation for more information.

License

This project is licensed under the MIT License - see the LICENSE file for details.