flock-lab / flock

Flock: A Low-Cost Streaming Query Engine on FaaS Platforms
https://flock-lab.github.io/flock/
GNU Affero General Public License v3.0

Data Source #35

Closed by gangliao 3 years ago

gangliao commented 3 years ago
  1. New York City Taxi & Limousine Commission Trip Record Data [Link]

The yellow and green taxi trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts. The data is provided in CSV format.

  2. US Accidents (3.5 million records) [Link]

This is a countrywide car accident dataset that covers 49 states of the USA. The accident data were collected from February 2016 to June 2020, using two APIs that provide streaming traffic incident (or event) data. These APIs broadcast traffic data captured by a variety of entities, such as the US and state departments of transportation, law enforcement agencies, traffic cameras, and traffic sensors within the road networks. Currently, there are about 3.5 million accident records in this dataset.

The customer applies a continuous filter to retain only the records of interest.
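A minimal Python sketch of such a continuous filter, to make the idea concrete. The field names `State` and `Severity`, the sample records, and the predicate are assumptions made for illustration; they are not taken from the dataset description above.

```python
from typing import Dict, Iterable, Iterator

Record = Dict[str, str]


def continuous_filter(
    stream: Iterable[Record], state: str, min_severity: int
) -> Iterator[Record]:
    """Yield only the records of interest, one at a time, as they arrive."""
    for record in stream:
        # Hypothetical predicate: keep accidents in one state at or above
        # a severity threshold.
        if record["State"] == state and int(record["Severity"]) >= min_severity:
            yield record


# Hypothetical sample records standing in for the incoming stream.
incoming = [
    {"ID": "A-1", "State": "MD", "Severity": "3"},
    {"ID": "A-2", "State": "CA", "Severity": "4"},
    {"ID": "A-3", "State": "MD", "Severity": "1"},
]

kept = [r["ID"] for r in continuous_filter(incoming, state="MD", min_severity=3)]
```

Because the filter is a generator, it keeps no state between records, which is why this kind of query is cheap to run on short-lived functions.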

gangliao commented 3 years ago
  1. City of Baltimore Crime Data [Link]

CrimeDate | CrimeTime | CrimeCode | Location | Description | Inside/Outside | Weapon | Post | District | Neighborhood | Longitude | Latitude | Location 1 | Premise | vri_name1 | Total Incidents

We can run a CQ every hour.
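One way an hourly CQ over these columns could look is a count of incidents per hour and district. The sketch below is only an illustration: the date and time formats (`MM/DD/YYYY` and `HH:MM:SS`) and the sample rows are assumptions, and each call to the hypothetical `hourly_counts` stands in for one hourly run.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, Iterable


def hourly_counts(rows: Iterable[Dict[str, str]]) -> Counter:
    """Count incidents per (hour, District) bucket."""
    counts: Counter = Counter()
    for row in rows:
        # Assumed formats for CrimeDate and CrimeTime.
        ts = datetime.strptime(
            f"{row['CrimeDate']} {row['CrimeTime']}", "%m/%d/%Y %H:%M:%S"
        )
        # Truncate the timestamp to the start of its hour.
        hour = ts.replace(minute=0, second=0)
        counts[(hour, row["District"])] += 1
    return counts


# Hypothetical sample rows using column names from the list above.
rows = [
    {"CrimeDate": "01/01/2020", "CrimeTime": "10:05:00", "District": "CENTRAL"},
    {"CrimeDate": "01/01/2020", "CrimeTime": "10:45:00", "District": "CENTRAL"},
    {"CrimeDate": "01/01/2020", "CrimeTime": "11:10:00", "District": "NORTHERN"},
]

counts = hourly_counts(rows)
```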

gangliao commented 3 years ago
  1. E-Commerce Data [Link]

Typically, e-commerce datasets are proprietary and consequently hard to find among publicly available data. However, the UCI Machine Learning Repository has made available this dataset, which contains actual transactions from 2010 and 2011. The dataset is maintained on their site, where it can be found by the title "Online Retail".

CQ Example:

  1. A customer uses a SQL query to compute a 1-minute, sliding-window sum of items sold in online shopping transactions captured in the stream.

  2. group by

select stream productId,
 floor(rowtime to hour) as rowtime,
 sum(units) as u,
 count(*) as c
from Orders
group by productId,
 floor(rowtime to hour)

The "pie chart" problem:

select productId, count(*)
from Orders
where rowtime > current_timestamp - interval '1' hour
group by productId

  3. filter

select stream *
from Orders
where units > 1000
  4. join

Streams can be joined if the join condition forces them into “lock step” within a window (in this case, 1 hour).

select stream *
from Orders as o
join Shipments as s
on o.productId = s.productId
and s.rowtime
 between o.rowtime
 and o.rowtime + interval '1' hour
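The semantics of the windowed join can be sketched in Python as follows. This is a simplified illustration over finite lists, not how a streaming engine would execute it; the `Order` and `Shipment` record layouts and the sample values are assumptions based only on the columns used in the queries above.

```python
from datetime import datetime, timedelta
from typing import Iterable, Iterator, NamedTuple, Tuple


class Order(NamedTuple):
    rowtime: datetime
    productId: int
    units: int


class Shipment(NamedTuple):
    rowtime: datetime
    productId: int


def window_join(
    orders: Iterable[Order],
    shipments: Iterable[Shipment],
    window: timedelta = timedelta(hours=1),
) -> Iterator[Tuple[Order, Shipment]]:
    """Pair each order with shipments of the same product that occur
    between the order's rowtime and rowtime + window (inclusive)."""
    shipments = list(shipments)
    for o in orders:
        for s in shipments:
            same_product = o.productId == s.productId
            in_window = o.rowtime <= s.rowtime <= o.rowtime + window
            if same_product and in_window:
                yield o, s


# Hypothetical sample data.
t0 = datetime(2021, 1, 1, 12, 0, 0)
orders = [Order(t0, 1, 5), Order(t0, 2, 3)]
shipments = [
    Shipment(t0 + timedelta(minutes=30), 1),  # inside the 1-hour window
    Shipment(t0 + timedelta(hours=2), 2),     # outside the window
]

joined = list(window_join(orders, shipments))
```

The window is what keeps the join practical on a stream: an order older than one hour can no longer match any future shipment, so an engine can discard it instead of buffering it indefinitely.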

Reference

  1. Julian Hyde. Streaming SQL. https://www-conf.slac.stanford.edu/xldb2016/talks/published/Weds_9_Hyde_calcite-streaming-sql-xldb-2016.pdf

gangliao commented 3 years ago
  1. Azure Functions Trace 2019 [Link]

CQ Example: Streaming ETL

A customer uses a CQ to continuously transform log data and deliver it to object storage. The log data is transformed using several operators, including applying a schema to the different log events, partitioning data by event type, sorting data by timestamp, and buffering data for one hour prior to delivery. The application has many transformation steps, but none are computationally intensive.
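The transformation steps described above can be sketched in Python as follows. The comma-separated log line layout, the `SCHEMA` fields, and the function names are hypothetical and do not reflect the actual Azure Functions trace format; delivery to object storage is left out.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical schema: field name -> parser for that field.
SCHEMA = {"timestamp": int, "event_type": str, "payload": str}

Event = Dict[str, object]


def apply_schema(raw: str) -> Event:
    """Parse one raw log line of the assumed form 'timestamp,event_type,payload'."""
    parts = raw.split(",", len(SCHEMA) - 1)
    return {name: parse(value) for (name, parse), value in zip(SCHEMA.items(), parts)}


def build_batches(
    lines: Iterable[str], buffer_seconds: int = 3600
) -> Dict[Tuple[int, str], List[Event]]:
    """Buffer events into one-hour windows, partitioned by event type,
    with each batch sorted by timestamp."""
    batches: Dict[Tuple[int, str], List[Event]] = defaultdict(list)
    for line in lines:
        event = apply_schema(line)
        window = event["timestamp"] // buffer_seconds  # index of the hour bucket
        batches[(window, event["event_type"])].append(event)
    for events in batches.values():
        events.sort(key=lambda e: e["timestamp"])
    return dict(batches)


# Hypothetical sample log lines (timestamps in seconds).
lines = ["3700,invoke,c", "10,invoke,a", "5,invoke,b", "20,error,d"]
batches = build_batches(lines)
```

Each `(window, event_type)` batch would then be written out as one object once its hour has passed. None of the steps is more than a parse, a dictionary lookup, or a small sort, which matches the observation that the pipeline has many steps but little computation.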