cloudflare / cloudflare-gcp

Google Cloud Function to push json files from GC Storage to Big Query
Apache License 2.0
84 stars 49 forks source link

Data Quality Tolerance? #95

Open FStephenQuaratiello opened 2 years ago

FStephenQuaratiello commented 2 years ago

Hi,

I've been noticing a slight (~1%) discrepancy between the number of records imported to BigQuery with this tool, and the number of requests reported by the Cloudflare GraphQL API for a given time period. For example, the GraphQL API reports 46,532 requests in a given hour, but in BigQuery, there are only 45,736 records with an EdgeStartTimestamp in that hour. A small difference, to be sure, but a noticeable one.

Is this within expectations? And is there a better way to measure the health/quality of data imported by this tool?

shagamemnon commented 2 years ago

Hey @FStephenQuaratiello would you mind providing the GraphQL query and the BigQuery SQL query that you ran so we can investigate further?

FStephenQuaratiello commented 2 years ago

Sure thing:

BigQuery query:'''SELECT COUNT(*), EXTRACT(HOUR from EdgeStartTimestamp) AS hour FROM [TABLE] WHERE EdgeStartTimestamp > TIMESTAMP(DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)) AND EdgeStartTimestamp < TIMESTAMP(CURRENT_DATE()) AND ARRAY_TO_STRING(ARRAY_REVERSE([ ARRAY_REVERSE(SPLIT(ClientRequestHost, "."))[ORDINAL(1)], ARRAY_REVERSE(SPLIT(ClientRequestHost, "."))[ORDINAL(2)] ]), ".") = '%s' GROUP BY hour ORDER BY hour'''

GraphQL query: """ query { viewer { zones(filter: {zoneTag: "%s"} ) { httpRequests1hGroups(limit:24, filter:{date: "%s"}) { sum { requests } dimensions { datetime } } } } } """

victor-perov commented 2 years ago

Hey @FStephenQuaratiello! Thanks for sharing your queries. I'm not particularly familiar with this tool, but I can help with a GQL part.

httpRequests1hGroups represents hourly aggregated eyeball requests. Therefore, if you want to compare with other sources, you should also make sure that you're counting eyeball requests. On top of it, if you're using "today" in your query, then I would expect that the last hour would not be "full" because the aggregation nature assumes buffering.