cloudflare / cloudflare-gcp

Google Cloud Function to push JSON files from Google Cloud Storage to BigQuery
Apache License 2.0

Support multiple zone directories in single bucket, additional PubSub mechanism to provide near-realtime file detection and pipelining #85

Open · gabenp opened this pull request 3 years ago

gabenp commented 3 years ago

We had also recently run into the issue that was fixed in #70.

However, the refactoring of the code as part of that change broke our particular setup, in which multiple Cloudflare zones write their logs into per-zone subdirectories of a single GCS bucket.

Given the refactoring in #70, we would have had to either split the logs into different buckets or deploy one function per subdirectory. Doing either of those would have re-introduced the BigQuery quota issue, since we have enough zones that we would have exceeded the limit even on a per-minute schedule per zone.

What I ended up doing here was re-introducing the "old" methodology of a Cloud Function triggered by the GCS storage bucket finalize event, but instead of loading directly, we publish the detected filename to an additional PubSub topic so we can then separately pull all the newly detected files and load them into BigQuery immediately.
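Roughly, the trigger half looks like this (a simplified sketch to show the idea, not the exact code in this PR; the topic name, function name, and message shape are placeholders):

```js
// Hypothetical sketch of the finalize-triggered publisher
const { PubSub } = require('@google-cloud/pubsub');

const pubsub = new PubSub();
const TOPIC = process.env.NEW_FILE_TOPIC || 'gcs-new-log-files'; // placeholder topic name

// Background Cloud Function triggered by google.storage.object.finalize on the
// logs bucket. It only forwards the object reference; the actual BigQuery load
// happens in a separate, PubSub-triggered function.
exports.publishNewLogFile = async (file) => {
  await pubsub.topic(TOPIC).publishMessage({
    json: { bucket: file.bucket, name: file.name },
  });
  console.log(`queued gs://${file.bucket}/${file.name} for loading`);
};
```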

This also removes the (in my opinion, janky) deadlineDate/deadlineDt logic that was introduced in that change.

This code may not be perfect for all use cases, but I wanted to open this PR as an example for others, and perhaps with some tweaks we could make this the standard setup and merge it into master?

Overall, I think the "lookback" deadline time logic is not ideal, and a more realtime "pipeline" using separate Cloud Functions and PubSub seems more logical to me.
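The downstream half of that pipeline, a PubSub-triggered function that loads each detected file, looks roughly like this (again a simplified sketch, not the code in this PR; the dataset and table names are placeholders):

```js
// Hypothetical sketch of the PubSub-triggered loader
const { BigQuery } = require('@google-cloud/bigquery');
const { Storage } = require('@google-cloud/storage');

const bigquery = new BigQuery();
const storage = new Storage();

// Background Cloud Function subscribed to the "new log file" topic.
exports.loadLogFile = async (message) => {
  // The publisher sent { bucket, name } as a JSON body (base64-encoded by PubSub).
  const { bucket, name } = JSON.parse(Buffer.from(message.data, 'base64').toString());

  // Load the newline-delimited JSON log file straight into BigQuery.
  await bigquery
    .dataset('cloudflare')      // placeholder dataset name
    .table('http_requests')     // placeholder table name
    .load(storage.bucket(bucket).file(name), {
      sourceFormat: 'NEWLINE_DELIMITED_JSON',
      writeDisposition: 'WRITE_APPEND',
    });
};
```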

Basic Overview of differences:

Also:

shagamemnon commented 3 years ago

Hey @gpacuilla thanks for the contribution! Can you please remove all of the semi-colons added by your linter and make a new commit?

shagamemnon commented 3 years ago

On initial review, the PR you've provided seems too specific to your use case and adds complexity for the majority of deployments.

That said: reading your PR comments, I do agree that detecting subdirectories is an unmet need in the current integration. However, I strongly disagree that using a deadline time is "janky", as it is a computationally-efficient solution that guarantees a) that the BigQuery quota will never be exceeded, and b) that all logs will make it into BigQuery. Of course, this solution relies on the assumption that this cloud function will respond to changes in a single bucket, for a single Cloudflare zone.

Why do you feel it is a better solution to buffer every filename into a PubSub queue? It seems to me that a better objective would be to find a way to query all relevant directories in the bucket.

gabenp commented 3 years ago

Hey - thanks for taking a look!

Apologies if my wording caused offense; that was not my intent. We've used this project for some time with great success, and we really do appreciate all the work that has gone into it.

At first I did try getting a list of the "prefixes" (sub-directories) from the apiResponse via the getFiles callback as seen here: https://googleapis.dev/nodejs/storage/latest/global.html#GetFilesCallback

> Objects whose names, aside from the prefix, contain delimiter will have their name truncated after the delimiter, returned in apiResponse.prefixes.

That worked OK and returned an array of the subdirectories in the bucket, so it does seem possible to iterate over those subdirectories with the same methodology currently used in master.
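Roughly, what I experimented with looked like this (simplified; the bucket name is a placeholder):

```js
const { Storage } = require('@google-cloud/storage');

const storage = new Storage();

// List the top-level "subdirectories" (zone prefixes) of the logs bucket by
// requesting a delimited listing and reading apiResponse.prefixes.
async function listZonePrefixes(bucketName) {
  const [, , apiResponse] = await storage
    .bucket(bucketName)
    .getFiles({ delimiter: '/', autoPaginate: false });
  return apiResponse.prefixes || []; // e.g. ['zone-a.example.com/', 'zone-b.example.com/']
}

listZonePrefixes('my-cloudflare-logs-bucket').then(console.log); // placeholder bucket name
```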

However, to clarify what I meant by "janky": the current code seems inflexible and makes some assumptions which I think could be wrong in certain situations.

I do agree that this adds complexity compared to master, but re-introducing the previous method of triggering on storage.objects.finalize seems more consistent than the arbitrary timestamp logic, not to mention much more immediate than the 15-minute look-behind currently done.

Granted, backfilling data is not perfect, especially with a BigQuery table partitioned by ingestion time, but EdgeStartTimestamp and EdgeEndTimestamp should still be accurate even on backfilled data, and we have queries that use those timestamp fields for range selection (as opposed to _PARTITIONTIME).
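For illustration, a range query on those fields looks something like this (the project/dataset/table names are placeholders, and this assumes EdgeStartTimestamp is stored as a TIMESTAMP column):

```js
const { BigQuery } = require('@google-cloud/bigquery');

const bigquery = new BigQuery();

// Select a time range by the request's own timestamps rather than _PARTITIONTIME,
// so rows backfilled into a "late" ingestion partition are still matched.
async function logsBetween(start, end) {
  const [rows] = await bigquery.query({
    query: `
      SELECT *
      FROM \`my-project.cloudflare.http_requests\`
      WHERE EdgeStartTimestamp BETWEEN @start AND @end`,
    params: { start, end },
  });
  return rows;
}

// Usage: logsBetween(new Date('2021-01-01T00:00:00Z'), new Date('2021-01-01T01:00:00Z'))
```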

I hope my reasoning here makes sense, and I appreciate your further consideration of this.

kbroughton commented 3 years ago

> However, I strongly disagree that using a deadline time is "janky", as it is a computationally-efficient solution that guarantees a) that the BigQuery quota will never be exceeded, and b) that all logs will make it into BigQuery. Of course, this solution relies on the assumption that this cloud function will respond to changes in a single bucket, for a single Cloudflare zone.

My two cents: since there was no documentation of why the deadline time was used the way it was, it did seem "janky" to me, in that it would have taken a lot of digging into how all the Google libraries work to reverse-engineer why it was done that way. I built a bit of a subdirectory crawler myself before I saw this PR, but I'd say the additional architectural complexity would be justified for me, because the queue-based approach is easier to follow than the undocumented deadline-time logic.

shagamemnon commented 3 years ago

I acknowledge that the rationale was not documented, and I agree there are some architectural advantages to using a queue instead of a scheduler. However, this was the solution we settled on, and we don't have the resources available to consider rearchitecting it right now. You are welcome to continue working on/using a fork, but I don't want to make any assurances that we will integrate this proposal.