cloudflare / cloudflare-gcp

Google Cloud Function to push JSON files from Google Cloud Storage to BigQuery
Apache License 2.0

Support multiple zone directories in single bucket, additional PubSub mechanism to provide near-realtime file detection and pipelining #85

Open · gabenp opened this pull request 3 years ago

gabenp commented 3 years ago

We had also recently run into the issue that was fixed in #70.

However, the refactoring of the code as part of that change broke our particular setup, in which multiple Cloudflare zones write their logs into per-zone subdirectories of a single GCS bucket.

Given the refactoring in #70, we would have had to either split the logs into different buckets or deploy one function per subdirectory. Doing either of those would have re-introduced the BigQuery quota issue, since we have enough zones that we would have exceeded the limit even on a per-minute schedule per zone.

What I ended up doing here was re-introducing the "old" methodology of a Cloud Function triggered by the GCS storage bucket finalize event, but instead of loading directly, we publish the detected filename to an additional PubSub topic so we can then separately pull all the newly detected files and load them into BigQuery immediately.
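Roughly, the trigger half looks like this (a simplified sketch to show the idea, not the exact code in this PR; the topic name, function name, and message shape are placeholders):

```js
// Hypothetical sketch of the finalize-triggered publisher
const { PubSub } = require('@google-cloud/pubsub');

const pubsub = new PubSub();
const TOPIC = process.env.NEW_FILE_TOPIC || 'gcs-new-log-files'; // placeholder topic name

// Background Cloud Function triggered by google.storage.object.finalize on the
// logs bucket. It only forwards the object reference; the actual BigQuery load
// happens in a separate, PubSub-triggered function.
exports.publishNewLogFile = async (file) => {
  await pubsub.topic(TOPIC).publishMessage({
    json: { bucket: file.bucket, name: file.name },
  });
  console.log(`queued gs://${file.bucket}/${file.name} for loading`);
};
```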

This also removes the (in my opinion, janky) deadlineDate/deadlineDt logic that was introduced in that change.

This code may not be perfect for all use cases, but I wanted to open this PR as an example for others, and perhaps with some tweaks we could make this the standard setup and merge it into master?

Overall, I think the "lookback" deadline time logic is not ideal, and a more realtime "pipeline" using separate Cloud Functions and PubSub seems more logical to me.
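The downstream half of that pipeline, a PubSub-triggered function that loads each detected file, looks roughly like this (again a simplified sketch, not the code in this PR; the dataset and table names are placeholders):

```js
// Hypothetical sketch of the PubSub-triggered loader
const { BigQuery } = require('@google-cloud/bigquery');
const { Storage } = require('@google-cloud/storage');

const bigquery = new BigQuery();
const storage = new Storage();

// Background Cloud Function subscribed to the "new log file" topic.
exports.loadLogFile = async (message) => {
  // The publisher sent { bucket, name } as a JSON body (base64-encoded by PubSub).
  const { bucket, name } = JSON.parse(Buffer.from(message.data, 'base64').toString());

  // Load the newline-delimited JSON log file straight into BigQuery.
  await bigquery
    .dataset('cloudflare')      // placeholder dataset name
    .table('http_requests')     // placeholder table name
    .load(storage.bucket(bucket).file(name), {
      sourceFormat: 'NEWLINE_DELIMITED_JSON',
      writeDisposition: 'WRITE_APPEND',
    });
};
```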

Basic Overview of differences:

Also:

shagamemnon commented 3 years ago

Hey @gpacuilla thanks for the contribution! Can you please remove all of the semi-colons added by your linter and make a new commit?

shagamemnon commented 3 years ago

On initial review, the PR you've provided seems too specific to your use case and adds complexity for the majority of deployments.

That said: reading your PR comments, I do agree that detecting subdirectories is an unmet need in the current integration. However, I strongly disagree that using a deadline time is "janky", as it is a computationally-efficient solution that guarantees a) that the BigQuery quota will never be exceeded, and b) that all logs will make it into BigQuery. Of course, this solution relies on the assumption that this cloud function will respond to changes in a single bucket, for a single Cloudflare zone.

Why do you feel it is a better solution to buffer every filename into a PubSub queue? It seems to me that a better objective would be to find a way to query all relevant directories in the bucket.

gabenp commented 3 years ago

Hey - thanks for taking a look!

Apologies if my wording caused offense; that was not my intent. We've used this project for some time with great success, and we really do appreciate all the work that has gone into it.

At first I did try getting a list of the "prefixes" (sub-directories) from the apiResponse via the getFiles callback as seen here: https://googleapis.dev/nodejs/storage/latest/global.html#GetFilesCallback

> Objects whose names, aside from the prefix, contain delimiter will have their name truncated after the delimiter, returned in apiResponse.prefixes.

That worked OK and returned an array of the subdirectories in the bucket, so it does seem possible to iterate over those subdirectories with the same methodology currently used in master.
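Roughly, what I experimented with looked like this (simplified; the bucket name is a placeholder):

```js
const { Storage } = require('@google-cloud/storage');

const storage = new Storage();

// List the top-level "subdirectories" (zone prefixes) of the logs bucket by
// requesting a delimited listing and reading apiResponse.prefixes.
async function listZonePrefixes(bucketName) {
  const [, , apiResponse] = await storage
    .bucket(bucketName)
    .getFiles({ delimiter: '/', autoPaginate: false });
  return apiResponse.prefixes || []; // e.g. ['zone-a.example.com/', 'zone-b.example.com/']
}

listZonePrefixes('my-cloudflare-logs-bucket').then(console.log); // placeholder bucket name
```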

However, to clarify what I meant by "janky": the current code seems inflexible and makes some assumptions which I think could be wrong in certain situations.

I do agree that this adds complexity compared to master, but re-introducing the previous method of triggering on storage.objects.finalize seems more consistent than the arbitrary timestamp logic, not to mention much more immediate than the 15-minute look-behind currently done.

Granted, backfilling data is not perfect, especially with a BigQuery table partitioned by ingestion time, but EdgeStartTimestamp and EdgeEndTimestamp should still be accurate even on backfilled data, and we have queries that use those timestamp fields for range selection (as opposed to _PARTITIONTIME).
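For illustration, a range query on those fields looks something like this (the project/dataset/table names are placeholders, and this assumes EdgeStartTimestamp is stored as a TIMESTAMP column):

```js
const { BigQuery } = require('@google-cloud/bigquery');

const bigquery = new BigQuery();

// Select a time range by the request's own timestamps rather than _PARTITIONTIME,
// so rows backfilled into a "late" ingestion partition are still matched.
async function logsBetween(start, end) {
  const [rows] = await bigquery.query({
    query: `
      SELECT *
      FROM \`my-project.cloudflare.http_requests\`
      WHERE EdgeStartTimestamp BETWEEN @start AND @end`,
    params: { start, end },
  });
  return rows;
}

// Usage: logsBetween(new Date('2021-01-01T00:00:00Z'), new Date('2021-01-01T01:00:00Z'))
```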

I hope my reasoning here makes sense, and I appreciate your further consideration of this.

kbroughton commented 3 years ago

> However, I strongly disagree that using a deadline time is "janky", as it is a computationally-efficient solution that guarantees a) that the BigQuery quota will never be exceeded, and b) that all logs will make it into BigQuery. Of course, this solution relies on the assumption that this cloud function will respond to changes in a single bucket, for a single Cloudflare zone.

My two cents: since there was no documentation of why the deadline time was used the way it was, it did seem "janky" to me, in that it would have taken a lot of digging into how all the Google libraries work to reverse-engineer why it was done that way. I built a bit of a subdirectory crawler myself before I saw this PR, but I'd say the additional architectural complexity would be justified for me, because the queue-based approach is easier to follow than the undocumented deadline-time logic.

shagamemnon commented 3 years ago

I acknowledge that the rationale was not documented, and I agree there are some architectural advantages to using a queue instead of a scheduler. However, this was the solution we settled on, and we don't have the resources available to consider rearchitecting it right now. You are welcome to continue working on/using a fork, but I don't want to make any assurances that we will integrate this proposal.