Open · gabenp opened 3 years ago
Hey @gpacuilla thanks for the contribution! Can you please remove all of the semi-colons added by your linter and make a new commit?
On my initial review, the PR you've provided seems too specific to your use case, and generates additional complexity for the majority of deployments.
That said: reading your PR comments, I do agree that detecting subdirectories is an unmet need in the current integration. However, I strongly disagree that using a deadline time is "janky", as it is a computationally-efficient solution that guarantees a) that the BigQuery quota will never be exceeded, and b) that all logs will make it into BigQuery. Of course, this solution relies on the assumption that this cloud function will respond to changes in a single bucket, for a single Cloudflare zone.
Why do you feel it is a better solution to buffer every filename into a PubSub queue? It seems to me that a better objective would be to find a way to query all relevant directories in the bucket.
Hey - thanks for taking a look!
Apologies if the language used caused offense, which I did not intend. We've used this project for some time with great success and we do really appreciate all the work done here on this.
At first I did try getting a list of the "prefixes" (sub-directories) from the `apiResponse` via the `getFiles` callback, as seen here: https://googleapis.dev/nodejs/storage/latest/global.html#GetFilesCallback

"Objects whose names, aside from the prefix, contain delimiter will have their name truncated after the delimiter, returned in apiResponse.prefixes."

That worked OK and could return an array of the subdirs in the bucket, so it does seem possible to iterate over the subdirs with the same methodology as currently seen in `master`.
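Roughly, what I tried looked like this (just a sketch, not code from this PR; it assumes the `@google-cloud/storage` client already used by the function, and the bucket name and helper name are placeholders):

```js
// Sketch: list top-level "subdirectory" prefixes by passing a delimiter to
// getFiles and reading apiResponse.prefixes from the callback.
const { Storage } = require('@google-cloud/storage');
const storage = new Storage();

function getPrefixes(bucketName) {
  return new Promise((resolve, reject) => {
    storage.bucket(bucketName).getFiles(
      { delimiter: '/', autoPaginate: false },
      (err, files, nextQuery, apiResponse) => {
        if (err) return reject(err);
        // e.g. ['zone-a.example.com/', 'zone-b.example.com/']
        resolve((apiResponse && apiResponse.prefixes) || []);
      }
    );
  });
}
```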
However, to clarify what I meant by "janky": the current code seems inflexible and makes some assumptions which I think could be wrong in certain situations. For example:
1. Cloudflare Logpush can be delayed. I know this has happened at least once in recent memory, where new files were not being pushed for a period of time and were then backfilled. Given the current `master` methodology, we'd be liable to completely miss the backfilled data, since it uses the `every_minute` PubSub trigger's `context.timestamp` (essentially "now").
2. The schema on the BigQuery side could be changed out of sync with the Cloud Function side, or vice versa, causing BigQuery load jobs to fail. At least when using the PubSub file-name topic, it'd be trivial to just re-publish the missed file names to have it re-run the BigQuery load jobs. I was also thinking about better error checking on the `bq.createLoadJob` call so that we can simply not-acknowledge (or re-publish) the file names in case of load errors, but that is not in place currently (see the sketch after this list).
3. It's possible that Cloud Scheduler has issues and does not trigger the function for some time, also causing a range of files to be ignored and not loaded into BigQuery. (I know, "what if this, what if that", but it is still a possible concern.)
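To make that error-handling idea concrete, here is a rough, untested sketch of what I have in mind; it is not part of this PR. The dataset and table names are placeholders, the `cf_logs_files` topic matches the one used in this PR, and it assumes recent versions of the official `@google-cloud/bigquery`, `@google-cloud/storage`, and `@google-cloud/pubsub` clients:

```js
// Sketch only, not in this PR: load one file and re-publish its name on
// failure so a later run can retry it instead of silently dropping it.
const { BigQuery } = require('@google-cloud/bigquery');
const { Storage } = require('@google-cloud/storage');
const { PubSub } = require('@google-cloud/pubsub');

const bq = new BigQuery();
const storage = new Storage();
const pubsub = new PubSub();

async function loadFileOrRepublish(bucketName, fileName) {
  const file = storage.bucket(bucketName).file(fileName);
  try {
    // Table#load() waits for the load job to finish and resolves with the job.
    const [job] = await bq
      .dataset('cloudflare_logs') // placeholder dataset
      .table('logs')              // placeholder table
      .load(file, { sourceFormat: 'NEWLINE_DELIMITED_JSON' });
    const errors = (job.status && job.status.errors) || [];
    if (errors.length > 0) {
      throw new Error(JSON.stringify(errors));
    }
  } catch (err) {
    // Put the file name back on the topic so it can be retried later.
    console.error(`Load failed for ${fileName}, re-publishing`, err);
    await pubsub.topic('cf_logs_files').publishMessage({
      data: Buffer.from(fileName),
    });
  }
}
```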
I do agree that this adds additional complexity compared to `master`, but re-introducing the previous method of triggering on `storage.objects.finalize` seems more consistent than the arbitrary timestamp logic, not to mention much more immediate compared to the 15-minute look-behind currently done.
Granted, all of this talk of backfilling data is not perfect, especially when using a partitioned-by-ingestion-time BigQuery table, but the `EdgeStartTimestamp` and `EdgeEndTimestamp` fields should still be accurate even on backfilled data, and we have queries which use those timestamp fields for range selection (as opposed to `_PARTITIONTIME`).
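As a hypothetical illustration (not one of our actual queries; the project, dataset, and table names are placeholders, and it assumes `EdgeStartTimestamp` is stored as a TIMESTAMP column), range selection on the Cloudflare timestamps rather than the partition column would look like:

```js
// Illustrative only: filter on EdgeStartTimestamp instead of _PARTITIONTIME,
// so backfilled rows still fall inside the queried time window.
const { BigQuery } = require('@google-cloud/bigquery');
const bq = new BigQuery();

async function requestsInLastDay() {
  const query = `
    SELECT COUNT(*) AS requests
    FROM \`my-project.cloudflare_logs.logs\`
    WHERE EdgeStartTimestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)`;
  const [rows] = await bq.query({ query });
  console.log(rows);
}
```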
I hope my reasoning here makes sense to you, and I appreciate your further consideration on this.
"However, I strongly disagree that using a deadline time is "janky", as it is a computationally-efficient solution that guarantees a) that the BigQuery quota will never be exceeded, and b) that all logs will make it into BigQuery. Of course, this solution relies on the assumption that this cloud function will respond to changes in a single bucket, for a single Cloudflare zone."
My two cents is that, since there was no documentation of why the deadline time was being used as it was, it did seem "janky" to me, in that it would have taken a lot of digging into how all the Google libraries work to reverse-engineer why it was done that way. I built a bit of a subdirectory crawler myself before I saw this PR, but I'd certainly say the additional architectural complexity would be justified for me, because it's easier to follow than the deadline-time logic without additional documentation.
I acknowledge that the rationale was not documented. In addition, I agree there are some architectural advantages to using a queue instead of a scheduler. However, this was the solution we settled on, and we don't have the resources available to consider re-architecting this right now. You are welcome to continue working on / using a fork, but I don't want to make any assurances that we will integrate this proposal.
We had also recently run into the issue that was fixed in #70. However, the re-factoring of the code as part of that broke our particular setup, in which multiple zones push their Logpush files into subdirectories of a single bucket. Given the re-factoring in #70, we would have had to either split the logs into different buckets or deploy multiple functions per subdirectory. However, doing either of those things would have re-introduced the BigQuery quota issue, since we have enough zones that we would have exceeded the limit even on a per-minute schedule per zone.
What I ended up doing here was re-introducing the "old" methodology of having a Cloud Function triggered by the GCS storage bucket `finalize` event, but we instead dump the detected filename into an additional PubSub topic so we can then separately pull all the newly detected files and load them into BigQuery immediately. This also removes the (in my opinion, janky) `deadlineDate`/`deadlineDt` logic that was also introduced.
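The finalize-triggered side is essentially just a thin publisher. As a simplified sketch (not the exact code in this PR; the example object path in the comment is hypothetical):

```js
// Simplified sketch of the finalize-triggered side: publish each new object's
// name to the cf_logs_files topic so the loader function can pick it up.
const { PubSub } = require('@google-cloud/pubsub');
const pubsub = new PubSub();

// Background Cloud Function triggered by google.storage.object.finalize.
exports.runNewFiles = async (event) => {
  const fileName = event.name; // object path within the bucket, e.g. "zone-a/20230101T0000.log.gz"
  await pubsub.topic('cf_logs_files').publishMessage({
    data: Buffer.from(fileName),
  });
  console.log(`Published ${fileName} to cf_logs_files`);
};
```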
This code may not be perfect for all use cases but I wanted to open this PR as an example for others, and perhaps with some tweaks we could make this the standard setup and merge into master?
Overall, I think the "lookback" deadline-time logic is not ideal, and a more real-time "pipeline" using separate Cloud Functions and PubSub seems more logical to me.
Basic overview of the differences:

- New `cf_logs_files` PubSub topic and `cf_logs_file` subscription, in addition to the existing `every_minute` topic
- `runNewFiles` entrypoint, which publishes to the `cf_logs_files` topic and is triggered on-demand by the GCS object finalize event
- `runLoadJob` entrypoint, which is triggered by the `every_minute` PubSub topic (same as before), but which now also subscribes to and consumes `cf_logs_files`, getting a list of file names, mapping them to `Storage.File` objects, and doing batch loads via `BigQuery.createLoadJob`

Also:

- Removed the `writeLog` function, as it seemed unnecessarily redundant to normal Cloud Function stdout/stderr console logging
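For anyone skimming, the consuming side looks roughly like this. This is a simplified, untested sketch rather than the exact PR code: the project ID, bucket, dataset, and table names are placeholders, and it assumes the synchronous-pull `v1.SubscriberClient` from `@google-cloud/pubsub` alongside the `@google-cloud/storage` and `@google-cloud/bigquery` clients.

```js
// Simplified sketch of the runLoadJob side: pull file names from the
// cf_logs_file subscription, map them to Storage.File objects, and batch-load.
const { v1 } = require('@google-cloud/pubsub');
const { Storage } = require('@google-cloud/storage');
const { BigQuery } = require('@google-cloud/bigquery');

const subClient = new v1.SubscriberClient();
const storage = new Storage();
const bq = new BigQuery();

exports.runLoadJob = async () => {
  const subscription = subClient.subscriptionPath('my-project', 'cf_logs_file'); // placeholders
  const [response] = await subClient.pull({ subscription, maxMessages: 100 });
  const received = response.receivedMessages || [];
  if (received.length === 0) return;

  // Map the published file names back to Storage.File objects.
  const files = received.map((m) =>
    storage.bucket('my-logs-bucket').file(m.message.data.toString()) // placeholder bucket
  );

  // One batch load job for all newly detected files.
  const [job] = await bq
    .dataset('cloudflare_logs') // placeholder dataset
    .table('logs')              // placeholder table
    .createLoadJob(files, { sourceFormat: 'NEWLINE_DELIMITED_JSON' });
  console.log(`Started load job ${job.id} for ${files.length} file(s)`);

  // Only acknowledge the messages once the load job has been created.
  await subClient.acknowledge({
    subscription,
    ackIds: received.map((m) => m.ackId),
  });
};
```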