HTTPArchive / tech-report-apis

APIs for the HTTP Archive Technology Report
Apache License 2.0

Investigate Firestore ingestion optimizations #11

Open rviscomi opened 11 months ago

rviscomi commented 11 months ago

We're using Firestore as the intermediary storage layer for the API. The problem is that we're only able to import 500 rows of data at a time, so it's taking a very long time and creating issues with the initial backfill.

Investigate whether it's possible to import the entire table in one go, or at least in larger batches. This will speed up the backfill and monthly import jobs and also simplify the pipeline.
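For illustration, the batch-at-a-time import described above can be sketched as a chunking step. This is only a minimal sketch, not the project's actual import code: `chunked`, `MAX_BATCH_OPS`, and the sample rows are hypothetical names, and the Firestore write itself is left out so the example runs without any cloud dependency.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

# Assumption: the 500-row limit mentioned in this issue.
MAX_BATCH_OPS = 500


def chunked(rows: Iterable[T], size: int = MAX_BATCH_OPS) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` rows."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


# Hypothetical sample data standing in for rows exported from the source table.
rows = [{"id": i} for i in range(1234)]

# Each chunk would be written as one batch; here we only record the sizes.
batch_sizes = [len(chunk) for chunk in chunked(rows)]
```

Every chunk costs one round trip, so a table with millions of rows needs thousands of sequential writes, which is why the backfill is slow when the batches are sent one after another.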

tunetheweb commented 11 months ago

Hmmm from a quick Google it does look like it's limited to 500 "operations":

maceto commented 10 months ago

Hi @rviscomi @tunetheweb,

I think we can close this issue. With Giancarlo's help, we were able to incorporate the process into a Dataflow pipeline. The full historical process takes several hours, and the update for the last month takes under 25 minutes.

Dataflow does a great job processing in parallel and inserting into Firestore.
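To illustrate why parallel inserts help, here is a rough sketch that uses a local thread pool as a stand-in for Dataflow workers. It is not the Dataflow pipeline referred to above: `split_batches`, `write_in_parallel`, and `fake_commit` are hypothetical names, and `fake_commit` only counts rows instead of writing to Firestore.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence

Row = Dict[str, object]


def split_batches(rows: Sequence[Row], size: int = 500) -> List[Sequence[Row]]:
    """Split rows into consecutive batches of at most `size` rows."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def write_in_parallel(
    rows: Sequence[Row],
    commit_batch: Callable[[Sequence[Row]], int],
    workers: int = 8,
    size: int = 500,
) -> int:
    """Commit batches concurrently and return the total number of rows written."""
    batches = split_batches(rows, size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map runs commit_batch on several batches at once.
        return sum(pool.map(commit_batch, batches))


def fake_commit(batch: Sequence[Row]) -> int:
    """Hypothetical stand-in for a real batched write; returns the batch size."""
    return len(batch)


written = write_in_parallel([{"id": i} for i in range(1234)], fake_commit)
```

The per-batch limit still applies to each write; the speedup comes from having many batches in flight at the same time rather than from larger batches.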

rviscomi commented 10 months ago

Awesome! Where does that Dataflow code live, and are there any remaining documentation tasks worth tracking?