cal-itp / data-infra

Cal-ITP data infrastructure
https://docs.calitp.org/data-infra
GNU Affero General Public License v3.0

Reorganize Amplitude data in GCS #1240

Closed. thekaveman closed this issue 2 years ago

thekaveman commented 2 years ago

Part of the work for #960.

See this comment from @atvaccaro for details

Tasks

thekaveman commented 2 years ago

@atvaccaro notes:

I would consider making ingest_amplitude_raw_dev and ingest_amplitude_raw_dev_prod potentially.

Oh, and to follow up on this, I would use is_development() in calitp-py to pick between the two.
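
A minimal sketch of that selection, not the calitp-py implementation. The local `is_development()` is a stand-in that assumes the environment is signalled through an `AIRFLOW_ENV` environment variable, `amplitude_bucket()` is a hypothetical helper, and the bucket names are copied verbatim from the comment above with an assumed `gs://` prefix.

```python
import os

# Bucket names copied verbatim from the comment above; the gs:// prefix is assumed.
DEV_BUCKET = "gs://ingest_amplitude_raw_dev"
PROD_BUCKET = "gs://ingest_amplitude_raw_dev_prod"


def is_development() -> bool:
    """Stand-in for calitp-py's is_development().

    Assumption: development is signalled by AIRFLOW_ENV=development.
    """
    return os.environ.get("AIRFLOW_ENV") == "development"


def amplitude_bucket() -> str:
    """Hypothetical helper: pick the Amplitude bucket for the current environment."""
    return DEV_BUCKET if is_development() else PROD_BUCKET
```

In a real DAG the stand-in would presumably be replaced by the function imported from calitp-py, leaving only the bucket constants and the helper.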

lauriemerrell commented 2 years ago

A bit more context here: the AirtableToWarehouse operator calls calitp.save_to_gcs, which still calls get_bucket under the hood.

So, @atvaccaro and I do think it would be great if you all wrote to a brand-new bucket: that is very much aligned with the future direction of the bucket structure and will save a migration later. However, as Andrew mentioned, you would probably need to call calitp.is_development directly in a new pattern, and you would not be able to rely on the existing save_to_gcs or get_bucket, since those are hard-coded to gtfs-data and gtfs-data-test.

(TLDR: We appreciate the openness to moving to the new paradigm but it does come with some extra work since we haven't built out as much support for that direction yet; if there's anything we can do to help make that clearer or easier please let us know.)

thekaveman commented 2 years ago

@lauriemerrell we're more than happy to help explore the new paradigm! Please let us know if we can be doing anything (docs?) to help make this easier next go-around.

It does appear that save_to_gcs() accepts a bucket param that overrides get_bucket(), so we should be good with minimal changes, just using is_development() as mentioned.
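
Roughly, the override pattern described here could look like the sketch below. It does not call calitp-py: `is_development()` and `get_bucket()` are local stand-ins (the `AIRFLOW_ENV` check and the mapping of gtfs-data-test to development are assumptions), and `resolve_destination()` is a hypothetical helper that only shows how an explicit bucket would take precedence over the default.

```python
import os
from typing import Optional


def is_development() -> bool:
    # Assumption: stand-in for calitp-py's is_development(), keyed on AIRFLOW_ENV.
    return os.environ.get("AIRFLOW_ENV") == "development"


def get_bucket() -> str:
    # Stand-in for the hard-coded default buckets mentioned in this thread.
    # Assumption: gtfs-data-test is the development bucket.
    return "gs://gtfs-data-test" if is_development() else "gs://gtfs-data"


def resolve_destination(dst_path: str, bucket: Optional[str] = None) -> str:
    """Hypothetical helper: an explicit bucket wins, otherwise use get_bucket()."""
    base = bucket if bucket is not None else get_bucket()
    # Normalize slashes so the bucket and path join with exactly one separator.
    return f"{base.rstrip('/')}/{dst_path.lstrip('/')}"
```

Passing the bucket explicitly keeps the existing default untouched for other callers, which is why this route needs only minimal changes.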

lauriemerrell commented 2 years ago

Awesome, thank you! That's a great question re: docs... I actually think the main thing (not actually directly related to this question about buckets) might just be adding a note about what the Amplitude data is to the datasets and tables section of the docs: https://docs.calitp.org/data-infra/datasets_and_tables/overview.html