cagov / caldata-mdsa-caltrans-pems

CalData's MDSA project with Caltrans on Performance Measurement System (PeMS) data
https://cagov.github.io/caldata-mdsa-caltrans-pems/
MIT License
6 stars 0 forks source link

Do we need 14 lanes of data from the data lake? #252

Closed ian-r-rose closed 2 months ago

ian-r-rose commented 3 months ago

@pingpingxiu-DOT-ca-gov has been loading 14 lanes worth of data from db96, as he said that the maximum number of lanes across the districts in the database is 14. However, according to @kengodleskidot, the maximum number of lanes in the source tables is 10.

What is causing this discrepancy?

pingpingxiu-DOT-ca-gov commented 3 months ago

Without any formal specification as I aware of, if we set up a upper bound of lanes that is larger than any district's, then we do not loss data in data relay stage.

If there is a formal specification saying the 10 is the maximal lanes of all districts, then we can change to 10.

ian-r-rose commented 3 months ago

Where does the number 14 come from? I thought you had indicated that at least one of the source tables shows 14 lanes.

pingpingxiu-DOT-ca-gov commented 3 months ago

I remember at one time I saw a district has some columns name such as LOOP12, SPD12, etc. And, because I did not check all districts, I applied a margin of extra 2 lanes, hence 14.

I cannot easily refer to what I saw back then, because we do not have a Data Dictionary. @ZhenyuZhu-Caltrans I feels like the Data Dictionary we are going to build should include the feature of referring to certain table/columns by URI, so that we can use those in discussions.

@ian-r-rose

ian-r-rose commented 3 months ago

@kengodleskidot do you have an idea of where there might be a table with up to 12 lanes?

kengodleskidot commented 2 months ago

@ian-r-rose I am only aware of the VDS30SEC tables that have up to 10 lanes: Image and the STATION_X_SUMMARY tables that contain up to 8 lanes of data: Image I also reviewed the STATION_CONFIG_LOG table and the maximum number of lanes I came across was 8: Image

jkarpen commented 2 months ago

@kengodleskidot and @pingpingxiu-DOT-ca-gov will meet and come to a decision on how to proceed on this.

pingpingxiu-DOT-ca-gov commented 2 months ago

Hi @ian-r-rose , we had a discussion on this internally with @kengodleskidot @mmmiah .

By looking at the data in the relay server, we did not see more than 10 lanes. Seems 14 lanes would not be necessary. @kengodleskidot will have more examination on how the lane number is configured in the DBT model and come up a best practice strategy to align lane numbers across the board.

I originally suggested keeping as is because extra columns don't occupy storage. However, assuming we do decide to downsize, what number to downsize? This would be a question for @kengodleskidot .

In terms of actual changes, I can just go ahead to change on the Data Relay, which will not break the downstream so the downstream will not expect any coordination work regarding this.

ian-r-rose commented 2 months ago

In terms of actual changes, I can just go ahead to change on the Data Relay, which will not break the downstream so the downstream will not expect any coordination work regarding this.

What change specifically would you be making? Dropping the number of lanes in the parquet files would be a breaking change, since we try to ingest those files into a raw table with 14 lanes

pingpingxiu-DOT-ca-gov commented 2 months ago

@ian-r-rose If this is a breaking change, we would also face breaking changes if we downscale to a tight number (say, 8, or 10) now and later on increase lane number. Having the sufficient number of columns eliminate the concern of potentially expensive upgrades down the road. FYI @kengodleskidot

ian-r-rose commented 2 months ago

I am wondering what's the situation on the clearhouse feeds side because they are by design equivalent to the VDS30SEC table. Do they have fixed number of lanes?

They are have eight lanes. That said, one of the first things we do is pivot this data to long format (cf #121 ) so that there is instead a column for "lane". This means that the long-pivoted version of the tables are compatible regardless of whether they have eight or 10 or 14 lanes.

The most expensive potential part of having extra lanes is that it makes joins with the detector config table more expensive: we might have to do a lot of work to attempt to join on lanes that don't actually exist. It's also possible that the Snowflake optimizer is able to remove those unnecessary rows from the join (or we could do it as part of our transform).

So I wouldn't worry about it too much. If it's easier to stick with 14, I think that's fine, but I'd ask you to document this choice in the loader scripts and in the dbt models

pingpingxiu-DOT-ca-gov commented 2 months ago

THe concensus is, we will do documentation to make sure the differences on the lane configurations throughout the system are well understood and not causing confusions. @jkarpen @ian-r-rose @ZhenyuZhu-Caltrans