GreenInfo-Network / seattle-building-dashboard

Energy benchmarking for Seattle
https://greeninfo-network.github.io/seattle-building-dashboard/
ISC License

Test data update and documentation #58

Closed: danrademacher closed this issue 8 months ago

danrademacher commented 10 months ago

Let's see how well this new data from Mike matches what is needed: https://drive.google.com/file/d/1H3Mv4jxKS4iVhJo0EvXNXTIqdDbuPzbV/view?usp=drive_link

And along the way, we need to update this documentation https://github.com/GreenInfo-Network/seattle-building-dashboard/blob/master/docs/Update_data.md

Huh, that is old! Did we do this process, or did Mike's predecessor do it? I have no recollection of following these old docs. I wonder if the new table-count limits in CARTO will bite us here...

tomay commented 10 months ago

Unfortunately, yes: we have to delete 9 of our 18 datasets on CARTO in order to upload this one new one (10 tables is the new limit).

Maybe we take this opportunity to move some of the data-only tables out of CARTO completely? These can be CSVs in the repo that we load in the usual way.
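
On the app side, loading one of those CSVs could be as simple as a fetch. A rough sketch only; the path and the naive parsing here are placeholders, not the app's actual loading code, and a real CSV parser would be needed to handle quoted commas:

```js
// Load a content table as a static CSV from the repo instead of querying
// CARTO. Assumes the file is served alongside the app's other static assets.
fetch('data/content_table.csv')
  .then((res) => res.text())
  .then((text) => {
    const [header, ...lines] = text.trim().split('\n');
    const fields = header.split(',');
    return lines.map((line) => {
      const values = line.split(','); // naive: breaks on quoted commas
      return Object.fromEntries(fields.map((f, i) => [f, values[i]]));
    });
  })
  .then((rows) => console.log(`loaded ${rows.length} rows`));
```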

I've opened a separate ticket for those content tables: #59

And a new ticket for backing up and deleting old building data and footprint data: #60

tomay commented 10 months ago

There is not much overlap between the fields in the provided data and the fields required by the application. Even where fields clearly do match, the casing is almost always different, and each dataset (existing vs. update) contains fields that the other lacks.

I've gone through both sets of fields and tried to identify matches and/or add comments and questions here: https://docs.google.com/spreadsheets/d/1TJqAvxrBd4BrO8UWWRTg3edmRqaoVX6FyZq0ItGzjeA/edit#gid=1827985828

In addition, we need to

tomay commented 10 months ago

I've gone over the documentation for data updates, and it is overall okay. I'm not sure about step 6, but that's already flagged as "Optional" so it's okay to leave that as is.

The bigger issue is that the data update process has been slightly different every year. For example, 2020 and 2021 data were provided to us wholesale, with complete data for all years, whereas this documentation assumes that we are adding new annual data (for one year only) to an existing older dataset. The current data update is also different from any of the previous updates, in that it is intended to replace an existing year (2021) that is already in production.

Also, the documentation is obviously very CARTO-centric. Since we are (likely) moving away from CARTO in the near future, I don't think we should put a lot of effort into cleaning it up (especially if GreenInfo is doing the update). It's good enough for the existing CARTO-based data management, but will be completely obsolete if and when we move to some other system.

danrademacher commented 10 months ago

Mike sent another new data table, which I saved here: https://docs.google.com/spreadsheets/d/1yID5uP4UssrLbYSIjLv7DdfznmezUrr9/edit#gid=814713088. It has the same schema as the file he provided earlier this week, which remains VERY different from the data that runs the application.

I am going to email him with a note and ask again about the role of SBW here. We didn't scope any effort to create data in the correct schema, and there are parts of that work we've never done or even seen documented.

danrademacher commented 10 months ago

I have uploaded all of the files that currently have Carto in the name into the SharePoint folder, as well as the R script for reference. I've skimmed the R script and I'm still not totally sure which of these files are critical for GreenInfo's update process. I can see that the following files were updated when I last ran the R script on 6/14, so in theory these were the ones that were important enough to be integrated into the update process. Can you please review the files in the SharePoint folder and confirm which of these you actually need updated with the latest data for your processes? It appears the same live-table problem affects these outputs, so I will need to manually update via Excel.

[screenshot: list of files updated by the 6/14 R script run]

files: https://seattlegov.sharepoint.com/sites/OSE-External/Shared%20Documents/Forms/AllItems.aspx?ga=1&id=%2Fsites%2FOSE%2DExternal%2FShared%20Documents%2FBenchmarking%2FGreenInfo%20OSE%202023%2FCARTO%20files&viewid=35dba9c9%2Dc2c7%2D47c7%2Db35d%2D310aed53921c

tomay commented 10 months ago

Added some more comparisons to this doc: https://docs.google.com/spreadsheets/d/1TJqAvxrBd4BrO8UWWRTg3edmRqaoVX6FyZq0ItGzjeA/edit#gid=1827985828

Reply to Mike:

Looking at the fields in the three files you flagged, here's what I can see.

I'm not sure if that helps; it's really not clear what's different between Carto_2021_Data.csv and Carto_2021_Data_extra.csv, apart from the presence of that one field, "uniqueid", in "Data_extra". I did not look closely for differences in data values between the two files, however. Is there anything in the R script that indicates why it's writing out these two files?

tomay commented 10 months ago

Looking over the R script, the very end of the file has this:

```r
# New changes based on fields provided by Tom Allnutt:
# rename ParentBuildingId to lowercase parentid, and build a year-id key
Carto_Data_extra <- Carto_Data %>%
  rename(parentid = ParentBuildingId) %>%
  mutate(uniqueid = paste(as.character(reportingyear), id, sep = '-'))

# build the output path and write the "extra" CSV
fname = paste('Data/Carto_', as.character(reportingyear), '_Data_extra', datamode, sep = '')
fullfilepath = paste(filefolderroot, fname, '.csv', sep = '')
write.csv(Carto_Data_extra, file = fullfilepath, row.names = FALSE)
```

That’s the file named “extra”, but it’s still not entirely clear why they were writing two nearly (but not quite) identical data files.

The "extra" file is written out at line 541. The other candidate file is written out at line 510, but the only change to the underlying Carto_Data object seems to be those two changes to parentid and uniqueid

tomay commented 10 months ago

seattle_data_columns_2020(1).txt

Mike has been in touch with Gina at SBW, and resent this list of fields that I had sent during the previous update cycle, which they used to format the 2021 data.

Oddly, that list already has parentid in lower case, so I'm not sure why the R script is renaming it; that step isn't necessary.

Also, EPAPropertyType is missing from that list, which is strange, as it is definitely included in the 2021 data, albeit in lower case.

The conclusion at this point is that Carto_2021_Data_extra.csv is the file to use going forward, though it will still need some minor cleanup before loading into CARTO:

danrademacher commented 10 months ago

I just uploaded a version of the data he sent (https://drive.google.com/open?id=1-KeFy9AJAz2YuPYhAbQdagQw_8xkiSCX&usp=drive_fs) to https://cityenergy-seattle.carto.com/dataset/seattle_buildings_2021_update_202308, and then made this commit: https://github.com/GreenInfo-Network/seattle-building-dashboard/commit/13460e50e6171752bc10f47ac373613411cd2145

And it seemed to "just work".

tomay commented 10 months ago

Nice. Did you have to remove "NA", or do any of the other renaming specified above?

Looks like this is only 2021 data, which I assume is going to completely replace the existing 2021 production data (as opposed to being appended to the existing table alongside its existing 2021 rows). We can do that with the query that's documented in Update_data.md, or manually join the tables in Drive and reupload, etc.
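
For reference, the replace-by-year route would look roughly like this through CARTO's SQL API. The production table name (seattle_buildings_master) and the year column (reportingyear) are guesses on my part, and INSERT ... SELECT * assumes identical column order between the two tables, so treat this as a sketch rather than a recipe:

```js
// Delete the existing 2021 rows, then copy in the new upload, as two calls
// (I'm not sure the SQL API accepts multiple statements per request).
// Node 18+ (global fetch), run as an ES module for top-level await.
const runSql = (q) =>
  fetch('https://cityenergy-seattle.carto.com/api/v2/sql', {
    method: 'POST',
    headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
    // The API key needs write access; keep it out of the repo.
    body: new URLSearchParams({ q, api_key: process.env.CARTO_API_KEY }),
  }).then((res) => res.json());

await runSql('DELETE FROM seattle_buildings_master WHERE reportingyear = 2021');
await runSql(`INSERT INTO seattle_buildings_master
              SELECT * FROM seattle_buildings_2021_update_202308`);
```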

~~The only other potential issue I see is that cbps_flag is TRUE and FALSE, but we test it with a simple JavaScript if, so it will always evaluate to true~~. Actually, scratch that: somehow in the upload process this was automagically converted to JS true/false.
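
For posterity, a minimal standalone example of the gotcha I had in mind:

```js
// If cbps_flag had imported as the strings "TRUE"/"FALSE", both are non-empty
// strings, so a bare truthiness test passes either way:
const row = { cbps_flag: 'FALSE' };

if (row.cbps_flag) {
  // this branch runs even for "FALSE" -- any non-empty string is truthy in JS
}

// A check that tolerates either a real boolean or a string import:
const isCbps = row.cbps_flag === true || row.cbps_flag === 'TRUE';
console.log(isCbps); // false
```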

tomay commented 10 months ago

One question Rebecca had: are you planning on pushing updates to the live data viz incrementally (i.e., as they're ready), or pushing to production once, towards the end of the contract?

I'd very much prefer to update and evaluate everything one time (or as few times as possible). This will also help with the building outline update, which we'll have to check against new data each time.

tomay commented 10 months ago

One other note: we should probably delete the geometry field prior to upload and let CARTO work that out from the lat/lng fields.
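
CARTO will often georeference lat/lng columns automatically on import; if it doesn't, the_geom can be rebuilt afterwards with its CDB_LatLng helper. A sketch only, since the coordinate column names (lat, lng) are guesses:

```js
// Rebuild the_geom from coordinate columns after upload, via the SQL API.
// Assumptions: column names lat/lng; Node 18+, run as an ES module.
await fetch('https://cityenergy-seattle.carto.com/api/v2/sql', {
  method: 'POST',
  headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
  body: new URLSearchParams({
    q: `UPDATE seattle_buildings_2021_update_202308
        SET the_geom = CDB_LatLng(lat, lng)`,
    api_key: process.env.CARTO_API_KEY, // needs write access
  }),
});
```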

danrademacher commented 10 months ago

For the record, I didn't do any renaming or removal of NAs. I just uploaded it and assumed I would get errors, but then I didn't. I confirmed via the network control panel that the app was indeed pulling from the new table I uploaded.

Field name capitalization might be taken care of on upload; CARTO generally lowercases everything. I am less sure about the NAs; they must have removed them. Usually, if those are in the data, CARTO will turn any such column into a string field, and we would expect many filters to stop working as a result.
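
For next time, a pre-upload scrub along these lines would guard against the NA problem (the field names here are placeholders, not the actual schema):

```js
// Blank out literal "NA" strings in numeric columns before upload, so CARTO
// infers number types instead of strings; empty cells import as NULL.
const NUMERIC_FIELDS = ['site_eui', 'total_ghg_emissions']; // placeholders
const rows = [{ site_eui: 'NA', total_ghg_emissions: '12.3' }]; // sample record

for (const row of rows) {
  for (const field of NUMERIC_FIELDS) {
    if (row[field] === 'NA') row[field] = '';
  }
}
console.log(rows); // [ { site_eui: '', total_ghg_emissions: '12.3' } ]
```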

I wondered about the GEOM field, too. Agreed, it's probably best to delete it in case any values got accidentally edited.