datalab-dev / covid_worksite_exposure

Scraping and visualizing the UC Davis Potential Worksite Exposure Reporting (AB 685) data
MIT License

Large Files Break GitHub Action #37

Closed MicheleTobias closed 2 years ago

MicheleTobias commented 2 years ago

We've hit a milestone: the data files are now too large to store in GitHub.

The GitHub Action with the scraping code is failing with these errors (among others):

remote: error: File mapinput.txt is 100.99 MB; this exceeds GitHub's file size limit of 100.00 MB        
remote: error: File docs/exposure_data.js is 100.99 MB; this exceeds GitHub's file size limit of 100.00 MB        
remote: error: GH001: Large files detected. You may want to try Git Large File Storage - https://git-lfs.github.com.

We need to find a solution so we can continue to scrape and store the data. Possible solutions are discussed in the comments below.

Related issue: why is the mapinput.txt file not in the data folder?

MicheleTobias commented 2 years ago

After talking with DataLab's Nick Ulle, we've concluded that GitHub LFS isn't a good solution: its storage quota is shared across the entire account, so this project could use up the storage for all of the datalab-dev repositories.

In the short term, I think we may need to split the data file into two pieces and store the older data elsewhere. That would mean the map would not have all the data available, at least for now, but perhaps that's not so bad, given the large number of days already on the map.

Nick suggested we look at Data Version Control to see if it would work for this project. We haven't tried it in DataLab yet, but it looks promising.

MicheleTobias commented 2 years ago

Another idea for a short-term solution: the exposures.csv file is the main dataset and is still pretty small. Maybe we update the code that makes the .geojson file (and mapinput.txt) to use only data from the last 6 months. Then we keep all the data but only build the map for recent history.
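
A minimal sketch of that filter in R (the `date` column name and the file path are assumptions, not confirmed from the repo):

```r
# Keep every row in exposures.csv itself; filter only for the map build.
exposures <- read.csv("data/exposures.csv", stringsAsFactors = FALSE)

# Build the spatial files from roughly the last 6 months of records.
cutoff <- Sys.Date() - 180
recent <- exposures[as.Date(exposures$date) >= cutoff, ]

# 'recent' would then feed the existing geojson / mapinput.txt steps.
```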

MicheleTobias commented 2 years ago

Other options (from DataLab standup):

  1. Compress geojson
  2. Run the daily cron job on the DataLab server and store the files there, keeping just the scripts and the .csv on GitHub

MicheleTobias commented 2 years ago

Writing the data to a topojson file might solve this because topojson stores each geometry only once. Our current geojson file is huge because every time a building has an exposure, we store a new copy of the building outline (geometry). Leaflet reportedly can read topojson.

MicheleTobias commented 2 years ago

As a temporary workaround, I just pushed code to the COVID exposures map that builds the spatial data for the last 6 months only. It keeps all the data in the .csv, but it builds the geojson (txt) and js files for the recent dates only.

MicheleTobias commented 2 years ago

FlatGeobuf, a binary format, might be another solution. Minification could also help reduce the file size of text-based formats by removing whitespace.
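
For the FlatGeobuf idea, a sketch with the sf package, assuming the buildings load as an sf object and GDAL was built with the FlatGeobuf driver (GDAL >= 3.1); the file paths are illustrative:

```r
library(sf)

# Rewrite the text-based geojson as binary FlatGeobuf.
buildings <- st_read("docs/mapinput.geojson")
st_write(buildings, "docs/mapinput.fgb", driver = "FlatGeobuf")
```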

MicheleTobias commented 2 years ago

GDAL doesn't currently write TopoJSON files, and the geojsonio package for R has disabled its TopoJSON writer for now because of a problem with the CRAN checks.

elistockwell commented 2 years ago

I was able to convert the mapinput.txt file (with an intermediate rewrite to mapinput.geojson) to topojson using the topojson npm package: https://www.npmjs.com/package/topojson. It reduced the file size by a factor of 10! Now we just need to figure out how to automate it in our code. This was all done at the command prompt, so I'm not sure whether it can be driven from R or javascript; one possible approach is sketched below.
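
If it helps with automating this, the CLI could be shelled out to from R; a sketch assuming `npm install -g topojson` has put the geo2topo command on the PATH (the layer name `buildings` is arbitrary):

```r
# geo2topo names each input layer; "buildings" here is arbitrary.
status <- system2(
  "geo2topo",
  args = c("buildings=docs/mapinput.geojson",
           "-o", "docs/mapinput.topojson")
)
if (status != 0) stop("geo2topo conversion failed")
```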

MicheleTobias commented 2 years ago

That sounds very promising! One clarification: in the code, we write a geojson file, then convert it to text, and now it's being converted back to geojson? If that's the case, we can simplify the pipeline a bit and remove the conversion to a text file.

elistockwell commented 2 years ago

Yes, it looks like the text file is used to create the javascript variable, but we'll want to convert to topojson first and then create the text from that.
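
For reference, that javascript-variable step can stay a one-liner in R; a sketch in which the variable name `exposureData` is an assumption, not taken from the repo:

```r
# Wrap the topojson text in an assignment so the map page can
# load it with a plain <script src="exposure_data.js"> tag.
json_text <- paste(readLines("docs/mapinput.topojson"), collapse = "")
writeLines(paste0("var exposureData = ", json_text, ";"),
           "docs/exposure_data.js")
```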

elistockwell commented 2 years ago

Would it be possible to have the GitHub Action run the command-line steps? The large files would still be created, but they could be deleted after the action runs.
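
A rough sketch of what that could look like in the workflow (the file name, paths, and steps are assumptions, not the repo's actual action):

```yaml
# Hypothetical excerpt of .github/workflows/scrape.yml
steps:
  - uses: actions/checkout@v3
  - uses: actions/setup-node@v3
    with:
      node-version: 16
  # Install the topojson CLI tools (geo2topo, etc.).
  - run: npm install -g topojson
  # Convert, then delete the oversized intermediate file
  # before anything is committed back to the repo.
  - run: |
      geo2topo buildings=docs/mapinput.geojson -o docs/mapinput.topojson
      rm docs/mapinput.geojson
```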

elistockwell commented 2 years ago

Trying to use d3 to get the topojson to display in Leaflet, following this tutorial: https://www.igismap.com/read-parse-render-topojson-file-leaflet-js/. As soon as I add the d3.json function, the map stops displaying. Unfortunately, the tutorial doesn't provide the data file (uk.json) it uses, so I'm unsure about the file structure, what uk.objects.places/subunits refers to, and how that translates to our topojson file.

erklopez commented 2 years ago

I found this GitHub issue describing the same problem. It was fixed with mapshaper on the command line, so I found an R package named rmapshaper that implements the same simplification techniques. I then ran the ms_simplify function on our file in R; it was successful and reduced the file size from about 180 MB to 29.1 MB, which seems like a great solution for us. I initially thought it would reduce the quality of the map objects, but on further review they look the same to me, and Leaflet can read the result just fine. The documentation for the package can be found here.
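
A minimal sketch of that step, assuming the data loads as an sf object; the keep value shown is the package default, not necessarily what was used here:

```r
library(sf)
library(rmapshaper)

buildings <- st_read("docs/mapinput.geojson")

# keep = 0.05 retains about 5% of the vertices (the default);
# for building outlines the result can look visually identical.
simplified <- ms_simplify(buildings, keep = 0.05)

st_write(simplified, "docs/mapinput_simplified.geojson",
         delete_dsn = TRUE)
```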

MicheleTobias commented 2 years ago

That's great news! Do you feel comfortable trying to implement that? I think that might be just the thing.

MicheleTobias commented 2 years ago

I just saw that you made a PR already. Thank you!