chrisbartley opened this issue 2 years ago
This would be really great. We should consider sharding it into 10-25 chunks to enable parallel download and incremental build of the map.
It's important to enable gzip compression to speed loading; JSON consisting mostly of base-10 numbers is incredibly compressible. Protobuf might be better overall, but zlib+JSON is really easy and probably good enough. If server-side CPU is a limitation for serving gzip, we can cache already-gzipped versions like we do for plume binary serving in PlumePGH.
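As a rough illustration of how compressible this kind of numeric JSON is, here's a standalone Node sketch (not ESDR code; the rows are fabricated and repetitive, so the ratio it prints will overstate the real-world gain):

```typescript
import { gzipSync } from "node:zlib";

// Fabricated rows shaped like the compact per-feed arrays proposed below;
// real rows would come from the Feeds table.
const rows = Array.from({ length: 10000 }, (_, i) => [
  26087 + i, "West Mifflin ACHD", 40.363144, -79.864837, 1576762626 + i, 1576686600 + i, 26017 + i,
]);

const json = Buffer.from(JSON.stringify({ version: 1, feeds: rows }));
const gz = gzipSync(json);

console.log(`raw JSON: ${json.length} bytes, gzipped: ${gz.length} bytes`);
console.log(`compression ratio: ${(json.length / gz.length).toFixed(1)}x`);
```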
On Fri, Mar 25, 2022 at 12:45 PM Chris Bartley wrote:
Sites like environmentaldata.org suffer from painfully slow load times as they try to load all of ESDR's public feeds. It might be nice to have a JSON cache of all public feeds which gets updated regularly (every 1-5 minutes?) and contains some essential, but minimal info about each feed.
Perhaps the following:
- id
- name
- latitude
- longitude
- lastUpload / maxTimeSecs
- exposure?
A query like this is a decent start:
```sql
SELECT productId, id, name, latitude, longitude,
       UNIX_TIMESTAMP(lastUpload) AS lastUploadSecs, maxTimeSecs, deviceId
  FROM Feeds
 WHERE isPublic = 1
 ORDER BY productId, deviceId, id DESC;
```
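For concreteness, a hedged sketch of what the periodic cache writer could look like. It assumes the mysql2 client and a placeholder path under ESDR's public directory (ESDR's actual DB layer and layout will differ), and it emits the array-of-arrays shape proposed under "Possible JSON format" below:

```typescript
// A rough sketch of a periodic cache writer, not ESDR's actual code.
import { writeFile } from "node:fs/promises";
import type { Pool } from "mysql2/promise";

const FEEDS_CACHE_PATH = "./public/cache/feeds.json"; // placeholder path

async function rebuildFeedCache(pool: Pool): Promise<void> {
  const [rows] = await pool.query(
    `SELECT productId, id, name, latitude, longitude,
            UNIX_TIMESTAMP(lastUpload) AS lastUploadSecs, maxTimeSecs, deviceId
       FROM Feeds
      WHERE isPublic = 1
      ORDER BY productId, deviceId, id DESC`
  );

  // Group compact per-feed arrays by productId.
  const feedsByProductId: Record<string, unknown[][]> = {};
  for (const r of rows as any[]) {
    (feedsByProductId[r.productId] ??= []).push([
      r.id, r.name, r.latitude, r.longitude, r.lastUploadSecs, r.maxTimeSecs, r.deviceId,
    ]);
  }

  const cache = {
    version: 1,
    fields: ["id", "name", "latitude", "longitude", "lastUploadSecs", "maxTimeSecs", "deviceId"],
    feedsByProductId,
  };
  await writeFile(FEEDS_CACHE_PATH, JSON.stringify(cache));
}

// Rebuild every 5 minutes (the 1-5 minute range suggested above):
// setInterval(() => rebuildFeedCache(pool).catch(console.error), 5 * 60 * 1000);
```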
Ideas to consider:
- Store the JSON under ESDR's public directory, maybe in some subdirectory denoting it as a cache.
- Multiple versions, with differing amounts of info.
- Abbreviated field names in the interest of file size OR using some more compact format such as an array of arrays.
- Group by product ID?
- Sort by productId asc, deviceId asc, feedId desc...so that the most recent feed for a device comes first?
- Also generate separate JSON files per product?
Possible JSON format:
{ "version" : 1, "fields" : ["id", "name", "latitude", "longitude", "lastUploadSecs", "maxTimeSecs", "deviceId"], "feedsByProductId" : { "1" : [ [26087, "West Mifflin ACHD", 40.363144, -79.864837, 1576762626, 1576686600, 26017], [59665, "Pittsburgh ACHD", 40.4656, -79.9611, 1648222891, 1648218600, 56652] ], "8" : [ [4268, "CREATE Lab Test", 40.44382296127876, -79.94647309184074, 1635191877, 1635189738.36, 4260], [4277, "Red Room", 40.34107763959465, -80.069620013237, 1484140498, 1484084287, 4265] ], "9" : [ [4245, "Wifi Speck 6", 40.443738, -79.946481, 1422565308, 1422565433, 4230], [4313, "Outdoors", 40.50156314996969, -80.06125688552856, 1432395167, 1431359910, 4291] ] } }
> It's important to enable gzip compression to speed loading;
Compression is on by default for all ESDR responses, but yes, I like the idea of pre-gzipping. We could also configure things so that the pre-gzipped files get served directly by Apache without having to proxy to Node.
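A minimal sketch of that pre-gzip step, assuming we simply write a .gz sibling next to the plain file (paths are placeholders; the Apache side would need something like a rewrite rule that checks Accept-Encoding and sets Content-Encoding: gzip, which isn't shown here):

```typescript
import { writeFile } from "node:fs/promises";
import { gzipSync } from "node:zlib";

// Write both the plain cache file and a pre-compressed copy so the web
// server can serve the .gz variant statically.
async function writeCacheFiles(json: string, basePath: string): Promise<void> {
  const body = Buffer.from(json);
  await writeFile(`${basePath}.json`, body);                             // plain copy
  await writeFile(`${basePath}.json.gz`, gzipSync(body, { level: 9 }));  // pre-compressed copy
}

// e.g. await writeCacheFiles(JSON.stringify(cache), "./public/cache/feeds");
```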
> We should consider sharding it into 10-25 chunks to enable parallel download and incremental build of the map.
Suggestions for sharding strategies are welcome and appreciated. My first thought was to shard by product ID (because some projects care about only one or a select few products), but it gets kinda dumb quickly: almost half of the products have fewer than 10 feeds, many with only 1. We currently have 113,645 public feeds, and 90% of them are PurpleAir (though that share will drop once the conversion to purpleair_v2 is done).
Maybe settle on a fixed number of shards--or a fixed max number of feeds per shard (5K? 10K?)--along with a "table of contents" JSON which tells you which shard(s) to fetch for which product ID. PurpleAir would be split across multiple shards, whereas some other single shard could contain everything other than the top three products in the table below (a total of 4,338 feeds). A rough sketch of that shard assignment follows the table.
```
+--------+---------+----------------------------------------+
|numFeeds|productId|name |
+--------+---------+----------------------------------------+
|77320 |69 |PurpleAir |
|25486 |101 |purpleair_v2 |
|6501 |11 |AirNow |
|1076 |98 |airnow_reporting_area |
|797 |9 |speck_v1 |
|659 |84 |airnow_aqi_highest_five_city |
|241 |88 |sharp1014 |
|241 |87 |muxed_bme280_and_sgp30 |
|231 |22 |retigo_filestore_test4 |
|205 |102 |TCEQ |
|192 |34 |retigo_filestore_test5 |
|110 |82 |voc_site |
|94 |78 |VOC |
|88 |68 |RAMP |
|65 |80 |AWBA_PurpleAir |
|59 |99 |airviz_pmvoc_a_v1 |
|44 |35 |BAAQMD |
|28 |58 |awair |
|27 |8 |cattfish_v1 |
|22 |1 |ACHD |
|17 |81 |AWBA_Valero |
|15 |94 |RISE_smoke_recognition_v1 |
|15 |97 |RISE_smoke_recognition_v3 |
|14 |93 |AWBA_Chevron |
|11 |86 |AWBA_FencelineMartinez |
|11 |100 |clarity |
|10 |75 |filtrete_smart_sensor |
|9 |36 |fenceline_org |
|8 |71 |Halmon |
|5 |66 |noaa_nws_forecast_office |
|5 |29 |cpb |
|5 |85 |airnow_aqi_highest_five_ranking |
|4 |44 |flamingo_v1 |
|3 |39 |AcuRite |
|3 |89 |smoke_recognition |
|2 |83 |AWBA_FencelineRodeo |
|2 |76 |Argos |
|1 |62 |AlphaSense_OPCN2 |
|1 |42 |shenango_channel_smoke_detection_product|
|1 |63 |plant_module |
|1 |14 |retigo_test_product |
|1 |79 |ArgosSpectrometer |
|1 |46 |EnviroDylosV1_1 |
|1 |64 |hound |
|1 |17 |retigo_test_product2 |
|1 |47 |DylosDC1700 |
|1 |95 |RISE_smoke_recognition_v2 |
|1 |26 |teom |
|1 |56 |dylos_dc1100_pro_pc |
|1 |70 |HAL |
|1 |55 |grimm_edm_180 |
|1 |59 |air_quality_egg_pm |
|1 |10 |maap_5012 |
|1 |54 |foobot |
|1 |38 |avalon_wind |
|1 |57 |blueair_aware |
|1 |77 |adafruit_bme680 |
+--------+---------+----------------------------------------+
```
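Here's a hedged sketch of the fixed-max-feeds-per-shard idea with a table of contents; the 10K cap, the shard file names, and the TOC shape are all placeholders, not a settled design:

```typescript
// Pack products into shards of at most MAX_FEEDS_PER_SHARD feeds, and record
// which shard file(s) hold each product. A product larger than the cap
// (e.g. PurpleAir) is split across several shards.
const MAX_FEEDS_PER_SHARD = 10000;

interface TableOfContents {
  version: number;
  // productId -> list of shard file names containing that product's feeds
  shardsByProductId: Record<string, string[]>;
}

function buildTableOfContents(feedCountsByProductId: Record<string, number>): TableOfContents {
  const shardsByProductId: Record<string, string[]> = {};
  let shardIndex = 0;
  let feedsInCurrentShard = 0;

  for (const [productId, count] of Object.entries(feedCountsByProductId)) {
    let remaining = count;
    shardsByProductId[productId] = [];
    while (remaining > 0) {
      if (feedsInCurrentShard >= MAX_FEEDS_PER_SHARD) {
        shardIndex++;
        feedsInCurrentShard = 0;
      }
      const capacity = MAX_FEEDS_PER_SHARD - feedsInCurrentShard;
      const taken = Math.min(remaining, capacity);
      shardsByProductId[productId].push(`feeds-shard-${shardIndex}.json`);
      feedsInCurrentShard += taken;
      remaining -= taken;
    }
  }
  return { version: 1, shardsByProductId };
}
```

A client that only cares about, say, speck_v1 would read the table of contents, see that product 9 lives in a single shard, and fetch just that file; a map that wants everything would fetch all shards in parallel and build incrementally.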