data-for-change / anyway

ANYWAY - Car accidents map
http://www.anyway.co.il
MIT License
75 stars 242 forks source link

Update cities table regularly from data.gov #2665

Open atalyaalon opened 1 month ago

atalyaalon commented 1 month ago

Today we're using cities from a static csv file path: static/cbs/cities.csv AND static/cbs/cities.xlsx We want to create a regular update of cities

The list in data.gov is here

  1. There are accidents with yishuv symbol but w/o yishuv_name, will this update solve this issue? (After re-loading of the data). See this query here and check the relevant yishuvs that are not in our system - do they appear in gov il data?
  2. Add it to our DB (I think it's exists in models.py with table_name cbs_cities.
  3. Update relevant code that use cities.csv file to use the new DB table (in main.py and localization.py) Make sure column names are matched accordingly if there is a mismatch between static cities.csv and data.gov table. Note that we should only use the relevant column of symbol number, yishuv name in hebrew, yishuv name in English, I think that other columns are not relevant to us (static/cbs/cities.csv probably contains many columns we're not using)
  4. Create a flow in Airflow that updates it once a month
  5. There is one line modification that you've created 2 years ago, I don't remember why and if we need to perform the same change on a new file, can you take a look? (see here)
ziv17 commented 3 weeks ago

Hi @atalyaalon , Regarding cities appearing in accidents without city name that appear in new cities table:

atalyaalon commented 1 week ago

Hi @atalyaalon , Regarding cities appearing in accidents without city name that appear in new cities table:

  • Cities appearing in accidents without city name that appear in new cities table:

    • 3400 חברון
    • 3815 תל ציון
    • 2826 שער שומרון These accidents will get the city names when reloaded with this fix completed.

Great!

  • The other entries without city names are of codes that does not appear in current CBS cities data. cbs_cities has more entries (1482 rows) then current CBS data (1273 cities, same number of cities in streets and cities tables).

OK, that's what we've got. CBS approved we can use that table so that's ok.

  • When reading accidents we search in static/data/cities.csv for the name according to the symbol. I will change it to search in the cities DB table instead.

OK sounds good

  • The data from CBS site will be populated to the current cbs_cities table, overriding the data from static/data/cities.csv file.

OK that's what we want

  • The cli command main.py create-tables update-cities-table appears to be unused. Should I remove it? We can remove if it's not used anywhere (unless you want to use this command and adjust it to your needs in Airflow)

In addition, I suggest addressing the the following issue: Background: AccidentMarker table ("markers") is used in ANYWAY map nowadays, and data of streets names and yishuves names is inserted using current CBS data (meaning if street name is changing we won't see it there) I think that's fine by us and we don't want to change the way data is inserted into this table nowadays.

However, we do want to update tables we use in anyway Media API, markers_hebrew involved_markers_hebrew in db_views.py file I suggest that tables markers_hebrew involved_markers_hebrew will consist of an additional join of street names fields (street1_hebrew, street2_hebrew) and instead of taking them from AccidentMarker table and having data that's not updated, I suggest joining them with streets table and cities table to get the most updated names. (It's SQL code, I can assist and review if needed, but from what I can see I assume these are 2 simple joins, just make sure the symbols of streets and cities are indexed in both tables)