data-for-change / anyway

ANYWAY - Car accidents map
http://www.anyway.co.il
MIT License
77 stars 244 forks source link

Update cities table regularly from data.gov #2665

Closed atalyaalon closed 3 months ago

atalyaalon commented 5 months ago

Today we're using cities from a static csv file path: static/cbs/cities.csv AND static/cbs/cities.xlsx We want to create a regular update of cities

The list in data.gov is here

  1. There are accidents with yishuv symbol but w/o yishuv_name, will this update solve this issue? (After re-loading of the data). See this query here and check the relevant yishuvs that are not in our system - do they appear in gov il data?
  2. Add it to our DB (I think it's exists in models.py with table_name cbs_cities.
  3. Update relevant code that use cities.csv file to use the new DB table (in main.py and localization.py) Make sure column names are matched accordingly if there is a mismatch between static cities.csv and data.gov table. Note that we should only use the relevant column of symbol number, yishuv name in hebrew, yishuv name in English, I think that other columns are not relevant to us (static/cbs/cities.csv probably contains many columns we're not using)
  4. Create a flow in Airflow that updates it once a month
  5. There is one line modification that you've created 2 years ago, I don't remember why and if we need to perform the same change on a new file, can you take a look? (see here)
ziv17 commented 5 months ago

Hi @atalyaalon , Regarding cities appearing in accidents without city name that appear in new cities table:

atalyaalon commented 5 months ago

Hi @atalyaalon , Regarding cities appearing in accidents without city name that appear in new cities table:

  • Cities appearing in accidents without city name that appear in new cities table:

    • 3400 חברון
    • 3815 תל ציון
    • 2826 שער שומרון These accidents will get the city names when reloaded with this fix completed.

Great!

  • The other entries without city names are of codes that does not appear in current CBS cities data. cbs_cities has more entries (1482 rows) then current CBS data (1273 cities, same number of cities in streets and cities tables).

OK, that's what we've got. CBS approved we can use that table so that's ok.

  • When reading accidents we search in static/data/cities.csv for the name according to the symbol. I will change it to search in the cities DB table instead.

OK sounds good

  • The data from CBS site will be populated to the current cbs_cities table, overriding the data from static/data/cities.csv file.

OK that's what we want

  • The cli command main.py create-tables update-cities-table appears to be unused. Should I remove it? We can remove if it's not used anywhere (unless you want to use this command and adjust it to your needs in Airflow)

In addition, I suggest addressing the the following issue: Background: AccidentMarker table ("markers") is used in ANYWAY map nowadays, and data of streets names and yishuves names is inserted using current CBS data (meaning if street name is changing we won't see it there) I think that's fine by us and we don't want to change the way data is inserted into this table nowadays.

However, we do want to update tables we use in anyway Media API, markers_hebrew involved_markers_hebrew in db_views.py file I suggest that tables markers_hebrew involved_markers_hebrew will consist of an additional join of street names fields (street1_hebrew, street2_hebrew) and instead of taking them from AccidentMarker table and having data that's not updated, I suggest joining them with streets table and cities table to get the most updated names. (It's SQL code, I can assist and review if needed, but from what I can see I assume these are 2 simple joins, just make sure the symbols of streets and cities are indexed in both tables)

ziv17 commented 3 months ago

Hi @atalyaalon , It look like the tests DB is built from files in s3, and does not contain data that exists in the production DB. If correct, we need to decide what do we want to do with data from data.gov. What do you think?

ziv17 commented 3 months ago

Hi @atalyaalon , Not sure whether I already wrote to you about it. Our tests are based on a db that is loaded from files. If we remove the cities.csv, then we do not have cities in the test db (nor streets). Should we download cities and streets from data.gov during the test db preparation stage?

ziv17 commented 3 months ago

Hi @atalyaalon regarding item #5, תעשיון דלתון. Our file had an error, two items with same Hebrew name. That itemt does not exist in the current list in data.gov. I think we do not need to do anything about it.

ziv17 commented 3 months ago

Hi @atalyaalon , Thee are ~16000 accidents in 154 different cities (Redash query) that the city does not appear in the current data.gov cities table. I wonder whether there is a correlation table between a removed city and the current city.