GreenInfo-Network / nyc-crash-mapper-etl-script

Extract, Transform, and Load script for fetching new data from the NYC Open Data Portal's vehicle collision data and loading into the NYC Crash Mapper table on CARTO.
3 stars 0 forks source link

Review data divergence in Crashmapper #37

Closed sync-by-unito[bot] closed 1 year ago

sync-by-unito[bot] commented 1 year ago

Tasks:

┆Issue is synchronized with this Asana task ┆Due Date: 2023-02-06

danrademacher commented 1 year ago

Start with backlog filling

Details below cover 753 crashes missing from carto.

2022

Note that this one is broken up into 2 runs because I was debugging the missing year for 2022 in the script

check_backlog.py 2022-12
Date range: >= 2022-12-01 AND < 2023-01-01
Getting SODA IDs for crashes already in CARTO
Got 8185 socrata_id entries for existing CARTO records
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host 'data.cityofnewyork.us'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
  warnings.warn(
Got 8187 crashes from SODA entries
Filtered to 2 crashes not yet present in CARTO
Formatting data...
Inserting chunk 1 of 1
{"rows":[],"time":2.542,"fields":{},"total_rows":0}
Done
(base) danrademacher@Dan-MBP-2 backlog % python3 check_backlog.py 2022-11
python3 check_backlog.py 2022-10
python3 check_backlog.py 2022-09
python3 check_backlog.py 2022-08
python3 check_backlog.py 2022-07
python3 check_backlog.py 2022-06
python3 check_backlog.py 2022-05
python3 check_backlog.py 2022-04
python3 check_backlog.py 2022-03
python3 check_backlog.py 2022-02
python3 check_backlog.py 2022-01

Date range: >= 2022-11-01 AND < 2022-12-01
Getting SODA IDs for crashes already in CARTO
Got 8105 socrata_id entries for existing CARTO records
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host 'data.cityofnewyork.us'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
  warnings.warn(
Got 8121 crashes from SODA entries
Filtered to 17 crashes not yet present in CARTO
Formatting data...
Inserting chunk 1 of 1
{"rows":[],"time":7.507,"fields":{},"total_rows":13}
Done
Date range: >= 2022-10-01 AND < 2022-11-01
Getting SODA IDs for crashes already in CARTO
Got 8884 socrata_id entries for existing CARTO records
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host 'data.cityofnewyork.us'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
  warnings.warn(
Got 8920 crashes from SODA entries
Filtered to 40 crashes not yet present in CARTO
Formatting data...
Inserting chunk 1 of 1
{"rows":[],"time":13.084,"fields":{},"total_rows":39}
Done
Date range: >= 2022-09-01 AND < 2022-10-01
Getting SODA IDs for crashes already in CARTO
Got 8792 socrata_id entries for existing CARTO records
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host 'data.cityofnewyork.us'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
  warnings.warn(
Got 8863 crashes from SODA entries
Filtered to 72 crashes not yet present in CARTO
Formatting data...
Inserting chunk 1 of 2
{"rows":[],"time":13.199,"fields":{},"total_rows":39}
Inserting chunk 2 of 2
{"rows":[],"time":11.424,"fields":{},"total_rows":32}
Done
Date range: >= 2022-08-01 AND < 2022-09-01
Getting SODA IDs for crashes already in CARTO
Got 8857 socrata_id entries for existing CARTO records
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host 'data.cityofnewyork.us'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
  warnings.warn(
Got 8917 crashes from SODA entries
Filtered to 62 crashes not yet present in CARTO
Formatting data...
Inserting chunk 1 of 2
{"rows":[],"time":13.445,"fields":{},"total_rows":40}
Inserting chunk 2 of 2
{"rows":[],"time":8.164,"fields":{},"total_rows":22}
Done
Date range: >= 2022-07-01 AND < 2022-08-01
Getting SODA IDs for crashes already in CARTO
Got 8819 socrata_id entries for existing CARTO records
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host 'data.cityofnewyork.us'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
  warnings.warn(
Got 8865 crashes from SODA entries
Filtered to 46 crashes not yet present in CARTO
Formatting data...
Inserting chunk 1 of 2
{"rows":[],"time":13.367,"fields":{},"total_rows":37}
Inserting chunk 2 of 2
{"rows":[],"time":3.108,"fields":{},"total_rows":5}
Done
Date range: >= 2022-06-01 AND < 2022-07-01
Getting SODA IDs for crashes already in CARTO
Got 9432 socrata_id entries for existing CARTO records
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host 'data.cityofnewyork.us'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
  warnings.warn(
Got 9467 crashes from SODA entries
Filtered to 38 crashes not yet present in CARTO
Formatting data...
Inserting chunk 1 of 1
{"rows":[],"time":12.823,"fields":{},"total_rows":34}
Done
Date range: >= 2022-05-01 AND < 2022-06-01
Getting SODA IDs for crashes already in CARTO
Got 9426 socrata_id entries for existing CARTO records
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host 'data.cityofnewyork.us'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
  warnings.warn(
Got 9458 crashes from SODA entries
Filtered to 36 crashes not yet present in CARTO
Formatting data...
Inserting chunk 1 of 1
{"rows":[],"time":12.761,"fields":{},"total_rows":34}
Done
Date range: >= 2022-04-01 AND < 2022-05-01
Getting SODA IDs for crashes already in CARTO
Got 8520 socrata_id entries for existing CARTO records
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host 'data.cityofnewyork.us'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
  warnings.warn(
Got 8542 crashes from SODA entries
Filtered to 24 crashes not yet present in CARTO
Formatting data...
Inserting chunk 1 of 1
{"rows":[],"time":8.951,"fields":{},"total_rows":23}
Done
Date range: >= 2022-03-01 AND < 2022-04-01
Getting SODA IDs for crashes already in CARTO
Got 8826 socrata_id entries for existing CARTO records
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host 'data.cityofnewyork.us'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
  warnings.warn(
Got 8846 crashes from SODA entries
Filtered to 21 crashes not yet present in CARTO
Formatting data...
Inserting chunk 1 of 1
{"rows":[],"time":7.908,"fields":{},"total_rows":18}
Done
Date range: >= 2022-02-01 AND < 2022-03-01
Getting SODA IDs for crashes already in CARTO
Got 7365 socrata_id entries for existing CARTO records
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host 'data.cityofnewyork.us'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
  warnings.warn(
Got 7379 crashes from SODA entries
Filtered to 17 crashes not yet present in CARTO
Formatting data...
Inserting chunk 1 of 1
{"rows":[],"time":6.403,"fields":{},"total_rows":16}
Done
Date range: >= 2022-01-01 AND < 2022-02-01
Getting SODA IDs for crashes already in CARTO
Got 7861 socrata_id entries for existing CARTO records
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host 'data.cityofnewyork.us'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
  warnings.warn(
Got 7903 crashes from SODA entries
Filtered to 44 crashes not yet present in CARTO
Formatting data...
Inserting chunk 1 of 2
{"rows":[],"time":13.872,"fields":{},"total_rows":39}
Inserting chunk 2 of 2
{"rows":[],"time":3.222,"fields":{},"total_rows":4}
Done

2021

python3 check_backlog.py 2021-11
python3 check_backlog.py 2021-10
python3 check_backlog.py 2021-09
python3 check_backlog.py 2021-08
python3 check_backlog.py 2021-07
python3 check_backlog.py 2021-06
python3 check_backlog.py 2021-05
python3 check_backlog.py 2021-04
python3 check_backlog.py 2021-03
python3 check_backlog.py 2021-02
python3 check_backlog.py 2021-01
Date range: >= 2021-11-01 AND < 2021-12-01
Getting SODA IDs for crashes already in CARTO
Got 9341 socrata_id entries for existing CARTO records
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host 'data.cityofnewyork.us'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
  warnings.warn(
Got 9375 crashes from SODA entries
Filtered to 34 crashes not yet present in CARTO
Formatting data...
Inserting chunk 1 of 1
{"rows":[],"time":11.451,"fields":{},"total_rows":27}
Done
Date range: >= 2021-10-01 AND < 2021-11-01
Getting SODA IDs for crashes already in CARTO
Got 10188 socrata_id entries for existing CARTO records
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host 'data.cityofnewyork.us'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
  warnings.warn(
Got 10204 crashes from SODA entries
Filtered to 21 crashes not yet present in CARTO
Formatting data...
Inserting chunk 1 of 1
{"rows":[],"time":7.521,"fields":{},"total_rows":20}
Done
Date range: >= 2021-09-01 AND < 2021-10-01
Getting SODA IDs for crashes already in CARTO
Got 9870 socrata_id entries for existing CARTO records
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host 'data.cityofnewyork.us'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
  warnings.warn(
Got 9895 crashes from SODA entries
Filtered to 27 crashes not yet present in CARTO
Formatting data...
Inserting chunk 1 of 1
{"rows":[],"time":10.009,"fields":{},"total_rows":26}
Done
Date range: >= 2021-08-01 AND < 2021-09-01
Getting SODA IDs for crashes already in CARTO
Got 9865 socrata_id entries for existing CARTO records
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host 'data.cityofnewyork.us'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
  warnings.warn(
Got 9880 crashes from SODA entries
Filtered to 16 crashes not yet present in CARTO
Formatting data...
Inserting chunk 1 of 1
{"rows":[],"time":6.552,"fields":{},"total_rows":16}
Done
Date range: >= 2021-07-01 AND < 2021-08-01
Getting SODA IDs for crashes already in CARTO
Got 9994 socrata_id entries for existing CARTO records
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host 'data.cityofnewyork.us'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
  warnings.warn(
Got 10002 crashes from SODA entries
Filtered to 8 crashes not yet present in CARTO
Formatting data...
Inserting chunk 1 of 1
{"rows":[],"time":3.672,"fields":{},"total_rows":4}
Done
Date range: >= 2021-06-01 AND < 2021-07-01
Getting SODA IDs for crashes already in CARTO
Got 10607 socrata_id entries for existing CARTO records
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host 'data.cityofnewyork.us'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
  warnings.warn(
Got 10608 crashes from SODA entries
Filtered to 6 crashes not yet present in CARTO
Formatting data...
Inserting chunk 1 of 1
{"rows":[],"time":3.041,"fields":{},"total_rows":4}
Done
Date range: >= 2021-05-01 AND < 2021-06-01
Getting SODA IDs for crashes already in CARTO
Got 10275 socrata_id entries for existing CARTO records
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host 'data.cityofnewyork.us'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
  warnings.warn(
Got 10289 crashes from SODA entries
Filtered to 16 crashes not yet present in CARTO
Formatting data...
Inserting chunk 1 of 1
{"rows":[],"time":5.937,"fields":{},"total_rows":10}
Done
Date range: >= 2021-04-01 AND < 2021-05-01
Getting SODA IDs for crashes already in CARTO
Got 8750 socrata_id entries for existing CARTO records
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host 'data.cityofnewyork.us'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
  warnings.warn(
Got 8752 crashes from SODA entries
Filtered to 6 crashes not yet present in CARTO
Formatting data...
Inserting chunk 1 of 1
{"rows":[],"time":3.425,"fields":{},"total_rows":3}
Done
Date range: >= 2021-03-01 AND < 2021-04-01
Getting SODA IDs for crashes already in CARTO
Got 8260 socrata_id entries for existing CARTO records
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host 'data.cityofnewyork.us'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
  warnings.warn(
Got 8262 crashes from SODA entries
Filtered to 6 crashes not yet present in CARTO
Formatting data...
Inserting chunk 1 of 1
{"rows":[],"time":2.896,"fields":{},"total_rows":4}
Done
Date range: >= 2021-02-01 AND < 2021-03-01
Getting SODA IDs for crashes already in CARTO
Got 6959 socrata_id entries for existing CARTO records
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host 'data.cityofnewyork.us'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
  warnings.warn(
Got 6975 crashes from SODA entries
Filtered to 18 crashes not yet present in CARTO
Formatting data...
Inserting chunk 1 of 1
{"rows":[],"time":6.708,"fields":{},"total_rows":13}
Done
Date range: >= 2021-01-01 AND < 2021-02-01
Getting SODA IDs for crashes already in CARTO
Got 7669 socrata_id entries for existing CARTO records
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host 'data.cityofnewyork.us'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
  warnings.warn(
Got 7718 crashes from SODA entries
Filtered to 53 crashes not yet present in CARTO
Formatting data...
Inserting chunk 1 of 2
{"rows":[],"time":13.421,"fields":{},"total_rows":39}
Inserting chunk 2 of 2
{"rows":[],"time":5.281,"fields":{},"total_rows":12}
Done

2020

python3 check_backlog.py 2020-12
python3 check_backlog.py 2020-11
python3 check_backlog.py 2020-10
python3 check_backlog.py 2020-09
python3 check_backlog.py 2020-08
python3 check_backlog.py 2020-07
python3 check_backlog.py 2020-06
python3 check_backlog.py 2020-05
python3 check_backlog.py 2020-04
python3 check_backlog.py 2020-03
python3 check_backlog.py 2020-02
python3 check_backlog.py 2020-01
Date range: >= 2020-12-01 AND < 2021-01-01
Getting SODA IDs for crashes already in CARTO
Got 8374 socrata_id entries for existing CARTO records
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host 'data.cityofnewyork.us'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
  warnings.warn(
Got 8379 crashes from SODA entries
Filtered to 6 crashes not yet present in CARTO
Formatting data...
Inserting chunk 1 of 1
{"rows":[],"time":2.889,"fields":{},"total_rows":3}
Done
Date range: >= 2020-11-01 AND < 2020-12-01
Getting SODA IDs for crashes already in CARTO
Got 9026 socrata_id entries for existing CARTO records
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host 'data.cityofnewyork.us'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
  warnings.warn(
Got 9029 crashes from SODA entries
Filtered to 6 crashes not yet present in CARTO
Formatting data...
Inserting chunk 1 of 1
{"rows":[],"time":2.875,"fields":{},"total_rows":2}
Done
Date range: >= 2020-10-01 AND < 2020-11-01
Getting SODA IDs for crashes already in CARTO
Got 9713 socrata_id entries for existing CARTO records
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host 'data.cityofnewyork.us'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
  warnings.warn(
Got 9710 crashes from SODA entries
Filtered to 1 crashes not yet present in CARTO
Formatting data...
Inserting chunk 1 of 1
{"rows":[],"time":1.403,"fields":{},"total_rows":0}
Done
Date range: >= 2020-09-01 AND < 2020-10-01
Getting SODA IDs for crashes already in CARTO
Got 9607 socrata_id entries for existing CARTO records
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host 'data.cityofnewyork.us'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
  warnings.warn(
Got 9610 crashes from SODA entries
Filtered to 4 crashes not yet present in CARTO
Formatting data...
Inserting chunk 1 of 1
{"rows":[],"time":2.209,"fields":{},"total_rows":1}
Done
Date range: >= 2020-08-01 AND < 2020-09-01
Getting SODA IDs for crashes already in CARTO
Got 9821 socrata_id entries for existing CARTO records
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host 'data.cityofnewyork.us'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
  warnings.warn(
Got 9823 crashes from SODA entries
Filtered to 5 crashes not yet present in CARTO
Formatting data...
Inserting chunk 1 of 1
{"rows":[],"time":2.564,"fields":{},"total_rows":3}
Done
Date range: >= 2020-07-01 AND < 2020-08-01
Getting SODA IDs for crashes already in CARTO
Got 9277 socrata_id entries for existing CARTO records
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host 'data.cityofnewyork.us'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
  warnings.warn(
Got 9277 crashes from SODA entries
Filtered to 2 crashes not yet present in CARTO
Formatting data...
Inserting chunk 1 of 1
{"rows":[],"time":1.495,"fields":{},"total_rows":0}
Done
Date range: >= 2020-06-01 AND < 2020-07-01
Getting SODA IDs for crashes already in CARTO
Got 7640 socrata_id entries for existing CARTO records
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host 'data.cityofnewyork.us'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
  warnings.warn(
Got 7646 crashes from SODA entries
Filtered to 8 crashes not yet present in CARTO
Formatting data...
Inserting chunk 1 of 1
{"rows":[],"time":3.615,"fields":{},"total_rows":5}
Done
Date range: >= 2020-05-01 AND < 2020-06-01
Getting SODA IDs for crashes already in CARTO
Got 6163 socrata_id entries for existing CARTO records
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host 'data.cityofnewyork.us'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
  warnings.warn(
Got 6164 crashes from SODA entries
Filtered to 3 crashes not yet present in CARTO
Formatting data...
Inserting chunk 1 of 1
{"rows":[],"time":2.003,"fields":{},"total_rows":1}
Done
Date range: >= 2020-04-01 AND < 2020-05-01
Getting SODA IDs for crashes already in CARTO
Got 4130 socrata_id entries for existing CARTO records
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host 'data.cityofnewyork.us'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
  warnings.warn(
Got 4130 crashes from SODA entries
Filtered to 3 crashes not yet present in CARTO
Formatting data...
Inserting chunk 1 of 1
{"rows":[],"time":2.168,"fields":{},"total_rows":3}
Done
Date range: >= 2020-03-01 AND < 2020-04-01
Getting SODA IDs for crashes already in CARTO
Got 11031 socrata_id entries for existing CARTO records
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host 'data.cityofnewyork.us'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
  warnings.warn(
Got 11077 crashes from SODA entries
Filtered to 62 crashes not yet present in CARTO
Formatting data...
Inserting chunk 1 of 2
{"rows":[],"time":10.967,"fields":{},"total_rows":3}
Inserting chunk 2 of 2
{"rows":[],"time":6.351,"fields":{},"total_rows":0}
Done
Date range: >= 2020-02-01 AND < 2020-03-01
Getting SODA IDs for crashes already in CARTO
Got 13745 socrata_id entries for existing CARTO records
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host 'data.cityofnewyork.us'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
  warnings.warn(
Got 13704 crashes from SODA entries
Filtered to 17 crashes not yet present in CARTO
Formatting data...
Inserting chunk 1 of 1
{"rows":[],"time":6.212,"fields":{},"total_rows":0}
Done
Date range: >= 2020-01-01 AND < 2020-02-01
Getting SODA IDs for crashes already in CARTO
Got 14365 socrata_id entries for existing CARTO records
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host 'data.cityofnewyork.us'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
  warnings.warn(
Got 14366 crashes from SODA entries
Filtered to 6 crashes not yet present in CARTO
Formatting data...
Inserting chunk 1 of 1
{"rows":[],"time":2.543,"fields":{},"total_rows":3}
Done
danrademacher commented 1 year ago

Fix Tallies

From https://chekpeds.carto.com/tables/crashes_all_prod

SELECT
    socrata_id,
    number_of_persons_injured,
    number_of_cyclist_injured,
    number_of_motorist_injured,
    number_of_pedestrian_injured,
    number_of_persons_killed,
    number_of_cyclist_killed,
    number_of_motorist_killed,
    number_of_pedestrian_killed
FROM crashes_all_prod
WHERE
    socrata_id IS NOT NULL
    AND date_val::date >= '2020-01-01' AND date_val::date < '2022-12-31'
    ORDER BY date_val

And then got CSV from Socrata: https://data.cityofnewyork.us/resource/qiz3-axqb.csv?$where=date%3E=%272020-01-01%27%20AND%20date%3C%272022-12-31%27&$order=date%20ASC&$limit=1000000

I had to make a couple of changes to the script to get it to run. See 5063d5f4fdf3d66d922b6f64de3be2afb61fd9e9 ad 2a2c99ef1d4e80df7291284d7009bfb74b296535

Then it ran for a while and I got:

{"error":["You are over platform's limits: SQL query timeout error. Refactor your query before running again or contact CARTO support for more details."],"context":"limit","detail":"datasource"}

My first round crash_diffs.csv looked like this

socrata_id,number_of_persons_injured,number_of_cyclist_injured,number_of_motorist_injured,number_of_pedestrians_injured,number_of_persons_killed,number_of_cyclist_killed,number_of_motorist_killed,number_of_pedestrians_killed
4272496,3,0,3,0,0,0,0,0
4310365,2,0,2,0,2,0,2,0
4315836,1,1,0,0,0,0,0,0
4317929,1,0,1,0,0,0,0,0
4318347,1,0,1,0,0,0,0,0
4333367,2,0,2,0,0,0,0,0
4330838,2,0,1,1,0,0,0,0
4340750,1,1,0,0,0,0,0,0
4345572,0,0,0,0,1,0,0,1
4343981,1,0,1,0,0,0,0,0
4345021,0,0,0,0,1,0,0,1
4345804,2,0,2,0,0,0,0,0
4349231,1,0,1,0,0,0,0,0
4350554,0,0,0,0,1,0,0,1
4349728,1,0,0,1,1,0,1,0
4349741,6,0,6,0,0,0,0,0
4350465,2,0,2,0,0,0,0,0
4350999,2,0,2,0,0,0,0,0
4351123,1,0,1,0,0,0,0,0
4351821,0,0,0,0,1,1,0,0
4351824,1,0,1,0,1,0,1,0
4351439,0,0,0,0,1,0,0,1
4351322,0,0,0,0,1,0,0,1
4353743,2,0,2,0,0,0,0,0
4351356,8,0,8,0,0,0,0,0
4351956,4,0,4,0,0,0,0,0
4352984,2,0,2,0,0,0,0,0
4353968,2,0,2,0,0,0,0,0
4353559,2,0,2,0,0,0,0,0
4354428,3,0,3,0,0,0,0,0
4354449,8,0,8,0,0,0,0,0
4356710,2,0,2,0,0,0,0,0
4354494,3,0,3,0,0,0,0,0
4356433,2,0,2,0,0,0,0,0
4354958,0,0,0,0,1,0,0,1
4355601,3,0,3,0,0,0,0,0
4356320,0,0,0,0,1,0,0,1
4356451,2,0,2,0,0,0,0,0
4356271,1,0,1,0,0,0,0,0
4356785,3,0,3,0,0,0,0,0
4356219,1,0,1,0,0,0,0,0
4356848,2,0,2,0,0,0,0,0
4356503,1,0,1,0,0,0,0,0
4357136,1,0,1,0,0,0,0,0
4358018,0,0,0,0,0,0,0,0
4358000,0,0,0,0,1,0,0,1
4358224,2,0,2,0,0,0,0,0
4358899,0,0,0,0,1,0,0,1
4358649,1,0,1,0,1,0,0,1
4358913,0,0,0,0,1,0,1,0
4359512,0,0,0,0,0,0,0,0
4361297,3,0,3,0,0,0,0,0
4359759,1,0,1,0,0,0,0,0
4359646,0,0,0,0,1,0,0,1
4360296,1,0,1,0,0,0,0,0
4359986,1,1,0,0,0,0,0,0
4360374,1,0,1,0,0,0,0,0
4360513,2,0,2,0,0,0,0,0
4361302,5,0,5,0,0,0,0,0
4361105,1,0,1,0,0,0,0,0
4361535,4,0,4,0,0,0,0,0
4362350,1,0,1,0,2,0,2,0
4362609,1,0,1,0,0,0,0,0
4362364,1,0,1,0,0,0,0,0
4362433,1,0,1,0,1,0,0,1
4363167,2,0,2,0,0,0,0,0
4363097,2,0,2,0,0,0,0,0
4364242,2,0,2,0,0,0,0,0
4366249,2,0,2,0,0,0,0,0
4365470,1,0,1,0,0,0,0,0
4365099,0,0,0,0,1,1,0,0
4365386,4,0,4,0,0,0,0,0
4366129,3,0,3,0,0,0,0,0
4366939,0,0,0,0,1,0,1,0
4370080,4,0,4,0,0,0,0,0
4367636,1,0,1,0,0,0,0,0
4367384,0,0,0,0,0,0,0,0
4368873,3,0,3,0,0,0,0,0
4370328,1,0,1,0,0,0,0,0
4369139,2,0,2,0,0,0,0,0
4369243,3,0,3,0,0,0,0,0
4369302,2,0,0,2,0,0,0,0
4369959,3,0,3,0,0,0,0,0
4370286,2,0,2,0,0,0,0,0
4376575,1,0,1,0,0,0,0,0
4370826,1,1,0,0,0,0,0,0
4374772,0,0,0,0,1,0,0,1
4372273,1,1,0,0,0,0,0,0
4370980,2,0,2,0,0,0,0,0
4372727,0,0,0,0,1,1,0,0
4380424,3,0,3,0,0,0,0,0
4372481,1,0,1,0,0,0,0,0
4372382,1,1,0,0,0,0,0,0
4372804,2,0,2,0,0,0,0,0
4372732,1,0,1,0,1,0,1,0
4372953,1,0,1,0,0,0,0,0
4373933,6,0,6,0,0,0,0,0
4374310,1,0,1,0,0,0,0,0
4374650,1,0,1,0,1,0,0,1
4380563,1,0,0,1,0,0,0,0
4375381,0,0,0,0,1,0,1,0
4380725,1,0,1,0,0,0,0,0
4380486,1,0,0,1,0,0,0,0
4376113,0,0,0,0,1,0,1,0
4376895,4,0,4,0,0,0,0,0
4377596,4,0,4,0,0,0,0,0
4380427,3,0,3,0,0,0,0,0
4377662,0,0,0,0,1,0,0,1
4377577,3,0,3,0,0,0,0,0
4377789,1,0,1,0,0,0,0,0
4378107,2,0,2,0,0,0,0,0
4377857,1,1,0,0,0,0,0,0
4378157,0,0,0,0,0,0,0,0
4380521,1,0,1,0,0,0,0,0
4379638,1,0,1,0,0,0,0,0
4380400,1,0,1,0,0,0,0,0
4380488,1,1,0,0,0,0,0,0
4380698,1,0,1,0,0,0,0,0
4380654,1,0,0,1,0,0,0,0
4379674,4,0,4,0,0,0,0,0
4379374,2,0,2,0,1,0,1,0
4379392,0,0,0,0,1,0,1,0
4380461,1,0,1,0,0,0,0,0
4380535,1,0,0,1,0,0,0,0
4380519,1,0,1,0,0,0,0,0
4380497,2,0,2,0,0,0,0,0
4380422,1,0,1,0,0,0,0,0
4380431,1,1,0,0,0,0,0,0
4380694,1,0,1,0,0,0,0,0
4379943,1,0,0,1,0,0,0,0
4380395,1,0,1,0,0,0,0,0
4380550,1,0,0,1,0,0,0,0
4380456,1,1,0,0,0,0,0,0
4380710,1,0,1,0,0,0,0,0
4380445,1,0,0,1,0,0,0,0
4380450,1,1,0,0,0,0,0,0
4380617,4,0,4,0,0,0,0,0
4380274,2,0,2,0,0,0,0,0
4381664,2,0,2,0,0,0,0,0
4381518,1,0,1,0,0,0,0,0
4381969,1,0,1,0,0,0,0,0
4382929,0,0,0,0,1,0,1,0
4383789,1,0,1,0,0,0,0,0
4383926,0,0,0,0,1,0,0,1
4384065,0,0,0,0,1,0,0,1
4384555,1,1,0,0,0,0,0,0
4385139,3,0,3,0,0,0,0,0
4385768,4,0,4,0,0,0,0,0
4386052,3,0,3,0,0,0,0,0
4387163,2,0,2,0,0,0,0,0
4387037,0,0,0,0,0,0,0,0
4387388,7,0,7,0,0,0,0,0
4387420,3,0,3,0,0,0,0,0
4389213,1,0,1,0,0,0,0,0
4387369,,1,0,0,0,0,0,0
4387714,0,0,0,0,1,0,0,1
4390062,2,0,2,0,0,0,0,0
4390781,1,0,1,0,0,0,0,0
4391304,2,0,2,0,0,0,0,0
4391785,3,0,3,0,0,0,0,0
4392149,0,0,0,0,1,0,0,1
4392412,0,0,0,0,2,0,2,0
4393276,2,0,2,0,0,0,0,0
4393789,0,0,0,0,1,0,0,1
4393349,0,0,0,0,1,0,0,1
4397878,2,0,2,0,0,0,0,0
4393657,1,0,1,0,0,0,0,0
4394172,0,0,0,0,1,0,0,1
4394349,0,0,0,0,1,0,0,1
4394685,2,0,2,0,0,0,0,0
4395193,0,0,0,0,1,0,0,1
4396621,0,0,0,0,1,0,1,0
4404421,0,0,0,0,1,0,0,1
4398425,1,1,0,0,0,0,0,0
4399201,0,0,0,0,1,0,0,1
4398443,1,1,0,0,0,0,0,0
4398561,0,0,0,0,0,0,0,0
4399921,1,0,0,1,0,0,0,0
4399035,0,0,0,0,1,0,0,1
4399313,0,0,0,0,0,0,0,0
4399215,1,0,1,0,0,0,0,0
4399490,2,0,0,2,0,0,0,0
4400037,4,0,4,0,0,0,0,0
4400108,2,0,2,0,0,0,0,0
4401009,0,0,0,0,1,0,0,1
4401549,1,0,1,0,1,0,1,0
4402540,5,0,5,0,0,0,0,0
4403773,1,0,1,0,0,0,0,0
4405244,7,0,7,0,0,0,0,0
4407069,0,0,0,0,1,0,0,1
4405914,1,0,1,0,0,0,0,0
4419047,1,0,1,0,0,0,0,0
4595838,2,0,2,0,0,0,0,0
4594504,0,0,0,0,1,0,0,1

So 195 differences.

Based on this part of the update script, https://github.com/GreenInfo-Network/nyc-crash-mapper-etl-script/blob/master/fixtallies/2-update_carto.py#L30-L53, it looks to me like we didn't have to batch process these in the past.

danrademacher commented 1 year ago

I thought maybe I could split this up manually, but it's not the number of rows that is the problem.

Even a Diffs file with a single row:

socrata_id,number_of_persons_injured,number_of_cyclist_injured,number_of_motorist_injured,number_of_pedestrians_injured,number_of_persons_killed,number_of_cyclist_killed,number_of_motorist_killed,number_of_pedestrians_killed
4272496,3,0,3,0,0,0,0,0

Fails with

{"error":["You are over platform's limits: SQL query timeout error. Refactor your query before running again or contact CARTO support for more details."],"context":"limit","detail":"datasource"}

It really doesn't amke sense, given that that update query is not very complex at all, matching on an integer ID field!

danrademacher commented 1 year ago

I see now that each one of these queries is a single update, so the total number should not matter. This works in the GUI:

UPDATE crashes_all_prod SET
    number_of_persons_injured=3,
    number_of_cyclist_injured=0,
    number_of_motorist_injured=3,
    number_of_pedestrian_injured=0,
    number_of_persons_killed=0,
    number_of_cyclist_killed=0,
    number_of_motorist_killed=0,
    number_of_pedestrian_killed=0
WHERE socrata_id=4272496

Increasing the Sleep timeout to 2 seems to have fixed this issue.

danrademacher commented 1 year ago

Also did these but 0 missing geoms found...

python3 fix_null_geom_in_carto.py 2020-01
python3 fix_null_geom_in_carto.py 2020-02
python3 fix_null_geom_in_carto.py 2020-03
python3 fix_null_geom_in_carto.py 2020-04
python3 fix_null_geom_in_carto.py 2020-05
python3 fix_null_geom_in_carto.py 2020-06
python3 fix_null_geom_in_carto.py 2020-07
python3 fix_null_geom_in_carto.py 2020-08
python3 fix_null_geom_in_carto.py 2020-09
python3 fix_null_geom_in_carto.py 2020-10
python3 fix_null_geom_in_carto.py 2020-11
python3 fix_null_geom_in_carto.py 2020-12

python3 fix_null_geom_in_carto.py 2021-01
python3 fix_null_geom_in_carto.py 2021-02
python3 fix_null_geom_in_carto.py 2021-03
python3 fix_null_geom_in_carto.py 2021-04
python3 fix_null_geom_in_carto.py 2021-05
python3 fix_null_geom_in_carto.py 2021-06
python3 fix_null_geom_in_carto.py 2021-07
python3 fix_null_geom_in_carto.py 2021-08
python3 fix_null_geom_in_carto.py 2021-09
python3 fix_null_geom_in_carto.py 2021-10
python3 fix_null_geom_in_carto.py 2021-11
python3 fix_null_geom_in_carto.py 2021-12

python3 fix_null_geom_in_carto.py 2022-01
python3 fix_null_geom_in_carto.py 2022-02
python3 fix_null_geom_in_carto.py 2022-03
python3 fix_null_geom_in_carto.py 2022-04
python3 fix_null_geom_in_carto.py 2022-05
python3 fix_null_geom_in_carto.py 2022-06
python3 fix_null_geom_in_carto.py 2022-07
python3 fix_null_geom_in_carto.py 2022-08
python3 fix_null_geom_in_carto.py 2022-09
python3 fix_null_geom_in_carto.py 2022-10
python3 fix_null_geom_in_carto.py 2022-11
python3 fix_null_geom_in_carto.py 2022-12
danrademacher commented 1 year ago

Based on this review: https://docs.google.com/spreadsheets/d/1YqUjGVQL5cdslemfg39GOs6tLlVOv3XyHepJ_7kyS_s/edit#gid=0

We're in much better shape. 1 missing cyclist fatality from 2021 to look for. Then also check totals for injuries.

sync-by-unito[bot] commented 1 year ago

➤ Dan Rademacher commented:

More data check work here: https://docs.google.com/spreadsheets/d/1YqUjGVQL5cdslemfg39GOs6tLlVOv3XyHepJ_7kyS_s/edit#gid=1764953521 ( https://docs.google.com/spreadsheets/d/1YqUjGVQL5cdslemfg39GOs6tLlVOv3XyHepJ_7kyS_s/edit#gid=1764953521 )

total difference on a monthly basis for 2020-2022 is only 11 missing deaths or injuries. BUT some months we have too many and others too few.

Of 37 months including Jan 23, 34 have some variance....

danrademacher commented 1 year ago

One thing I didn't notice the first time I ran the backlog filler is that a few found missing crashes but then seemed not to import them

These are the ones that found missing crashes but then imported zero rows:

Date range: >= 2022-12-01 AND < 2023-01-01
Got 8185 socrata_id entries for existing CARTO records
Got 8187 crashes from SODA entries
Filtered to 2 crashes not yet present in CARTO
Formatting data...
Inserting chunk 1 of 1
{"rows":[],"time":2.542,"fields":{},"total_rows":0

Date range: >= 2020-10-01 AND < 2020-11-01
Got 9713 socrata_id entries for existing CARTO records
Got 9710 crashes from SODA entries
Filtered to 1 crashes not yet present in CARTO
Formatting data...
Inserting chunk 1 of 1
{"rows":[],"time":1.403,"fields":{},"total_rows":0

Date range: >= 2020-07-01 AND < 2020-08-01
Got 9277 socrata_id entries for existing CARTO records
Got 9277 crashes from SODA entries
Filtered to 2 crashes not yet present in CARTO
Formatting data...
Inserting chunk 1 of 1
{"rows":[],"time":1.495,"fields":{},"total_rows":0

Date range: >= 2020-02-01 AND < 2020-03-01
Got 13745 socrata_id entries for existing CARTO records
Got 13704 crashes from SODA entries
Filtered to 17 crashes not yet present in CARTO
Formatting data...
Inserting chunk 1 of 1
{"rows":[],"time":6.212,"fields":{},"total_rows":0

Of those, only 2020-02 is particularly egregious in terms of data variance -- and the numbers are sowing we have 15 MORE crashes in CARTO than we have in Socrata.

danrademacher commented 1 year ago

@gregallensworth since things are opening up a bit for you: Christine wants us to take a little time to assess why I was seeing a lot of monthly variances in crash counts as shown here. That's a lot to sort through, so to narrow this down and make it simpler:

Here are 6 months to look into and see what you can find:

These are ones where we seem to be missing about 9K crashes: Year/Month Crashmapper missing #
2020/08 -2178
2020/07 -3113
2020/06 -3515
And three months when we have about 17K extra crashes: Year/Month Crashmapper surplus #
2020/05 4913
2020/04 9571
2020/03 3332

Note that my point of comparison was not the SODA API itself, but an explort of a rollup report from the API using the Socrata API. That's linked in the sheet. But I think mostly you can just look at these six months between SODA and Carto and see what you find.

gregallensworth commented 1 year ago

2020-08

Some Backstory

We have seen cases of plain ol' data entry error on Socrata's part. We have even seem crashes in the future due to an incorrect month number, or due to swapping month and day (01/06/2020 and 06/01/2020).

And I think that's what we're seeing here.

Discrepancy

Focus on 2020/08 first, where CARTO/CrashMapper is missing crashes in SODA.

From CARTO, I exported a CSV of this query: SELECT * FROM crashes_all_prod WHERE date_val::date >= '2020-08-01' AND date_val::date <= '2020-08-31'

From SODA I exported a CSV https://data.cityofnewyork.us/resource/qiz3-axqb.csv?$order=crash_date%20DESC&$limit=50000&$where=crash_date%20>=%20'2020-08-01'%20AND%20crash_date%20<%20'2020-09-01'

CARTO CSV rows = 9,824 SODA CSV rows = 9,823

A net discrepancy of 1 crash.

In your worksheet you show 9,713 crashes in CrashMapper for this period, and I'm not sure where that came from.

Diffs

I generated lists of collision_id and socrata_id and diff'd them, and came up with five specific discrepancies:

All five are in CARTO, and I did queries at CARTO and SODA to get the details.

SELECT socrata_id, date_val FROM crashes_all_prod WHERE socrata_id IN (4340166, 4343509, 4344290, 4345572, 4349232)
https://data.cityofnewyork.us/resource/qiz3-axqb.json?$where=collision_id%3D4340166
https://data.cityofnewyork.us/resource/qiz3-axqb.json?$where=collision_id%3D4343509
https://data.cityofnewyork.us/resource/qiz3-axqb.json?$where=collision_id%3D4344290
https://data.cityofnewyork.us/resource/qiz3-axqb.json?$where=collision_id%3D4349232
https://data.cityofnewyork.us/resource/qiz3-axqb.json?$where=collision_id%3D4345572

Results

For these five crashes, the dates differ between SODA and CARTO:

socrata_id carto_date socrata_date
4340166 2020-07-15 2020-08-15
4343509 2020-07-29 2020-08-29
4344290 2020-08-01 2020-09-01
4349232 2020-08-21 2020-09-18
4345572 2020-08-30 2020-09-06

So again with NYC retroactively changing the crash dates. Three cases seem like pretty clear data entry typographyical fixes (09/01 instead of 08/01) while two are less obvious (bad reporting? looked at the wrong ticket while typing?).

gregallensworth commented 1 year ago

2020-03

From CARTO, I exported a CSV of this query: SELECT * FROM crashes_all_prod WHERE date_val::date >= '2020-03-01' AND date_val::date <= '2020-03-31'

From SODA I exported a CSV https://data.cityofnewyork.us/resource/qiz3-axqb.csv?$order=crash_date%20DESC&$limit=50000&$where=crash_date%20>=%20'2020-03-01'%20AND%20crash_date%20<%20'2020-04-01'

CARTO CSV rows = 11,034 SODA CSV rows = 11,077 Diffs: 16 in CARTO not in SODA, 59 in SODA not in CARTO.

16 in CARTO and not in SODA:

socrata_id carto_date soda_date
4298656 2020-03-07 2020-02-25
4298680 2020-03-06 2020-02-27
4298838 2020-03-07 2020-02-27
4298876 2020-03-06 2020-02-29
4298911 2020-03-06 2020-02-25
4299043 2020-03-05 2020-02-25
4299164 2020-03-07 2020-02-18
4299222 2020-03-07 2020-02-16
4299281 2020-03-03 2020-02-20
4299323 2020-03-07 2020-02-26
4299377 2020-03-07 2020-02-29
4299537 2020-03-07 2020-02-23
4299548 2020-03-04 2020-02-12
4299605 2020-03-03 2020-02-06
4299755 2020-03-07 2020-02-20
4299911 2020-03-01 2020-02-27

59 in SODA not CARTO (only deeply looked into a few):

socrata_id carto_date soda_date
4297228 2020-02-25 2020-03-03
4298646 2020-02-29 2020-03-10
4298684 2020-02-19 2020-03-10
4298744 2020-02-29 2020-03-11
4298772 2020-02-27 2020-03-10
4298774 2020-02-29 2020-03-05
4298775 2020-02-25 2020-03-09
4298787 2020-02-28 2020-03-10
4298803 2020-02-28 2020-03-09
4298859 2020-02-29 2020-03-08
4297228
4298646
4298684
4298744
4298772
4298774
4298775
4298787
4298803
4298859
4298901
4298949
4299062
4299128
4299229
4299242
4299248
4299254
4299271
4299295
4299309
4299360
4299525
4299637
4299705
4299708
4299714
4299740
4299758
4299759
4299765
4299771
4299775
4299778
4299794
4299795
4299799
4299808
4299815
4299819
4299825
4299827
4299854
4299870
4299887
4299909
4299910
4299934
4299935
4299945
4299983
4300041
4300045
4300079
4300104
4300137
4300242
4300265
4300285

Again, of the 26 discrepancies investigated deeply, all of them were retroactive date changes at NYC.

danrademacher commented 1 year ago

Ok, I think this is close enough and not worth chasing down those data changes.