NYCPlanning / data-engineering

Primary repository for NYC DCP's Data Engineering team
22 stars 0 forks source link

Unknown Character in FacDB output #500

Closed sf-dcp closed 4 months ago

sf-dcp commented 10 months ago

What:

When reviewing different versions of FacDB outputs (23v2 from Bytes, 24v1 draft), @jackrosacker noticed that the total count of records differed between csv and geodabase formats. Specifically, there were fewer records in geodabase. After further digging, Jack found that some records in csv output included a weird character which caused ArcGIS to omit all records after its first occurrence:

As seen in the last 2 screenshots, ArcGIS shows a total of 5,935 records, stopping at the first occurrence of the unknown character.

Next steps:

fvankrieken commented 8 months ago

Assuming this can be closed?

sf-dcp commented 8 months ago

Assuming this can be closed?

@fvankrieken, No, it hasn't been investigated.

damonmcc commented 4 months ago

for the latest FacDB build from nightly QA, I compared row counts between the FGDB, CSV, and Shapefile. all three have 35,252 rows

using dbeaver, I looked at this record in the source data and final table. the value in the FACNAME column:

since I got the right row count when openning the FGDB via geopandas, it seems like ArcGIS has a problem with the character \u1a which may become \x1a during the build/export

This record comes from dohmh_daycare source data, has uid = '769b3bb0a84eac14d5f758b43b98675a', and has the invalid character in that source data table.

damonmcc commented 4 months ago

found a sql query that seems to find records with u1a aka x1a, a unicode escape sequence

here's what those values look like when copied from dbeaver to a jupyter notebook cell (red box is the bad character)

the same use of WHERE column ~ '\x1a' works on the source data in dohmh_daycare. it returns 33 rows but they all have these five facname values in the center_name column.

(a revealing find) using that query on the column dohmh_daycare.legal_name returns 53 rows. this is because the bad character becomes a regular apostrophe ' in the column we use which is center_name:

I'm guessing this character at some point automatically replaced the non-standard apostrophe character . I call it non-standard because it's different from the one on everyone's keyboard which is ' and I've seen this character wreak havoc before.

sf-dcp commented 4 months ago

From the link you provided: "\x1A is the CTRL+Z control character. It is also EOF marker". <-- yikes