Closed sf-dcp closed 4 months ago
Assuming this can be closed?
Assuming this can be closed?
@fvankrieken, No, it hasn't been investigated.
for the latest FacDB build from nightly QA, I compared row counts between the FGDB, CSV, and Shapefile. all three have 35,252 rows
using dbeaver, I looked at this record in the source data and final table. the value in the FACNAME
column:
YOUNG MENS CHRISTIAN ASSOCIATION OF GREATER NEW YORK
YOUNG MENS CHRISTIAN ASSOCIATION OF GREATER NEW YORK
\u1a
is in between the N and SYOUNG MEN\x1aS CHRISTIAN ASSOCIATION OF GREATER NEW YORK
since I got the right row count when openning the FGDB via geopandas, it seems like ArcGIS has a problem with the character \u1a
which may become \x1a
during the build/export
This record comes from dohmh_daycare
source data, has uid = '769b3bb0a84eac14d5f758b43b98675a'
, and has the invalid character in that source data table.
found a sql query that seems to find records with u1a
aka x1a
, a unicode escape sequence
here's what those values look like when copied from dbeaver to a jupyter notebook cell (red box is the bad character)
the same use of WHERE column ~ '\x1a'
works on the source data in dohmh_daycare
. it returns 33 rows but they all have these five facname
values in the center_name
column.
(a revealing find) using that query on the column dohmh_daycare.legal_name
returns 53 rows. this is because the bad character becomes a regular apostrophe '
in the column we use which is center_name
:
I'm guessing this character at some point automatically replaced the non-standard apostrophe character ’
. I call it non-standard because it's different from the one on everyone's keyboard which is '
and I've seen this character wreak havoc before.
From the link you provided: "\x1A is the CTRL+Z control character. It is also EOF marker". <-- yikes
What:
When reviewing different versions of FacDB outputs (23v2 from Bytes, 24v1 draft), @jackrosacker noticed that the total count of records differed between csv and geodabase formats. Specifically, there were fewer records in geodabase. After further digging, Jack found that some records in csv output included a weird character which caused ArcGIS to omit all records after its first occurrence:
As seen in the last 2 screenshots, ArcGIS shows a total of 5,935 records, stopping at the first occurrence of the unknown character.
Next steps: