gaza-reporters / gaza-reporters.github.io


Fix names imported from CPJ data #30

Open zstumgoren opened 9 months ago

zstumgoren commented 9 months ago

CPJ data contains a lot of non-ASCII characters (i.e., Arabic and other letters that fall outside the basic English alphabet). The names appear to be mangled in Excel.

We should verify that we read the original JSON correctly and confirm how these names are encoded: likely utf-8 or latin-1 (or something else; talk to Serdar).
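A minimal sketch of that check, assuming the CPJ export is a JSON list of objects with a `name` field (the field name, file name, and sample records here are hypothetical and invented for illustration). It reads the file with an explicit encoding and lists the names that contain non-ASCII characters, so those are the ones to inspect first:

```python
import json
import tempfile
from pathlib import Path


def load_records(path, encoding="utf-8"):
    """Read a JSON file with an explicit encoding rather than the platform default."""
    with open(path, encoding=encoding) as f:
        return json.load(f)


def non_ascii_names(records, key="name"):
    """Return the names that contain at least one non-ASCII character."""
    return [rec[key] for rec in records if not rec[key].isascii()]


# Invented sample data standing in for the CPJ export (not real records).
sample = [{"name": "Jane Doe"}, {"name": "José Núñez"}]
path = Path(tempfile.mkdtemp()) / "sample.json"
path.write_text(json.dumps(sample, ensure_ascii=False), encoding="utf-8")

flagged = non_ascii_names(load_records(path))
print(flagged)
```

If the file is not actually utf-8, `open(..., encoding="utf-8")` will often raise a `UnicodeDecodeError`, which is itself a useful signal that the encoding guess is wrong.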

irenecasado commented 9 months ago

Hey @zstumgoren, I have double-checked, and the names appear correctly when I work with the file with pandas in a Jupyter notebook. For reference, the names are encoded as 'utf-8'.

zstumgoren commented 9 months ago

@irenecasado It's possible that Excel is mangling the names, but you should make certain by syncing up with @luyi-eve and together locating a few examples of mangled non-ASCII names in Excel. Then go back to Jupyter and locate those records to verify that they appear correctly there. It's often the case that even though the data in Python is in utf-8, the source of the problem is the original "read" operation (e.g. using pandas.read_csv). If you don't specify the correct encoding for the incoming data source when you read it, the data will seem to be in utf-8 in Python (which is the default encoding), but characters will have been lost or garbled during import.
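To illustrate how a wrong encoding at read time garbles a name, here is a small sketch using only the standard library. The name is an invented placeholder, not a CPJ record, and latin-1 is just one possible wrong guess:

```python
original = "Müller"  # invented placeholder name

# Bytes as they would sit on disk in a utf-8 encoded file.
raw = original.encode("utf-8")

# Reading those bytes with the wrong encoding raises no error,
# it silently produces garbled text.
misread = raw.decode("latin-1")
print(misread)  # MÃ¼ller

# Reading with the correct encoding gives the name back intact.
correct = raw.decode("utf-8")

# In this particular case the damage is reversible by undoing the wrong decode.
repaired = misread.encode("latin-1").decode("utf-8")
```

The same idea applies when reading with pandas: `pandas.read_csv` accepts an `encoding` argument, so the encoding can be stated explicitly instead of being left to the default.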

zstumgoren commented 9 months ago

@irenecasado As part of this process, you should also check any mangled names found in Excel against the names as they appear on the CPJ website. If they appear correctly there but not in Jupyter or Excel, then it's likely you need to specify the correct encoding for the data when you read it into Python.
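One rough way to shortlist names worth comparing against the CPJ website is a heuristic that flags strings which look like utf-8 text misread as latin-1. This is only a sketch under that assumption: `looks_misdecoded` is a hypothetical helper, the sample names are invented, and it will miss other kinds of damage (for example text misread as cp1252 or characters replaced with `?`):

```python
def looks_misdecoded(text):
    """Heuristic: True if text looks like utf-8 bytes that were decoded as latin-1."""
    try:
        # Undo a hypothetical wrong latin-1 decode, then try the utf-8 decode.
        candidate = text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside latin-1 (e.g. Arabic script),
        # or the bytes are not valid utf-8, so it is probably genuine text.
        return False
    # Pure ASCII round-trips unchanged; only a changed result is suspicious.
    return candidate != text


# Invented sample names (not real records); the third is deliberately garbled.
names = [
    "Jane Doe",
    "José Núñez",
    "José Núñez".encode("utf-8").decode("latin-1"),
    "سلام",
]

suspects = [n for n in names if looks_misdecoded(n)]
print(suspects)
```

Anything the heuristic flags still needs a manual check against the website, since the website is the reference for how each name should be spelled.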