AtlasOfLivingAustralia / data-management

Data management issue tracking

Issue with data encoding and reading CSVs #1073

Open cha801p opened 2 weeks ago

cha801p commented 2 weeks ago

Description: The dr10574-ho-avh-weekly job has been consistently failing over the last few weeks. Upon further investigation, the failure was traced to an issue in the fetcher. The following error was noted: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 1786: invalid start byte

Steps to Reproduce: Run the dr10574-ho-avh-weekly job. The job fails with the UnicodeDecodeError.

Testing:

Local Execution: The code was executed locally after establishing a connection with S3. It ran without errors in the local environment and also on Databox.

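Data Examination: Examining the data showed that certain characters are not valid UTF-8. The problematic values are in the latitude and longitude text, for example:

Latitude and longitude automatically calculated from verbatim grid reference. Unadjusted latitude: 42�35'S. Unadjusted longitude: 147�52'E.,-42.583333,147.866119,EPSG:4202,50,0.000278,AMG 55 571100 5287900

The � above marks the byte that fails to decode. Byte 0xb0 is the degree sign (°) in Latin-1 and Windows-1252, so the values most likely read 42°35'S and 147°52'E, and the source file is probably in one of those encodings rather than UTF-8.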
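To narrow down which rows are affected, something like the following sketch could be run against the raw bytes of the file. The function name `find_non_utf8_lines`, the sample bytes, and the column names are made up for illustration; they are not taken from the fetcher or from dr10574.

```python
def find_non_utf8_lines(data: bytes):
    """Yield (line_number, offset_in_line, raw_line) for lines that are not valid UTF-8."""
    for line_number, line in enumerate(data.splitlines(), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError as exc:
            # exc.start is the index of the first undecodable byte in this line
            yield line_number, exc.start, line


# Hypothetical sample: a header row plus one row containing a Latin-1 degree sign (0xb0).
sample = (
    b"verbatimCoordinates,decimalLatitude,decimalLongitude\n"
    b"Unadjusted latitude: 42\xb035'S,-42.583333,147.866119\n"
)

for line_number, offset, line in find_non_utf8_lines(sample):
    # Decode as Latin-1 for display only; Latin-1 maps every byte to a character.
    print(line_number, offset, hex(line[offset]), line.decode("latin-1"))
```

Working line by line keeps the report readable, and displaying the offending line as Latin-1 is only a guess at the true encoding, though it never raises.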

Suggested Solution: Preingestion needs a strategy for handling data that is not valid UTF-8. Potential solutions include:

  1. Identifying and handling non-UTF-8 encoded characters.
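  2. Implementing an error-handling strategy in the file-reading process, such as errors='ignore' (silently drops the undecodable bytes) or errors='replace' (substitutes U+FFFD, the � character). Both avoid the exception but lose the original character.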
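The two options can be compared on a small sample. The sketch below is not the fetcher's code: `decode_with_fallback` is a hypothetical helper, and cp1252 as the fallback encoding is an assumption based on the 0xb0 byte seen above.

```python
import csv
import io


def decode_with_fallback(data: bytes, fallback: str = "cp1252") -> str:
    """Decode as UTF-8; if that fails, decode the whole payload with the fallback encoding."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # errors="replace" covers the few byte values cp1252 leaves undefined
        return data.decode(fallback, errors="replace")


# Hypothetical two-line CSV containing a Latin-1 degree sign (0xb0).
raw = b"note,lat\n42\xb035'S,-42.583333\n"

ignored = raw.decode("utf-8", errors="ignore")    # drops the byte: 4235'S
replaced = raw.decode("utf-8", errors="replace")  # keeps a marker: 42\ufffd35'S
recovered = decode_with_fallback(raw)             # recovers the character: 42°35'S

rows = list(csv.reader(io.StringIO(recovered)))
print(rows)
```

Note the trade-off: the fallback re-decodes the entire payload, so a file that is mostly UTF-8 with a few stray bytes would have its valid multi-byte characters mangled. In that case errors='replace', or fixing the encoding at the source, may be the safer choice.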