IATI / refresher

A Python application which has the responsibility of tracking IATI data from around the Web and refreshing the core IATI software's data stores
GNU Affero General Public License v3.0

Fix/character encoding issue #20

Closed nosvalds closed 3 years ago

nosvalds commented 3 years ago

Issue

Trello

The World Bank's files are UTF-16 encoded. Previously we used trial and error to decode files downloaded from Blob storage, trying UTF-8, latin-1, UTF-16 and so on in turn. The World Bank files would fail UTF-8 decoding and then "successfully" decode as latin-1, because latin-1 maps every possible byte to a character and so never raises an error. Since the files are actually UTF-16, this corrupted the text, which then failed in the Validator API.
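The failure mode can be reproduced with a small sketch. The sample XML string and the `try_decode` helper are made up for illustration and are not taken from the refresher code.

```python
# Illustrative only: a tiny UTF-16 document standing in for a World Bank file.
xml_text = '<?xml version="1.0" encoding="UTF-16"?><iati-activities/>'
blob = xml_text.encode("utf-16")  # Python prepends a byte order mark (BOM)


def try_decode(data, encoding):
    """Hypothetical helper: return the decoded text, or None if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# UTF-8 fails: the BOM bytes 0xFF and 0xFE are never valid in UTF-8.
utf8_result = try_decode(blob, "utf-8")

# latin-1 never fails, so trial and error stops here with corrupted text
# (BOM characters plus a NUL character after every ASCII letter).
latin1_result = try_decode(blob, "latin-1")

# The correct encoding is only reached if latin-1 is not tried first.
utf16_result = try_decode(blob, "utf-16")
```

Here `utf8_result` is `None`, `latin1_result` is a string that differs from `xml_text`, and only `utf16_result` matches the original text.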

Now

We now try to decode with UTF-8 first. If that fails, we use the chardet library to detect the encoding and decode the blob to text using the detected encoding.
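A minimal sketch of this approach follows. The function name `decode_blob` and its `detect` parameter are assumptions for illustration, not the code in this PR. The only chardet call used is `chardet.detect`, which takes bytes and returns a dict containing an `"encoding"` key.

```python
def decode_blob(data: bytes, detect=None) -> str:
    """Decode bytes as UTF-8, falling back to a detected encoding.

    `detect` is a hypothetical hook: any callable that takes bytes and
    returns a dict with an "encoding" key. It defaults to chardet.detect.
    """
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        pass

    if detect is None:
        # Third-party dependency, imported lazily so the UTF-8 path
        # works even where chardet is not installed.
        import chardet
        detect = chardet.detect

    encoding = detect(data).get("encoding")
    if not encoding:
        # The detector may report None when it cannot make a guess.
        raise ValueError("could not detect the encoding of the blob")
    return data.decode(encoding)
```

Making the detector a parameter is a design choice of this sketch: it lets the fallback path be exercised in tests with a stub, without depending on what chardet guesses for a given input.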

Additions

I've also added a docker-compose.yml file that spins up the refresh and validate containers with a Docker-based Postgres instance for easier local testing.