Issue
Trello
World Bank's files are UTF-16 encoded. Previously we decoded files downloaded from Blob storage by trial and error, trying UTF-8, latin-1, UTF-16 and so on in turn. For the World Bank files, UTF-8 decoding failed, and then latin-1 decoding "succeeded", because latin-1 maps every possible byte to a character and so never raises an error. Since the files are actually UTF-16, the resulting text was corrupted and then failed in the Validator API.
Now
We now try to decode with UTF-8 first. If that fails, we use the chardet library to detect the encoding and decode the blob to text using the detected encoding.
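A minimal sketch of that decode order, for illustration only and not the exact code in this PR. The names `decode_blob` and `_chardet_detect` are hypothetical, and the detector is passed in as a parameter (defaulting to chardet) purely so the fallback can be exercised without the third-party package installed.

```python
from typing import Callable, Optional


def _chardet_detect(raw: bytes) -> Optional[str]:
    """Return chardet's best-guess encoding name, or None if it has no guess."""
    # chardet is a third-party package (pip install chardet); imported lazily
    # so the UTF-8 fast path does not depend on it.
    import chardet

    return chardet.detect(raw)["encoding"]


def decode_blob(
    raw: bytes,
    detect: Callable[[bytes], Optional[str]] = _chardet_detect,
) -> str:
    """Decode blob bytes: strict UTF-8 first, then the detected encoding."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass  # not valid UTF-8, fall through to detection

    encoding = detect(raw)
    if encoding is None:
        # Assumption: failing loudly is preferable to guessing latin-1,
        # which accepts any bytes and silently corrupts UTF-16 input.
        raise ValueError("could not detect the blob's encoding")
    return raw.decode(encoding)
```

Unlike the old trial-and-error loop, this never falls back to latin-1 blindly, so a UTF-16 file is either decoded with its detected encoding or rejected with an error.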
Additions
I've also added a docker-compose.yml file that spins up the refresh and validate containers with a Docker-based Postgres instance for easier local testing.