VladimirAlexiev closed 4 years ago
I added a fallback: when there is an encoding problem, the RDFizer now encodes the data as UTF-8. Please tell me if this solves the problem.
I'm in Bulgaria and cp1251 is Windows-Cyrillic, so it's possible Postgres decided to use that locale by default.
It's possible. The error suggests that there's a character in that triple that Python's default encoding can't represent. Maybe it's a special character.
I only got accented chars. The forcing to UTF-8 worked.
In my experience, Python doesn't like accented characters.
A secondary bug appeared:
Traceback (most recent call last):
  File "SDM-RDFizer\rdfizer\rdfizer\semantify.py", line 3055, in semantify_postgres
    output_file_descriptor.write(triple)
  File "Python\Python37\lib\encodings\cp1251.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\xe4' in position 98: character maps to <undefined>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "../SDM-RDFizer/rdfizer/run_rdfizer.py", line 3, in <module>
    semantify(str(sys.argv[1]))
  File "SDM-RDFizer\rdfizer\rdfizer\semantify.py", line 3402, in semantify
    number_triple += executor.submit(semantify_postgres, row, row_headers, triples_map, triples_map_list, output_file_descriptor, wr, config[dataset_i]["name"],config[dataset_i]["user"], config[dataset_i]["password"], config[dataset_i]["db"], config[dataset_i]["host"]).result()
  File "Python\Python37\lib\concurrent\futures\_base.py", line 432, in result
    return self.__get_result()
  File "Python\Python37\lib\concurrent\futures\_base.py", line 384, in __get_result
    raise self._exception
  File "Python\Python37\lib\concurrent\futures\thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "SDM-RDFizer\rdfizer\rdfizer\semantify.py", line 3057, in semantify_postgres
    output_file_descriptor.write(triple.encode("utf-8"))
TypeError: write() argument must be str, not bytes
I'm sure Python can handle Unicode better (without relying on exceptions on top of exceptions); it's just a matter of figuring out what options to pass to the database.
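For reference, here is a minimal sketch of how the file-writing side could avoid both exceptions above. It assumes the output file is opened in text mode by the tool (the traceback suggests so, since `write()` rejects bytes); the triple and the file name `output.nt` are made up for illustration and are not taken from the RDFizer code.

```python
import os
import tempfile

# Hypothetical triple containing characters that cp1251 cannot represent.
triple = '<http://example.org/customer/1> <http://example.org/street> "Klötzlmüllerstr."\n'

# On a Windows-Cyrillic locale, a text file opened without an explicit
# encoding uses cp1251, which has no mapping for "ö" or "ü".
try:
    triple.encode("cp1251")
    cp1251_ok = True
except UnicodeEncodeError:
    cp1251_ok = False

# Opening the file in text mode with an explicit encoding lets write()
# accept a plain str, so no try/except or manual .encode() is needed.
path = os.path.join(tempfile.mkdtemp(), "output.nt")
with open(path, "w", encoding="utf-8") as output_file_descriptor:
    output_file_descriptor.write(triple)

# Read it back with the same encoding to confirm nothing was lost.
with open(path, encoding="utf-8") as f:
    round_trip = f.read()
```

With this approach the locale default never comes into play for the output file, which may be simpler than catching `UnicodeEncodeError` and retrying.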
I see this at the database level (as you can see, the encoding and collation don't quite match):
And here's part of the table definition:
CREATE TABLE _final.customer
(
customer_id character varying(255) COLLATE pg_catalog."default" NOT NULL,
first_name character varying(255) COLLATE pg_catalog."default",
...
When I query with pgAdmin, I see accented chars like Klötzlmüllerstr.
I found another possible solution. The exception handlers are still present, but I'll remove them if this solution works. Please tell me if it works.
I get the same error as the first time:
Traceback (most recent call last):
  File "../SDM-RDFizer/rdfizer/run_rdfizer.py", line 3, in <module>
    semantify(str(sys.argv[1]))
  File "SDM-RDFizer\rdfizer\rdfizer\semantify.py", line 3402, in semantify
    number_triple += executor.submit(semantify_postgres, row, row_headers, triples_map, triples_map_list, output_file_descriptor, wr, config[dataset_i]["name"],config[dataset_i]["user"], config[dataset_i]["password"], config[dataset_i]["db"], config[dataset_i]["host"]).result()
  File "Python\Python37\lib\concurrent\futures\_base.py", line 432, in result
    return self.__get_result()
  File "Python\Python37\lib\concurrent\futures\_base.py", line 384, in __get_result
    raise self._exception
  File "Python\Python37\lib\concurrent\futures\thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "SDM-RDFizer\rdfizer\rdfizer\semantify.py", line 2644, in semantify_postgres
    subject_value = string_substitution_array(triples_map.subject_map.value, "{(.+?)}", row, row_headers, "subject")
  File "SDM-RDFizer\rdfizer\rdfizer\functions.py", line 289, in string_substitution_array
    print(value)
  File "Python\Python37\lib\encodings\cp1251.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\xe4' in position 28: character maps to <undefined>
Could you be so kind as to run it again without the print? The change I made concerns how the output document is created, so there may still be problems if you print the string directly. From what I can tell, the character in question is an "ä".
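If the debug print needs to stay, one option is to make the console stream tolerant of characters it cannot encode. The sketch below simulates a cp1251 console with an in-memory stream (the helper `console_like` and the sample value are hypothetical); on a real console, Python 3.7+ offers `sys.stdout.reconfigure()` for the same purpose.

```python
import io
import sys

def console_like(encoding, errors="strict"):
    """Return (text_stream, raw_buffer) mimicking a console that uses `encoding`."""
    raw = io.BytesIO()
    text = io.TextIOWrapper(raw, encoding=encoding, errors=errors, write_through=True)
    return text, raw

value = "M\u00e4rz"  # hypothetical value containing "ä" (U+00E4)

# A strict cp1251 stream behaves like the failing print() in the traceback.
strict_out, _ = console_like("cp1251")
try:
    print(value, file=strict_out)
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# With errors="backslashreplace", unencodable characters are escaped
# instead of raising, so the debug output never aborts the run.
lenient_out, lenient_raw = console_like("cp1251", errors="backslashreplace")
print(value, file=lenient_out)
escaped = lenient_raw.getvalue()

# On a real console the equivalent one-liner would be:
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(errors="backslashreplace")
```

This only affects what is shown on the console; it does not change what is written to the output file.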
After producing 2.6M triples, the tool dies with:
Does it expect cp1251 or UTF-8? How do I control the encoding?
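As a starting point for answering that question, a small sketch like the following reports which encodings Python picks up from the environment. The function name `describe_encodings` is made up for illustration; it is not part of the RDFizer.

```python
import locale
import sys

def describe_encodings():
    """Collect the encodings Python uses by default in this environment."""
    return {
        # Used by open() when no encoding= argument is given.
        "locale_preferred": locale.getpreferredencoding(False),
        # Used by print() when writing to the console.
        "stdout": getattr(sys.stdout, "encoding", None),
    }

report = describe_encodings()
for name, encoding in report.items():
    print(f"{name}: {encoding}")
```

On a Windows-Cyrillic machine both values would likely be `cp1251`. They can be overridden without touching the code: setting the environment variable `PYTHONIOENCODING=utf-8` changes the console streams, and running with `python -X utf8` (or `PYTHONUTF8=1`, Python 3.7+) makes UTF-8 the default for `open()` as well. On the database side, assuming the tool connects through psycopg2 (not confirmed by the tracebacks), `connection.set_client_encoding("UTF8")` asks Postgres to send UTF-8 regardless of the server locale.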