SDM-TIB / SDM-RDFizer

An Efficient RML-Compliant Engine for Knowledge Graph Construction
https://doi.org/10.5281/zenodo.3872103
Apache License 2.0

UnicodeEncodeError: 'charmap' codec can't encode character '\xe4' in position 109: character maps to <undefined> #25

Closed VladimirAlexiev closed 4 years ago

VladimirAlexiev commented 4 years ago

After producing 2.6M triples, the tool dies with:

  File "../SDM-RDFizer/rdfizer/run_rdfizer.py", line 3, in <module>
    semantify(str(sys.argv[1]))
  File "SDM-RDFizer\rdfizer\rdfizer\semantify.py", line 3342, in semantify
    number_triple += executor.submit(semantify_postgres, row, row_headers, triples_map, triples_map_list, output_file_descriptor, wr, config[dataset_i]["name"],config[dataset_i]["user"], config[dataset_i]["password"], config[dataset_i]["db"], config[dataset_i]["host"]).result()
  File "Python\Python37\lib\concurrent\futures\_base.py", line 432, in result
    return self.__get_result()
  File "Python\Python37\lib\concurrent\futures\_base.py", line 384, in __get_result
    raise self._exception
  File "Python\Python37\lib\concurrent\futures\thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "SDM-RDFizer\rdfizer\rdfizer\semantify.py", line 3006, in semantify_postgres
    output_file_descriptor.write(triple)
  File "Python\Python37\lib\encodings\cp1251.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\xe4' in position 109: character maps to <undefined>

Does it expect cp1251 or UTF-8? How do I control the encoding?
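
For context on the question above: in Python 3, a file opened in text mode without an `encoding` argument uses the locale's preferred encoding, which is cp1251 on a Windows machine set up for Cyrillic. The following is a minimal sketch, not the RDFizer's actual code; the helper names `write_triples` and `can_encode` and the sample triple are hypothetical. It shows that cp1251 cannot encode "ä" and that passing `encoding="utf-8"` to `open()` removes the dependence on the locale.

```python
import os
import tempfile

# Hypothetical sample triple; "\u00e4" is the "ä" reported in the traceback.
TRIPLE = '<http://example.org/c/1> <http://example.org/name> "M\u00e4rz" .\n'


def write_triples(path, triples, encoding="utf-8"):
    """Write triples to `path` in text mode with an explicit encoding."""
    with open(path, "w", encoding=encoding) as output_file:
        for triple in triples:
            output_file.write(triple)


def can_encode(text, encoding):
    """Return True if `text` can be represented in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# cp1251 (Windows Cyrillic) has no slot for "ä"; UTF-8 covers all of Unicode.
print(can_encode(TRIPLE, "cp1251"))  # False
print(can_encode(TRIPLE, "utf-8"))   # True

path = os.path.join(tempfile.mkdtemp(), "output.nt")
write_triples(path, [TRIPLE])  # does not depend on the system locale
```

With the encoding fixed when the file is opened, no per-write fallback should be needed.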

eiglesias34 commented 4 years ago

I added a fallback: when an encoding problem occurs, the RDFizer now encodes the data as UTF-8. Please tell me whether this solves the problem.

VladimirAlexiev commented 4 years ago

I'm in Bulgaria and cp1251 is Windows-Cyrillic, so it's possible Postgres decided to use that locale by default.

eiglesias34 commented 4 years ago

It's possible. The error suggests that the triple contains a character that Python's default encoding can't represent. Maybe it's a special character.

VladimirAlexiev commented 4 years ago

I only have accented characters. Forcing UTF-8 worked.

eiglesias34 commented 4 years ago

In my experience, Python doesn't like accented characters.

VladimirAlexiev commented 4 years ago

A secondary bug appeared:

Traceback (most recent call last):
  File "SDM-RDFizer\rdfizer\rdfizer\semantify.py", line 3055, in semantify_postgres
    output_file_descriptor.write(triple)
  File "Python\Python37\lib\encodings\cp1251.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\xe4' in position 98: character maps to <undefined>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "../SDM-RDFizer/rdfizer/run_rdfizer.py", line 3, in <module>
    semantify(str(sys.argv[1]))
  File "SDM-RDFizer\rdfizer\rdfizer\semantify.py", line 3402, in semantify
    number_triple += executor.submit(semantify_postgres, row, row_headers, triples_map, triples_map_list, output_file_descriptor, wr, config[dataset_i]["name"],config[dataset_i]["user"], config[dataset_i]["password"], config[dataset_i]["db"], config[dataset_i]["host"]).result()
  File "Python\Python37\lib\concurrent\futures\_base.py", line 432, in result
    return self.__get_result()
  File "Python\Python37\lib\concurrent\futures\_base.py", line 384, in __get_result
    raise self._exception
  File "Python\Python37\lib\concurrent\futures\thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "SDM-RDFizer\rdfizer\rdfizer\semantify.py", line 3057, in semantify_postgres
    output_file_descriptor.write(triple.encode("utf-8"))
TypeError: write() argument must be str, not bytes

I'm sure Python can handle Unicode better than this, without relying on exceptions on top of exceptions; it's just a matter of figuring out what options to pass to the database.
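
The `TypeError` above follows from how text-mode files work: `write()` on a text stream accepts only `str`, and the stream does the encoding itself, so the fallback `triple.encode("utf-8")` hands it `bytes` and fails. A minimal sketch of that behaviour, using an in-memory stream as a stand-in for the output file (the variable names and the sample triple are illustrative, not taken from semantify.py):

```python
import io

# Hypothetical triple; "\u00e4" is the "ä" from the traceback.
triple = '<http://example.org/c/1> <http://example.org/name> "B\u00e4cker" .\n'

# Stand-in for a text-mode output file whose encoding defaulted to cp1251.
raw = io.BytesIO()
text_file = io.TextIOWrapper(raw, encoding="cp1251")

errors = []
try:
    text_file.write(triple)  # cp1251 cannot represent "ä"
except UnicodeEncodeError as exc:
    errors.append(type(exc).__name__)
    try:
        # The fallback seen in the traceback: a text stream rejects bytes.
        text_file.write(triple.encode("utf-8"))
    except TypeError as exc2:
        errors.append(type(exc2).__name__)

print(errors)  # ['UnicodeEncodeError', 'TypeError']

# Choosing the encoding when the stream is created needs no fallback at all.
raw_utf8 = io.BytesIO()
utf8_file = io.TextIOWrapper(raw_utf8, encoding="utf-8", newline="\n")
utf8_file.write(triple)
utf8_file.flush()
print(raw_utf8.getvalue() == triple.encode("utf-8"))  # True
```

The same reasoning applies to a real file: either open it in text mode with `encoding="utf-8"` and write `str`, or open it in binary mode and write encoded `bytes`, but not a mix of the two.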

I see this at the database level (as you can see, the encoding and collation don't quite match): [image]

And here's part of the table definition:

CREATE TABLE _final.customer
(
    customer_id character varying(255) COLLATE pg_catalog."default" NOT NULL,
    first_name character varying(255) COLLATE pg_catalog."default",
   ...

When I query with pgAdmin, I see accented chars like Klötzlmüllerstr.

eiglesias34 commented 4 years ago

I found another possible solution. The exception handlers are still present, but I'll remove them if this solution works. Please tell me if it works.

VladimirAlexiev commented 4 years ago

I get the same error as the first time:

Traceback (most recent call last):
  File "../SDM-RDFizer/rdfizer/run_rdfizer.py", line 3, in <module>
    semantify(str(sys.argv[1]))
  File "SDM-RDFizer\rdfizer\rdfizer\semantify.py", line 3402, in semantify
    number_triple += executor.submit(semantify_postgres, row, row_headers, triples_map, triples_map_list, output_file_descriptor, wr, config[dataset_i]["name"],config[dataset_i]["user"], config[dataset_i]["password"], config[dataset_i]["db"], config[dataset_i]["host"]).result()
  File "Python\Python37\lib\concurrent\futures\_base.py", line 432, in result
    return self.__get_result()
  File "Python\Python37\lib\concurrent\futures\_base.py", line 384, in __get_result
    raise self._exception
  File "Python\Python37\lib\concurrent\futures\thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "SDM-RDFizer\rdfizer\rdfizer\semantify.py", line 2644, in semantify_postgres
    subject_value = string_substitution_array(triples_map.subject_map.value, "{(.+?)}", row, row_headers, "subject")
  File "SDM-RDFizer\rdfizer\rdfizer\functions.py", line 289, in string_substitution_array
    print(value)
  File "Python\Python37\lib\encodings\cp1251.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\xe4' in position 28: character maps to <undefined>

eiglesias34 commented 4 years ago

Could you be so kind as to run it again without the print? The change I made concerns the creation of the document, so there may still be problems if you print the string directly. From what I can determine, the character in question is an "ä".
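
The last traceback fails in `print(value)` rather than in the file write, because `sys.stdout` has its own encoding (cp1251 in this console), independent of the output file. As a hedged sketch, assuming Python 3.7 or later: `reconfigure()` can change the encoding or the error handler of a text stream such as `sys.stdout`. The example simulates a cp1251 console with an in-memory stream so that it behaves the same on any machine; `make_console` and the sample value are hypothetical.

```python
import io


def make_console(encoding, errors="strict"):
    """Return (stream, raw) imitating a console that uses `encoding`."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding, errors=errors, newline="\n")
    return stream, raw


value = "http://example.org/customer/J\u00e4ger"  # contains "ä"

# 1. Strict cp1251 console: print() raises, as in the traceback.
console, _ = make_console("cp1251")
try:
    print(value, file=console)
    failed = False
except UnicodeEncodeError:
    failed = True
print(failed)  # True

# 2. Same kind of console after reconfigure(): unencodable characters are
#    escaped instead of raising. For the real console this would be
#    sys.stdout.reconfigure(errors="backslashreplace").
console, raw = make_console("cp1251")
console.reconfigure(errors="backslashreplace")
print(value, file=console)
console.flush()
print(raw.getvalue())  # b'http://example.org/customer/J\\xe4ger\n'
```

Setting the `PYTHONIOENCODING` environment variable (for example to `utf-8`) before starting Python is another way to change the encoding of the standard streams without touching the code.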