bloomonkey / oai-harvest

Python package for harvesting records from OAI-PMH provider(s).

metadata output not exactly in utf8 encoding... #33

Open sdm7g opened 4 years ago

sdm7g commented 4 years ago

Metadata output appears to be ASCII, with non-ASCII Unicode characters encoded as numeric character references. This is legal for the default UTF-8 encoding, since ASCII is a subset of it, but it is not what I, and I think most people, want or expect. (This may be the same issue reported as #32. It was also reported to me by Columbia.edu, and I was able to reproduce it on both my OAI feed and theirs.)
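A minimal sketch of the symptom, using the standard library's xml.etree.ElementTree as a stand-in for the serializer in metadata.py (an assumption, chosen so the snippet has no third-party dependency). The sample record is made up for illustration and does not come from a real OAI feed.

```python
import xml.etree.ElementTree as ET

# Hypothetical record for illustration; not taken from a real OAI feed.
record = ET.Element("title")
record.text = "Café"

# With no encoding argument, tostring() serializes to US-ASCII bytes and
# escapes every non-ASCII character as a numeric character reference.
default_output = ET.tostring(record)
print(default_output)

# The result is still well-formed XML: parsing it restores the original text,
# which is why the output is legal even though it is not what was expected.
round_tripped = ET.fromstring(default_output).text
print(round_tripped == "Café")
```

The escaped form only looks wrong to a reader or to a tool that treats the output as plain text; an XML parser recovers the original characters.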

I initially tried adding encoding="UTF-8" to the etree.tostring call in metadata.py. This worked under Python 3.x but failed under Python 2.x.

Adding encoding="unicode" instead appears to be the correct fix, and it seems to work under both Python 2.x and Python 3.x.

Under Python 2.x, encoding="UTF-8" returns a <type 'str'> that contains UTF-8-encoded bytes for the non-ASCII characters, which may then raise an error when it is coerced to <type 'unicode'>. encoding="unicode" returns <type 'unicode'> directly.
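The difference in return types can be sketched as follows. This is an illustration only: it runs on Python 3 and uses the standard library's xml.etree.ElementTree as a stand-in, on the assumption that the etree.tostring call in metadata.py treats the encoding argument the same way. The sample element is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical record for illustration.
record = ET.Element("title")
record.text = "Café"

# encoding="UTF-8" returns encoded bytes, with "é" stored as two bytes.
as_bytes = ET.tostring(record, encoding="UTF-8")

# encoding="unicode" returns text, with "é" kept as a single character.
as_text = ET.tostring(record, encoding="unicode")

print(type(as_bytes).__name__, type(as_text).__name__)
print(as_text)

# Bytes have to be decoded explicitly before they can be combined with text;
# relying on an implicit ASCII coercion is what fails on non-ASCII input.
decoded = as_bytes.decode("utf-8")
print(as_text in decoded)
```

Returning text from the serializer leaves the choice of output encoding to the code that writes the file, so no implicit coercion is needed along the way.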

See: https://github.com/sdm7g/oai-harvest/blob/fix-pyoai/oaiharvest/metadata.py#L51-L53