bloomonkey / oai-harvest

Python package for harvesting records from OAI-PMH provider(s).

Encoding issue #18

Closed Phil1717 closed 7 years ago

Phil1717 commented 7 years ago

Good morning John,

I am using your project to fetch OAI-PMH data and have run into a problem. It manages to pull about 200k entries and then consistently fails on a single one with this error:

ERROR 'ascii' codec can't encode character u'\xfc' in position 10: ordinal not in range(128)

I don't see any options for dealing with encoding issues. Filtering these entries out would be counterproductive, but I would happily have the script just skip the delinquent entries if they can't be transliterated.

Do you have any ideas?

Thank you for your time and your project, Phil
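
The skip-on-failure behaviour asked for above could be sketched roughly as follows. This is only an illustration: `encode_or_skip` is a hypothetical helper, not part of oai-harvest, and the sample records are made up. It tries a strict ASCII encode for each record and sets aside the ones that raise `UnicodeEncodeError`.

```python
def encode_or_skip(records, encoding="ascii"):
    """Hypothetical helper: encode each text record, collecting failures.

    Returns a pair (encoded, skipped). Records that cannot be represented
    in the target encoding are returned untouched in `skipped`.
    """
    encoded, skipped = [], []
    for record in records:
        try:
            encoded.append(record.encode(encoding))
        except UnicodeEncodeError:
            # Same class of error as in the report: the record contains a
            # character outside the target encoding (here u'\xfc').
            skipped.append(record)
    return encoded, skipped


# Made-up sample data; the second record contains u'\xfc'.
records = [u"plain title", u"M\xfcnchen"]
encoded, skipped = encode_or_skip(records)

# A lossy alternative to skipping: substitute '?' for unencodable characters.
lossy = u"M\xfcnchen".encode("ascii", "replace")
```

Skipping or substituting loses data either way, so writing the output as UTF-8 is usually the better fix when the destination allows it.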

bloomonkey commented 7 years ago

Hi Phil,

Apologies for the slow reply. I've not been [...] Are you able to share the full stack trace so that I can track down the offending portion of the code?

Thanks, John

Phil1717 commented 7 years ago

I traced it down to line 174 in harvest.py

I figure that, as it stands, it should fail every time it encounters a non-ASCII character in the fetched data.

I tried rebuilding the project with the codecs module imported and using:

            with codecs.open(fp, 'w', 'UTF-8') as fh:
                fh.write(metadata)

But got this error: LookupError: setuptools-scm was unable to detect version for '/home/phil/Downloads/oai-harvest-develop'.

I ended up coding a rudimentary OAI-PMH client with urllib, though it's less than perfect. I would love to go back to using yours if you end up adding this UTF-8 fix.
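
For reference, the idea behind the attempted fix (opening the output file with an explicit encoding rather than relying on the default) can be sketched on its own, outside the project. This is a sketch under assumptions: `write_record` and the sample `metadata` string are hypothetical, the metadata is assumed to be a text (unicode) string, and the actual code at line 174 of harvest.py is not reproduced here. `io.open` is used because it accepts an `encoding` argument on both Python 2 and Python 3.

```python
import io
import os
import tempfile


def write_record(path, metadata, encoding="utf-8"):
    """Hypothetical helper: write a text record using an explicit encoding."""
    # With an explicit encoding, non-ASCII characters are encoded as UTF-8
    # instead of going through an implicit ASCII conversion.
    with io.open(path, "w", encoding=encoding) as fh:
        fh.write(metadata)


# Made-up record containing u'\xfc', the character from the error message.
metadata = u"<dc:title>M\xfcnchen</dc:title>"
path = os.path.join(tempfile.mkdtemp(), "record.xml")
write_record(path, metadata)

# Read the file back as raw bytes to confirm what was written.
with io.open(path, "rb") as fh:
    raw = fh.read()
```

Reading the bytes back shows `u'\xfc'` stored as the two-byte UTF-8 sequence `\xc3\xbc`.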

bloomonkey commented 7 years ago

Are you able to share the OAI-PMH source that showed up the error?

Phil1717 commented 7 years ago

Of course: http://api.openaire.eu/oai_pmh