edsu / pymarc

process MARC records from Python
http://python.org/pypi/pymarc
Other
251 stars 99 forks source link

A permissive MARCReader #109

Closed edsu closed 4 years ago

edsu commented 6 years ago

Over in #89 there has been a long discussion about working with invalid MARC data in the wild. I must admit I don't work with MARC much these days, so I had no idea people were running into so many problems processing large batches of MARC records.

Currently when pymarc runs into a structural problem in a MARC record it will throw an exception (RecordLeaderInvalid, BaseAddressNotFound, BaseAddressInvalid, RecordDirectoryInvalid, NoFieldsFound) which will also cause record iteration to stop.

@anarchivist offered up a PermissiveMARCReader which he has used to process large amounts of MARC data. PermissiveMARCReader catches all the exceptions thrown by structural problems with the MARC record and moves on to the next record.

Rather than introducing a new class I suggest that a new parameter named strict be added to the MARC.MARCReader constructor. When set to True it will continue to throw these exceptions. When set to False it will catch the exceptions, log them, and move on to the next record. It may be that some of these exceptions need to be relaxed, and the invalid data interpreted in some way. But let's open new issue tickets for those situations as they come up.

Based on the conversation we've been seeing lately I think the default for strict should be set to False. The MARCReader API will be backwards compatible (code that uses pymarc won't need to change). However this will be a significant change in behavior so I think a new minor version release will be needed, v3.2.

What do folks think?

timClicks commented 5 years ago

Strong +1 from me. I deal with MARC21 files from repositories, so it's basically impossible for me to fix the problem at source. Its rare that I would like to stop processing completely if I'm generating an export.

petrus-v commented 4 years ago

Hi ! I'd like to help on this issue, I'll create a PR in few days, do you have any contribution guide lines to know ?

edsu commented 4 years ago

@petrus-v PEP-8 and including unit tests (with test data if necessary) will definitely help move PRs along.

Wooble commented 4 years ago

I don't know if there's anything that could be done to be even more permissive in more places, but I think #144 pretty much covered this, so closing...