edsu / pymarc

process MARC records from Python
http://python.org/pypi/pymarc
Other
251 stars 99 forks source link

extra bytes between indicators and first subfield #106

Closed termim closed 5 years ago

termim commented 6 years ago

Sometimes MarcEdit inserts extra bytes during record editing, see https://groups.google.com/forum/#!topic/pymarc/5zxuOh0fVuc Fix by decoding only first two bytes when there are more than two bytes before the first subfield.

edsu commented 6 years ago

@termim thanks very much for the work to figure out what's going on here. Instead of patching pymarc to work with these records I wonder if it makes sense to fix the problem in MarcEdit? @reeset what do you think?

reeset commented 6 years ago

If there is a problem, I'm happy to fix it, but I'd be surprised if that is what is happening. I'd need to see an example and the steps being taken. My guess is this might be user error or bad data. This is the kind of thing that would impact a lot of users so I'd likely have heard about it before. Is there more info on the forum thread?

reeset commented 6 years ago

Actually, I took a closer look at this, and I think I might know what this is about. For display, in the mnemonic format, marcedit inserts soft left to right markers so the operating system knows when to render data left to right. This is because, say in Arabic, $0 next to Arabic data will render incorrectly because the $0 is tagged as being read right to left. This is only present in the mnemonic format. It is removed by the compiler when turning data to marc or whatever....but these characters are required to render data correctly on a screen. I have with the update posted last night updated the function that deals with this placement to use a softer character break, but the byte is still there only in the mnemonic format as my international users were finding that the display of mnemonic data wasn't working correctly without it.

Anyway, I'm assuming this is what is being referenced. If you are reading the mnemonic format, you'll need to deal with the bytes when handling right to left languages.

edsu commented 6 years ago

@reeset thanks very much for looking into this! If I'm reading Heidi's example (n the thread that @termim mentioned above) I think she is reading in binary MARC. pymarc actually doesn't have any functionality to read mnemonic format MARC.

reeset commented 6 years ago

I'll take a look at the forum for example files because this data is filtered away when you save the mnemonic file, so it should never be in the binary file (as it would foul it)

reeset commented 6 years ago

I've answered on the forum. I'll make some changes to how the data is processed in the editor. I'm not sure how the byte would have ended up in the binary file, but I've noted why they are there and how the program has modified how this process is filtered. I cannot recreate data being embedded, so, I'd say, Heidi should update (the update will filter online in a couple hours) and then let me know if she still is seeing the error. Maybe the task processing was missing the bit that tells the program that the mark was embedded. If that happens, then it should still be caught.

edsu commented 6 years ago

This is awesome @reeset! Thanks again for looking into this so quickly, and for pushing out the update. I'll leave this ticket open until Heidi has a chance to respond.

termim commented 6 years ago

I think both patching pymarc and fixing marcedit make sense. Who knows how many such files already are floating around.

On Aug 19, 2017, at 5:48 AM, Ed Summers notifications@github.com wrote:

@termim thanks very much for the work to figure out what's going on here. Instead of patching pymarc to work with these records I wonder if it makes sense to fix the problem n MarcEdit? @reeset what do you think?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

edsu commented 6 years ago

I agree that this ought to be merged to handle any data that's out there that MarcEdit has created. I'd love to have a test for it though.

edsu commented 6 years ago

I think I can add one using the record that Heidi initially provided.

edsu commented 6 years ago

Ok, I've added a test, and interestingly it and test_regression_45 seem to pass, but throw warnings like this:

WARNING:root:more than 2 indicators found: b' 4\xe2\x80\xaa\xe2\x80\xaa\xe2\x80\xaa\xe2\x80\xaa\xe2\x80\xaa\xe2\x80\xaa\xe2\x80\xaa\xe2\x80\xaa\x1f6650-08\x1fa\xd8\xa7\xd9\x84\xd9\x85\xd8\xb3\xd9\x8a\xd8\xad\xd9\x8a\xd8\xa9 \xd9\x88\xd8\xa7\xd9\x84\xd8\xaf\xd9\x8a\xd8\xa7\xd9\x86\xd8\xa7\xd8\xaa \xd8\xa7\xd9\x84\xd8\xa7\xd8\xae\xd8\xb1\xd9\x89\xe2\x80\xaa\xe2\x80\xaa\xe2\x80\xaa\xe2\x80\xaa\xe2\x80\xaa\xe2\x80\xaa\xe2\x80\xaa\xe2\x80\xaa\x1fx\xd8\xa7\xd9\x84\xd8\xa7\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85.'

I'd like to figure out if that's desired behavior or not before merging this.

termim commented 6 years ago

The behavior in this case (more than two indicators fount) is unchanged from before the patch - print warning with the whole entry content, take first two bytes as indicators and continue. IMHO it is reasonable enough.

edsu commented 6 years ago

I thought that the record fails to load before the patch?

termim commented 6 years ago

Yes it fails if extra bytes can't be decoded using ascii.

edsu commented 5 years ago

Somehow I lost the thread on this. Is there still work to do on this?

edsu commented 5 years ago

Closing due to inactivity. Please reopen, or create a new issue if needed.