edsu / pymarc

process MARC records from Python
http://python.org/pypi/pymarc

as_marc() should throw exception on fields that are too big #42

Open nahuelange opened 10 years ago

nahuelange commented 10 years ago

I use pymarc to generate ISO 2709 records from internal data. When I read back the records generated by pymarc (even with pymarc itself!) I get a directory offset problem. With yaz-marcdump the errors are: (Directory offset 204: Bad value for data length and/or length starting (394\x1E##\x1Fa9782)) (Base address not at end of directory, base 194, end 205) (Directory offset 132: Data out of bounds 51665 >= 15063)

What's wrong?

edsu commented 10 years ago

Can you share some code that exhibits the problem?

nahuelange commented 10 years ago

You can find the generated record here: https://filez.ahtna.org/ukh5 As for the code, I can share part of it, but note that I tested with both Record(force_utf8=True) and force_utf8=False, and the result is approximately the same.

edsu commented 10 years ago

Yes, please share the code so we can replicate the problem.

nahuelange commented 10 years ago

One part of the code is: https://gist.github.com/nahuelange/c10a28d62145389d3e35

I can't really share the data; would a pickled Record object be useful to you?

edsu commented 10 years ago

Please see if you can write a piece of standalone code that demonstrates the problem. Then we can help, hopefully :-)

nahuelange commented 10 years ago

Here is an example that reproduces the problem: https://gist.github.com/nahuelange/d36c15d57e82c6e006b4

edsu commented 10 years ago

When I try to read the resulting record in with pymarc I see this error:

pymarc.exceptions.RecordDirectoryInvalid: Invalid directory

Do you see the same thing?

nahuelange commented 10 years ago

Yes, with pymarc I get this exception, and dumping to a file and reading it with yaz-marcdump I get this: 38934 2200038 4500 (Directory offset 36: Bad value for data length and/or length starting (0\x1E##\x1Fa012345)) (Base address not at end of directory, base 38, end 37) (Directory offset 24: Data out of bounds 53926 >= 38934)

edsu commented 10 years ago

I played around with your example a bit and simplified it to this https://gist.github.com/edsu/f8f0e33afcbbcaf7d194 — do you see the same error with that?

nahuelange commented 10 years ago

I have this error: RecordDirectoryInvalid: Invalid directory

edsu commented 10 years ago

Now change the 9995 to 9994, and it works? No error?

nahuelange commented 10 years ago

Well, so what? I know that the record length is too big, but pymarc should not raise an exception…

edsu commented 10 years ago

You are right, a better exception should be thrown by pymarc when you call as_marc(). But the structure of a MARC21 directory does not support fields of that size.
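For context on why the directory can't represent such fields: each entry in an ISO 2709 directory is a fixed 12 bytes — a 3-character tag, a 4-digit field length, and a 5-digit starting offset — so no field longer than 9999 bytes can be encoded. A stdlib-only sketch (not pymarc's code) that makes the overflow concrete:

```python
def directory_entry(tag, field_length, offset):
    """Build one 12-byte ISO 2709 directory entry: tag(3) + length(4) + start(5)."""
    if field_length > 9999:
        raise ValueError(
            "field %s is %d bytes; the 4-digit length slot in the "
            "directory caps fields at 9999 bytes" % (tag, field_length)
        )
    return "%s%04d%05d" % (tag, field_length, offset)

print(directory_entry("245", 42, 0))   # → '245004200000'

try:
    directory_entry("520", 10000, 0)   # one byte over the limit
except ValueError as e:
    print(e)
```

This is exactly the boundary the 9995-vs-9994 experiment above was probing.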

nahuelange commented 10 years ago

Well, it's absurd. It's 2013 and this format has existed since the 1960s; we should just ignore this directory, which is strictly unnecessary for reading and writing MARC.

edsu commented 10 years ago

I completely agree. Do you need to write MARC21 or can you use something else like MARCXML?

nahuelange commented 10 years ago

We write MARC21 and UNIMARC to provide ISO 2709 records to the libraries that subscribe to our services. In any case, we'll have to truncate some big fields, because ILSs are not able to read this … format.

Thanks,

edsu commented 10 years ago

I want to leave this open to get a better exception being thrown. It wasn't at all clear what the problem was. I apologize for the suckitude of the MARC21 format. The sooner it can be a thing of the past the better. pymarc was largely written to be an escape mechanism, not a means to perpetuate the format.

anarchivist commented 10 years ago

The maximum length of data in a variable field in UNIMARC -- as well as MARC21 -- is 9,999 bytes (see the bottom of this page, in the "Directory Map" section). You cannot serialize a MARC record into MARC21 or UNIMARC and have a field over this length because the data format cannot handle that. The field length includes the indicators as well as all of the subfields.

To serialize this data, your approach in this case would probably be to break each translation of the description text into a separate tag by language.
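To make the length rule concrete: a serialized variable field is two indicator characters, then for each subfield a delimiter byte, its code, and its data, then a one-byte field terminator. A hedged stdlib sketch (the helper names and the per-language $9 subfield are illustrative, not pymarc API) that measures a field and emits one field per language instead of one giant field:

```python
FT = b"\x1e"  # field terminator
SF = b"\x1f"  # subfield delimiter

def field_length(indicators, subfields):
    """Bytes the field would occupy in ISO 2709: indicators + subfields + terminator."""
    body = b"".join(SF + code.encode("utf-8") + data.encode("utf-8")
                    for code, data in subfields)
    return len(indicators.encode("utf-8")) + len(body) + len(FT)

# Instead of one field holding every translation, emit one 520 per language.
translations = {"eng": "A description...", "fre": "Une description..."}
fields = [("520", "  ", [("a", text), ("9", lang)])
          for lang, text in sorted(translations.items())]

for tag, ind, subs in fields:
    assert field_length(ind, subs) <= 9999  # each field now fits the directory
```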

edsu commented 10 years ago

@anarchivist I thought we already covered that.

nahuelange commented 10 years ago

@anarchivist This isn't really relevant in 2013. We don't need the length of the record; we can just use separators to delimit each field/subfield. The record and field lengths were used by VERY OLD systems that are no longer in use today. This format is a PITA.

nahuelange commented 10 years ago

Look at the code here: https://github.com/eiro/p5-marc-mir/blob/master/lib/MARC/MIR.pm#L157

It doesn't use the record length to parse ISO 2709 and build a structure from it.
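That separator-driven approach is easy to sketch in Python with nothing but the standard library: ignore the leader and directory lengths entirely and split on the field and subfield delimiters. This is a lenient toy parser for illustration, not how pymarc works:

```python
RT, FT, SF = b"\x1d", b"\x1e", b"\x1f"  # record / field / subfield separators

def lenient_parse(record):
    """Split an ISO 2709 record on separators only, ignoring leader/directory lengths."""
    body = record.rstrip(RT)
    # Everything before the first field terminator is leader + directory; drop it.
    chunks = body.split(FT)[1:]
    return [chunk.split(SF) for chunk in chunks if chunk]

raw = (b"00000nam a2200000 a 4500"      # leader (its lengths may be wrong; ignored)
       b"001000800000" b"245001500008"  # directory (also ignored)
       + FT
       + b"0001234" + FT                # 001 control field
       + b"10" + SF + b"aTitle here"    # 245 with indicators and one subfield
       + FT + RT)

for field in lenient_parse(raw):
    print(field)
```

Because nothing depends on the stated lengths, this reads records whose directory offsets are wrong — at the cost of accepting records other tools will reject.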

anarchivist commented 10 years ago

@edsu In terms of considering the exception, it looks like pymarc should probably compare leader byte 20 to the length of a given pymarc.Field's as_marc() return value.
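The check being proposed could look something like this inside a serializer: measure each field's encoded length before writing its directory entry, and fail fast with a descriptive exception. This is a sketch of the idea, not pymarc's actual code; `FieldTooLong` is a hypothetical exception name:

```python
class FieldTooLong(ValueError):
    """A variable field exceeds the 9999-byte limit of the ISO 2709 directory."""

def check_field(tag, serialized, limit=9999):
    """Return the serialized field bytes, or raise if they can't fit the directory."""
    if len(serialized) > limit:
        raise FieldTooLong(
            "field %s serializes to %d bytes, but the 4-digit directory "
            "length caps fields at %d" % (tag, len(serialized), limit)
        )
    return serialized

check_field("245", b"10\x1faShort title\x1e")  # fine, returned unchanged
try:
    check_field("520", b"0 \x1fa" + b"x" * 10000 + b"\x1e")
except FieldTooLong as e:
    print(e)
```

An error naming the offending tag and its size would have made the original report trivial to diagnose, instead of surfacing later as "Invalid directory".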

@nahuelange Like @edsu, I'm sorry that this is frustrating, but this is a limitation of the format. If you really need to have values longer than what UNIMARC serialized in ISO 2709 will allow, please consider using MARCXML.

nahuelange commented 10 years ago

@anarchivist It's not a real format limitation, because that information is useless today. To be interoperable we CAN'T use MARCXML; we provide our records to partners that use ILSs.

anarchivist commented 10 years ago

@nahuelange Please don't get frustrated with me, I'm trying to help you by explaining the limitations. If you'd like to develop a workaround using pymarc, by all means do so.

edsu commented 10 years ago

@nahuelange as @anarchivist suggested you could consider shortening the fields so they fit within the constraints. Unfortunately MARC interchange format is frozen in time. If you have to work with legacy systems that use it, you will have to work within the constraints of the format. If you are designing a new system I would strongly advise you not to perpetuate its use.
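If truncation is the pragmatic way out, it has to happen on the encoded bytes, since the 9999 limit is in bytes, not characters — and a naive byte slice can cut a UTF-8 sequence in half. A stdlib sketch of a byte-safe truncation helper (an assumption about how one might do it, not anything in pymarc):

```python
def truncate_utf8(data, max_bytes):
    """Truncate a string so its UTF-8 encoding fits in max_bytes,
    without splitting a multi-byte character."""
    encoded = data.encode("utf-8")
    if len(encoded) <= max_bytes:
        return data
    # errors="ignore" drops any trailing partial character left by the slice.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")

# 'h' is 1 byte, 'é' is 2 bytes in UTF-8, so a 3-byte budget keeps both.
print(truncate_utf8("héllo", 3))   # → 'hé'
```

In practice the budget passed in would be 9999 minus the indicators, subfield markers, and field terminator, so the whole field stays under the directory limit.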

nahuelange commented 10 years ago

The problem is we can't predict the size of the record.

edsu commented 10 years ago

Can you describe your use case a bit more?