equinor / segyio

Fast Python library for SEGY files.
Other
476 stars 214 forks source link

tools.wrap introduces \n at wrong place #444

Closed GGDRriedel closed 4 years ago

GGDRriedel commented 4 years ago

My Original text header looks something like this

segyfile.text[0]
bytearray(b'C 1                               PRAKLA-SEISMOS                9350484      
...

Running it through

raw_header=  segyio.tools.wrap(segyfile.text[0],width=80)

delivers

raw_header "bytearray(b'C 1 PRAKLA-SEISMOS\n9350484 C 2 LINE

It basically cuts the whitespaces and creates a newline there.

This seems to be a bug related to how return '\n'.join(textwrap.wrap(str(s), width=width))

is used

jokva commented 4 years ago

Hah, you're right, we're doing str(s) on the input to coerce it to a string, which returns the string "bytearray(...)", instead of decoding the bytearray as a string.

Unfortunately, decoding as a string by default is not really ideal either, because text headers in becdic often contain 0x80, making ascii/utf8 decoding fail.

I'll play with some alternative implementations.

jokva commented 4 years ago

Ok, so the removal of whitespace is expected, as it is what textwrap does. This is not documented, so I feel entitled to change it, but wrap is (usually?) done specifically for printing purposes, so I think removing them should be acceptable.

Now, the str() is obviously wrong, and should be replaced with a decode. I think the right call here would be to ignore decoding errors, and if that is unacceptable, fall back to letting users write their own, fit-for-purpose wrap.

GGDRriedel commented 4 years ago

I am going by one of the notebooks here: https://github.com/equinor/segyio-notebooks/blob/master/notebooks/basic/02_segy_quicklook.ipynb

The problem is that it removes the whitespaces(\s) in the "PRAKLA-SEISMOS\s\s\s\s\s\s\s\s\s\s9350484\s\s\s\s\s" string, despite them being well under the character limit of 80, so I don't really get how that is expected, shouldn't it kind of only cut the trailing whitespaces, not all inbetween?

jokva commented 4 years ago

With the "bytestring('b" prepended, your first line gets too long. :-----)

GGDRriedel commented 4 years ago

Ah I see. Seems like getting the encoding from the file and delivering that to the decode method would be the way to go forward.

https://docs.python.org/3/library/stdtypes.html references the decode method For my example, changing the line raw_header = segyio.tools.wrap(segyfile.text[0])

to

raw_header = segyio.tools.wrap(segyfile.text[0].decode('cp437'))

worked out.

jokva commented 4 years ago

Yep, but it's quite unintuitive that it doesn't just work with text[0], so I might add a check in there which tries to decode-with-ignore if it is still a byte array.

GGDRriedel commented 4 years ago

All right, cool! Maybe we can rename the issue

jokva commented 4 years ago

https://github.com/equinor/segyio/pull/445 fixes this. Thanks for the report!

GGDRriedel commented 4 years ago

Thanks a bunch!