destream-py / destream

A tool & Python 3 library to decompress anything
GNU General Public License v2.0

Add macos to GHA (and discover a libmagic bug) #20

Closed eumiro closed 3 years ago

eumiro commented 3 years ago

Disclaimer: I have no experience with macos, so just trying…

eumiro commented 3 years ago

This is interesting! macos-latest fails exactly at the same place as an Arch-based Linux distro (more details below the error message):

  =================================== FAILURES ===================================
  _____________________ GuesserTest.test_30_zip_single_file ______________________

  self = <tests.test_30_decompressors.GuesserTest testMethod=test_30_zip_single_file>

      def test_30_zip_single_file(self):
          uncompressed = BytesIO(b"Hello World\n")
          uncompressed.name = 'test_file'
          raw = BytesIO()
          raw.name = "test_file.zip"
          zip = zipfile.ZipFile(raw, 'w')
          try:
              zip.writestr("test_file", uncompressed.getvalue())
          finally:
              zip.close()
          raw.seek(0)
          self._check_decompressor(
              destream.decompressors.Unzip,
  >           raw, uncompressed)

  tests/test_30_decompressors.py:301: 
  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
  tests/test_30_decompressors.py:31: in _check_decompressor
      decompressor._guess(mime, str(archive.realname), compressed_fileobj)
  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

  cls = <class 'destream.decompressors.zip.Unzip'>
  mime = 'application/octet-stream', name = 'test_file.zip'
  fileobj = <_io.BytesIO object at 0x110e6c350>

      @classmethod
      def _guess(cls, mime, name, fileobj):
          if getattr(cls, '_unique_instance', False):
              if cls in fileobj._decompressors:
                  raise ValueError("class %s already in the decompressor list")
          realname = name
          if hasattr(cls, '_mimes'):
              match = RE_EXTENSION.search(name)
              if hasattr(cls, '_extensions') and \
                 match.group(2) and \
                 os.path.normcase(match.group(3)) in cls._extensions:
                  realname = match.group(1)
              if mime not in cls._mimes:
                  raise ValueError(
                      (cls, mime, name, fileobj),
  >                   "can not decompress fileobj using class %s" % cls.__name__)
  E               ValueError: ((<class 'destream.decompressors.zip.Unzip'>, 'application/octet-stream', 'test_file.zip', <_io.BytesIO object at 0x110e6c350>), 'can not decompress fileobj using class Unzip')

  destream/archive.py:65: ValueError

The error says that the detected MIME type, application/octet-stream, is not in the list ['application/zip'] accepted by the Unzip class. On Ubuntu the test passes, so there might be some difference in libmagic.

The difference between ubuntu/debian and arch/macos is in the packaging of the libmagic.so file:

Does anyone have an idea?

jruere commented 3 years ago

I have:

lrwxrwxrwx 1 root root   17 Jun 16  2020 /usr/lib/libmagic.so -> libmagic.so.1.0.0
lrwxrwxrwx 1 root root   17 Jun 16  2020 /usr/lib/libmagic.so.1 -> libmagic.so.1.0.0
-rwxr-xr-x 1 root root 162K Jun 16  2020 /usr/lib/libmagic.so.1.0.0

So it seems equivalent.

file works so it's not the library.

I don't see anything wrong in the test or the library. ArchiveFile.peek() (line 30) returns 128 bytes, which is the entire file.

jruere commented 3 years ago

From python-magic: https://github.com/ahupp/python-magic/issues/166

But I could not reproduce on Arch 2021-01-11.

eumiro commented 3 years ago

@jruere what do you get with the following code?

❯ python
>>> import zipfile, magic
>>> z = zipfile.ZipFile('a.zip', 'w')
>>> z.writestr("a.txt", "hello world\n")
>>> z.close()
>>> magic.from_buffer(open("a.zip", 'rb').read(), mime=True)
'application/octet-stream'
>>> 
❯ file --mime-type a.zip
a.zip: application/octet-stream

But:

❯ echo "hello world" > b.txt
❯ zip b.zip b.txt
  adding: b.txt (stored 0%)
❯ file --mime-type b.zip
b.zip: application/zip

and then:

❯ file a.zip b.zip
a.zip:    Zip archive data, made by v2.0 UNIX, extract using at least v2.0, last modified Sun Sep  8 17:24:03 2013, uncompressed size 12, method=store
b.zip:    Zip archive data, at least v1.0 to extract

jruere commented 3 years ago

I can reproduce the problem with the procedure you gave.

Finally,

$ file a.zip b.zip
a.zip: Zip archive data, made by v2.0 UNIX, extract using at least v2.0, last modified Sun Sep  8 15:13:14 2013, uncompressed size 12, method=store
b.zip: Zip archive data, at least v1.0 to extract

This looks like a bug in libmagic...
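
One concrete difference visible in the `file` output above is the "version needed to extract" field: v2.0 for the archive written by Python, v1.0 for the one written by `zip`. As a minimal sketch, that field can be read straight from the local file header. `local_header_version` is a hypothetical helper name, and nothing in this thread establishes that this field is what confuses libmagic.

```python
import io
import struct
import zipfile


def local_header_version(data: bytes) -> int:
    """Return the 'version needed to extract' from the first local file header.

    The value is stored as major * 10 + minor, so 20 means v2.0.
    (Hypothetical helper, not part of destream or python-magic.)
    """
    # Bytes 0-3: signature, bytes 4-5: version needed (little-endian uint16).
    signature, version = struct.unpack("<4sH", data[:6])
    if signature != b"PK\x03\x04":
        raise ValueError("data does not start with a ZIP local file header")
    return version


# Build the same kind of archive as a.zip, but in memory.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("a.txt", "hello world\n")

print(local_header_version(buffer.getvalue()))
```

Running the same helper on the bytes of b.zip would show whether the two archives differ in this field on a given machine.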

cecton commented 3 years ago

I don't have a machine running OSX so I can't really help, but I remember I had numerous issues with differences between versions of libmagic. Some detected MIME types properly, some didn't. This is the reason why I was updating file and libmagic on the CI.

It is very likely that file is a UNIX command and therefore the OSX version of it has a completely different implementation than the GNU version for Linux. This is the case for other commands like sed and tar which are not 100% compatible between OSX and Linux.

I don't really have a solution for this. You might want to tell the user to install the GNU version on OSX and say "this is the only version we support officially".

One other alternative would be to implement your own MIME detection mechanism but this might bring other problems. For example, ZIP files always start with the bytes "PK", that one is easy to identify.
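
A minimal sketch of that alternative, assuming detection only needs the first few bytes of the stream. `looks_like_zip` and `sniff_mime` are hypothetical names, not part of destream, and the signature list covers only the standard ZIP record markers.

```python
import io
import zipfile

# Signatures a ZIP file can start with: local file header, end of central
# directory (an empty archive), and the spanned-archive marker.
ZIP_SIGNATURES = (b"PK\x03\x04", b"PK\x05\x06", b"PK\x07\x08")


def looks_like_zip(head: bytes) -> bool:
    """Return True if the leading bytes match a ZIP signature."""
    return head.startswith(ZIP_SIGNATURES)


def sniff_mime(head: bytes, fallback: str = "application/octet-stream") -> str:
    """Hypothetical fallback: report application/zip based on magic bytes."""
    return "application/zip" if looks_like_zip(head) else fallback


# Same archive as in the failing test, built in memory.
raw = io.BytesIO()
with zipfile.ZipFile(raw, "w") as zf:
    zf.writestr("test_file", b"Hello World\n")

print(sniff_mime(raw.getvalue()[:4]))
print(sniff_mime(b"Hello World\n"))
```

One of the "other problems" is that every ZIP-based container (for example .jar or .docx) starts with the same bytes, so this check cannot tell them apart. The standard library's `zipfile.is_zipfile` is a related check that looks for the end-of-central-directory record instead of the leading signature.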

ahupp commented 3 years ago

Can you share a.zip? What it looks like is that there's a magic database entry for this second format, but the mimetype wasn't setup properly.

eumiro commented 3 years ago

Can you share a.zip? What it looks like is that there's a magic database entry for this second format, but the mimetype wasn't setup properly.

With the following script in Python 3.9:

import zipfile
with zipfile.ZipFile('a.zip', 'w') as zf:
    zf.writestr('a.txt', 'hello world\n')

I get a 120-byte file a.zip:

❯ file --mime-type a.zip
a.zip: application/octet-stream

And this is its base64 version:

UEsDBBQAAAAAAE2dLVItOwivDAAAAAwAAAAFAAAAYS50eHRoZWxsbyB3b3JsZApQSwECFAMUAAAA
AABNnS1SLTsIrwwAAAAMAAAABQAAAAAAAAAAAAAAgAEAAAAAYS50eHRQSwUGAAAAAAEAAQAzAAAA
LwAAAAAA

The md5sum of the file is 64561ffd00255a30ffaa38acc9867eed

Thank you for looking at the problem!
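
For anyone who wants to rebuild a.zip from the base64 text above, a minimal sketch using only the standard library follows. It prints whether the digest matches the md5sum quoted above rather than assuming that it does; the variable names are illustrative.

```python
import base64
import hashlib
import io
import zipfile

# Base64 text copied from the comment above.
A_ZIP_BASE64 = """
UEsDBBQAAAAAAE2dLVItOwivDAAAAAwAAAAFAAAAYS50eHRoZWxsbyB3b3JsZApQSwECFAMUAAAA
AABNnS1SLTsIrwwAAAAMAAAABQAAAAAAAAAAAAAAgAEAAAAAYS50eHRQSwUGAAAAAAEAAQAzAAAA
LwAAAAAA
"""
CLAIMED_MD5 = "64561ffd00255a30ffaa38acc9867eed"

# b64decode ignores the embedded newlines by default.
data = base64.b64decode(A_ZIP_BASE64)
print("size in bytes:", len(data))
print("md5 matches the quoted value:", hashlib.md5(data).hexdigest() == CLAIMED_MD5)

# The archive is still readable by Python, even where libmagic misdetects it.
with zipfile.ZipFile(io.BytesIO(data)) as zf:
    content = zf.read("a.txt")
print(content)

# Uncomment to write the file to disk and run `file --mime-type a.zip` on it.
# with open("a.zip", "wb") as fh:
#     fh.write(data)
```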

ahupp commented 3 years ago

Looks like a regression in libmagic 5.39:

% docker run -it archlinux:latest /bin/bash
[root@a6dd08f21d72 /]# file --version
file-5.39
magic file from /usr/share/file/misc/magic
seccomp support included
[root@a6dd08f21d72 /]# file --mime-type a.zip
a.zip: application/octet-stream

vs

 % docker run -it archlinux:20200505 /bin/bash
[root@e8f497cea4ca /]# file --version
file-5.38
magic file from /usr/share/file/misc/magic
seccomp support included
[root@e8f497cea4ca /]# file --mime-type a.zip
a.zip: application/zip

eumiro commented 3 years ago

Thank you, @ahupp! That explains why it works on Ubuntu 20.04 (libmagic1 5.38) and not on arch-based distro (file 5.39). What can we do about it?

ahupp commented 3 years ago

I'm reporting a bug upstream, but for now, I don't know if there's anything you can do about it.

ahupp commented 3 years ago

https://bugs.astron.com/view.php?id=228

cecton commented 3 years ago

@eumiro you can add a note to the troubleshooting section 😁 https://github.com/destream-py/destream#troubleshooting

eumiro commented 3 years ago

Now we'll need to find a way to brew install a specific version (5.38) of libmagic. I cannot test it and searching online points me to some git checkout hacks. Any idea?

eumiro commented 3 years ago

I think we can introduce the macos CI because #27 will then correctly xfail the test.
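
A sketch of how a test suite might decide when to expect the failure. This is not the content of #27. It assumes that only file 5.39 is affected, as reported in this thread, and that the `file --version` output has the `file-X.Y` shape shown above. All names are hypothetical.

```python
import re
from typing import Tuple

# Assumption: only 5.39 is known to misdetect the Python-written ZIP.
AFFECTED_VERSIONS = {(5, 39)}


def parse_file_version(output: str) -> Tuple[int, int]:
    """Extract (major, minor) from the output of `file --version`."""
    match = re.search(r"file-(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError("unrecognised `file --version` output")
    return int(match.group(1)), int(match.group(2))


def zip_detection_is_broken(output: str) -> bool:
    """Return True when the reported version is in the affected set."""
    return parse_file_version(output) in AFFECTED_VERSIONS


# Sample outputs taken from the two Docker runs in this thread.
ARCH_LATEST = "file-5.39\nmagic file from /usr/share/file/misc/magic\n"
ARCH_20200505 = "file-5.38\nmagic file from /usr/share/file/misc/magic\n"

print(zip_detection_is_broken(ARCH_LATEST))
print(zip_detection_is_broken(ARCH_20200505))
```

In a real test run the string would come from calling `file --version` (for example via `subprocess.run`), and the boolean could feed a conditional `pytest.mark.xfail` or `unittest.skipIf` decorator.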

cecton commented 3 years ago

Now we'll need to find a way to brew install a specific version (5.38) of libmagic. I cannot test it and searching online points me to some git checkout hacks. Any idea?

Did you see this? https://stackoverflow.com/questions/3987683/homebrew-install-specific-version-of-formula

Maybe try brew install libmagic@5.38?

Apparently you can do brew versions libmagic to see what is available.

cecton commented 3 years ago

I think we can introduce the macos CI because #27 will then correctly xfail the test.

Very good point! haha That sounds like a perfectly acceptable solution for me.

(I wouldn't suggest adding a Windows CI check just yet because I suspect the test code will fail to even execute properly.)

eumiro commented 3 years ago

Maybe try brew install libmagic@5.38?

That's it, thanks.

I will now pin both linux and macos to libmagic version 5.38. As soon as there's a problem installing it or we get a new fixed version of libmagic, we can deal with it again.

cecton commented 3 years ago

I feel trapped every time I click on the link "View it on GitHub" because I get that page that says there is nothing to see :confounded: https://twitter.com/CecileTonglet/status/1348595584136077314

eumiro commented 3 years ago

I feel trapped every time I click on the link "View it on GitHub" because I get that page that says there is nothing to see :confounded: https://twitter.com/CecileTonglet/status/1348595584136077314

I am sorry. But who wrote about quick dirty commits that get cleaned up afterwards? :thinking:

cecton commented 3 years ago

I am sorry. But who wrote about quick dirty commits that get cleaned up afterwards? :thinking:

:innocent:

Now my PRs look more like this: https://github.com/IMI-eRnD-Be/wasm-run/pull/28. You can read the intermediate commits easily, but I squash-merged at the end to https://github.com/IMI-eRnD-Be/wasm-run/commit/fffb646d8858bb1d39445e11dac19c2d55292580