ahupp / python-magic

A python wrapper for libmagic

ArgumentError: maximum recursion depth exceeded while calling a Python object #209

Closed zoj613 closed 4 years ago

zoj613 commented 4 years ago

I'm getting a weird error and I believe it's coming from using this library in my code. I use it to guess the correct mimetype of email attachment bytes that are labelled with the wrong mimetype, application/octet-stream. Some of them don't have an extension in their reported file names, so this library helps figure out what kind of documents they are. Here is part of the stack trace:

  File "/home/tools/text_api/extractor_factory.py", line 175, in to_text
    return extractor.to_text(file_bytes)
  [Previous line repeated 966 more times]
  File "/home/tools/text_api/extractor_factory.py", line 172, in to_text
    mimetype = magic.from_buffer(file_bytes, mime=True)
  File "/home/.pyenv/versions/3.7.3/envs/metro/lib/python3.7/site-packages/magic.py", line 148, in from_buffer
    return m.from_buffer(buffer)
  File "/home/.pyenv/versions/3.7.3/envs/metro/lib/python3.7/site-packages/magic.py", line 80, in from_buffer
    return maybe_decode(magic_buffer(self.cookie, buf))
  File "/home/.pyenv/versions/3.7.3/envs/metro/lib/python3.7/site-packages/magic.py", line 255, in magic_buffer
    return _magic_buffer(cookie, buf, len(buf))
ctypes.ArgumentError: argument 3: <class 'RecursionError'>: maximum recursion depth exceeded while calling a Python object

Python then raises this exception: ArgumentError: argument 3: <class 'RecursionError'>: maximum recursion depth exceeded while calling a Python object

My suspicion is that len(buf) in the line return _magic_buffer(cookie, buf, len(buf)) is causing this.

What could be the cause of this? I'm using the latest version of the library on Ubuntu 18.04.

ahupp commented 4 years ago

Weird. Can you share the implementation of text_api/extractor_factory.py? Is file_bytes a regular bytes/str object?

zoj613 commented 4 years ago

Unfortunately I can't, since I don't own it, but I can give a good idea of it. It's basically a module that uses the factory method design pattern to implement text extraction from email attachment data. The attachment file data is a bytes object, obtained by calling Python's EmailMessage.get_content() method on the attachment object. Most attachments come with the correct mimetype, which the factory uses to call the appropriate text extractor. Some attachments have the incorrect mimetype application/octet-stream even when it should be, for example, application/pdf. So your library is used to infer the correct mimetype from the attachment file bytes, and the factory then uses this inferred value to call the appropriate text extractor. For some reason, while extracting text from many email attachments, I ran into that exception. I don't know exactly what happened. For now I've decided to catch the exception and skip the files that trigger it until I find out what went wrong.

I'm thinking that maybe magic infers the same application/octet-stream mimetype for some of these files, so the factory recursively asks magic to infer the mimetype over and over until it errors out.
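
To make that hypothesis concrete, here is a minimal sketch of how such a loop could arise. It is not the real extractor_factory.py; detect_mime() is a hypothetical stand-in for magic.from_buffer(file_bytes, mime=True), and to_text() and recursion_hit() are illustrative names only.

```python
GENERIC = "application/octet-stream"


def detect_mime(file_bytes: bytes) -> str:
    # Hypothetical stand-in for magic.from_buffer(file_bytes, mime=True).
    # Like a real detector, it falls back to the generic type for data
    # it does not recognise.
    if file_bytes.startswith(b"%PDF-"):
        return "application/pdf"
    return GENERIC


def to_text(file_bytes: bytes, mimetype: str) -> str:
    # Assumed shape of the factory: a generic mimetype triggers detection
    # followed by a re-dispatch through the same function.
    if mimetype == GENERIC:
        # If detection also yields the generic type, this call never
        # terminates on its own and Python eventually raises RecursionError.
        return to_text(file_bytes, detect_mime(file_bytes))
    return f"extracted as {mimetype}"


def recursion_hit(file_bytes: bytes) -> bool:
    # Returns True when the re-dispatch loop exhausts the recursion limit.
    try:
        to_text(file_bytes, GENERIC)
    except RecursionError:
        return True
    return False
```

With this sketch, bytes that start with %PDF- are dispatched once and return normally, while unrecognised bytes keep re-entering to_text() until the interpreter gives up. In the real traceback the limit happened to be reached inside the ctypes call, which would explain why the RecursionError is reported under ctypes.ArgumentError.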

ahupp commented 4 years ago

Yeah, it looks like the function that's being recursively called is your to_text() function, not something inside python-magic. If that's the issue, I'd suggest restructuring to remove the recursive call.
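
One possible shape for that restructuring, as a rough sketch only: detect the mimetype at most once, then look up the extractor directly instead of calling to_text() again. The names here (sniff_mime, EXTRACTORS, the detect parameter) are hypothetical and not taken from the reporter's code; in real use the detector would presumably be something like lambda b: magic.from_buffer(b, mime=True).

```python
from typing import Callable, Dict

GENERIC = "application/octet-stream"


def sniff_mime(file_bytes: bytes) -> str:
    """Hypothetical stand-in for magic.from_buffer(file_bytes, mime=True)."""
    if file_bytes.startswith(b"%PDF-"):
        return "application/pdf"
    return GENERIC


# Hypothetical registry mapping mimetypes to extractor callables.
EXTRACTORS: Dict[str, Callable[[bytes], str]] = {
    "application/pdf": lambda data: "pdf text",
    "text/plain": lambda data: data.decode("utf-8", errors="replace"),
}


def to_text(
    file_bytes: bytes,
    declared_mimetype: str,
    detect: Callable[[bytes], str] = sniff_mime,
) -> str:
    mimetype = declared_mimetype
    if mimetype == GENERIC:
        # Run detection exactly once; never re-enter to_text().
        mimetype = detect(file_bytes)
    extractor = EXTRACTORS.get(mimetype)
    if extractor is None:
        # Still generic or simply unsupported: fail loudly instead of looping.
        raise ValueError(f"no extractor for mimetype {mimetype!r}")
    return extractor(file_bytes)
```

Files that remain application/octet-stream after detection now raise a ValueError that the caller can catch and skip, rather than exhausting the stack.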

I'm going to close this issue since it looks like app code, but lmk if you need more help