Closed zoj613 closed 4 years ago
Weird. Can you share the implementation of text_api/extractor_factory.py? Is file_bytes a regular bytes/str object?
Unfortunately I can't, since I don't own it, but I can give a good idea: it's basically a module that uses the factory method design pattern to implement text extraction from email attachment data. The attachment file data is a bytes object, obtained by calling Python's EmailMessage.get_content() method on the attachment object. Most attachments come with the correct mimetype, so the factory can use it to call the appropriate text extractor. Some attachments have the incorrect mimetype application/octet-stream even when it should be, for example, application/pdf. So your library is used to infer the correct mimetype from the attachment file bytes, and this inferred value is then used by the factory to call the appropriate text extraction technique. For some reason, while extracting text from many email attachments, I ran into that exception. I don't know what exactly happened. I decided to catch the exception and skip files that trigger it for now, until I find out what went wrong.
I'm thinking maybe magic infers the same application/octet-stream mimetype for some of these files, so the factory then recursively calls magic to do the mimetype inference over and over until it errors out.
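To make that suspicion concrete, here is a minimal sketch of how such a loop could arise. Everything in it is hypothetical: the real text_api/extractor_factory.py was not shared, so detect_mimetype() is a stand-in for the real detector that always fails to identify the data, and to_text_recursive() is only a guess at the shape of the factory entry point.

```python
OCTET_STREAM = "application/octet-stream"


def detect_mimetype(data: bytes) -> str:
    """Hypothetical stand-in for the real mimetype detector.

    It always reports application/octet-stream, mimicking the suspected
    case where inference cannot identify the attachment.
    """
    return OCTET_STREAM


def to_text_recursive(data: bytes, mimetype: str) -> str:
    """Hypothetical factory entry point with the suspected flaw."""
    if mimetype == OCTET_STREAM:
        # Re-infer the type and dispatch again. Nothing stops this from
        # repeating if inference returns application/octet-stream again.
        return to_text_recursive(data, detect_mimetype(data))
    # Assumed placeholder for a real extractor.
    return data.decode("utf-8", errors="replace")


def hits_recursion_limit(data: bytes) -> bool:
    """Return True if extracting text from data exhausts the call stack."""
    try:
        to_text_recursive(data, OCTET_STREAM)
    except RecursionError:
        return True
    return False
```

With a detector like this, any attachment that stays application/octet-stream after inference keeps re-entering the same function until Python's recursion limit is reached.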
Yeah, it looks like the function that's being recursively called is your to_text() function, not something inside python-magic. If that's the issue, I'd suggest restructuring to remove the recursive call.
I'm going to close this issue since it looks like app code, but lmk if you need more help
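One possible shape for that restructuring, as a rough sketch only: infer the mimetype at most once, then either dispatch to an extractor or fail with a clear error. The names here (UnsupportedAttachment, EXTRACTORS, extract_plain, and the detect parameter) are assumptions made for illustration, not the actual application code.

```python
from typing import Callable, Dict

OCTET_STREAM = "application/octet-stream"


class UnsupportedAttachment(Exception):
    """Raised when no extractor is registered for the resolved mimetype."""


def extract_plain(data: bytes) -> str:
    """Assumed placeholder extractor for plain-text attachments."""
    return data.decode("utf-8", errors="replace")


# Hypothetical registry mapping mimetypes to extractor functions.
EXTRACTORS: Dict[str, Callable[[bytes], str]] = {
    "text/plain": extract_plain,
}


def to_text(data: bytes, mimetype: str, detect: Callable[[bytes], str]) -> str:
    """Extract text, inferring the mimetype at most once."""
    if mimetype == OCTET_STREAM:
        # A single inference attempt; the function never calls itself.
        mimetype = detect(data)
    extractor = EXTRACTORS.get(mimetype)
    if extractor is None:
        # Covers the case where inference still says application/octet-stream.
        raise UnsupportedAttachment(f"no extractor for mimetype {mimetype!r}")
    return extractor(data)
```

In the setup described in this thread, detect could be something like lambda b: magic.from_buffer(b, mime=True). Files that still come back as application/octet-stream then raise UnsupportedAttachment right away, which is easier to catch and log than a RecursionError surfacing from deep inside ctypes.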
I'm getting a weird error and I believe it's coming from using this library in my code. I use it to guess the correct mimetype of email attachment file bytes that are labelled with the wrong mimetype application/octet-stream. Some of them don't have an extension in their reported file names, so this library helps figure out what documents they are. Here is some of the stack trace:
Then Python gives this exception:
ArgumentError: argument 3: <class 'RecursionError'>: maximum recursion depth exceeded while calling a Python object
My suspicion is that len(buf) from this line
return _magic_buffer(cookie, buf, len(buf))
is causing this. What could be the cause of this? I'm using the latest version of the library on Ubuntu 18.04.