Open madduck opened 1 year ago
Arguably, Docspell should use
Content-Type: application/applefile
as and if provided
I disagree here. It should really never just use anything provided in an e-mail. It can be completely wrong (and often is in my experience). It can also be easily faked.
I think this case is quite special, in that a very weird file is presented to docspell. I don't thnik it is a big problem at all. People should not flood docspell with such files in the first place.
It incorrectly classifies it as a text file and then fails to decode its bytes into characters. This is currently an unexpected failure and the only sensible thing, imho, is to stop (after some general retry, because the processing engine doesn't know better). Then there will be an item created and this weird file is also present.
I don't think it is a good idea to keep going on unexpected errors. Instead we can make exactly this error an expected one (it should be) and skip the attachment when it happens. So decoding bytes into characters can fail and this failure must be obvious and handled in the processing. But I wouldn't like to see every error like this.
Arguably, Docspell should use
Content-Type: application/applefile
as and if providedI disagree here. It should really never just use anything provided in an e-mail. It can be completely wrong (and often is in my experience). It can also be easily faked.
What's the attack vector here? What's the problem?
I think this case is quite special, in that a very weird file is presented to docspell. I don't thnik it is a big problem at all. People should not flood docspell with such files in the first place.
When Docspell slurps Emails up via IMAP or receives them with SMTP, it is quite possible that such files appear without the user knowing about it beforehand.
It incorrectly classifies it as a text file and then fails to decode its bytes into characters. This is currently an unexpected failure and the only sensible thing, imho, is to stop (after some general retry, because the processing engine doesn't know better). Then there will be an item created and this weird file is also present.
I don't think it is a good idea to keep going on unexpected errors. Instead we can make exactly this error an expected one (it should be) and skip the attachment when it happens. So decoding bytes into characters can fail and this failure must be obvious and handled in the processing. But I wouldn't like to see every error like this.
This is relevant to #2404 rather than this issue, which is only about the misidentification.
What's the problem?
The problem is, that this information is often wrong. It will then fail just as well.
If this issue is only about the file type detection, it should be filed at tika.
I've registered for an account with Apache Jira. It wasn't clear to me who's doing what in the pipeline. Is there a document that illustrates the processing pipeline?
Not a very detailed one as these get outdated way to quickly. The mime type detection is done via tika as I said previously. It is not only looking at the filename. But things can go wrong here 🤷🏼 I suppose there will be other files in the future that will be not correctly detected, but for me at least this is not a big issue. They can't be processed anyways, so I wouldn't bother too much with raising issues.
I've raised https://issues.apache.org/jira/browse/TIKA-4172
I agree, @eikek, that there will be other files misclassified in the future, This is why I think #2404 is important.
So I spent some time with tika. Version 2.9.1 is running on localhost:9998
, and I pinpointed the problem:
% curl -T ~/.tmp/d0101c66_mySQL40.sql http://localhost:9998/meta/Content-Type
Content-Type,application/octet-stream
but:
% curl -T ~/.tmp/d0101c66_mySQL40.sql -H "Content-Disposition: attachment; filename=d0101c66_mySQL40.sql" http://localhost:9998/meta/Content-Type
Content-Type,text/x-sql; charset=IBM424
so indeed, the filename hint is too persuasive to tika. Will experiment…
@eikek why are you passing metadata to tika at all?
If I don't set TikaCoreProperties.RESOURCE_NAME_KEY
of the metadata passed to tika to the filename (but I am actually keeping the advertised content type), then this particular, problematic file is not misclassified and instead included as application/octet-stream
.
why are you passing metadata to tika at all?
Well, why not :)? I trust tika in doing a good job (which it does!) and give the information I have. Not sure if I would want to change this only because of this file.
Well, why not :)?
Because tika actually doesn't do a good job, and it doesn't claim it tries. Look here at the docs:
Where the name of the file is known, it is sometimes possible to guess the file type from the name or extension. Within the tika-mimetypes.xml file is a list of patterns which are used to identify the type from the filename.
However, because files may be renamed, this method of detection is quick but not always as accurate.
Moreover, you've previously argued that you wouldn't trust the Content-Type
provided in an email, but you do trust the filename provided?
By the way, it's not only this file. There are several examples of when detection goes wrong, but they all seem fixed when the filename isn't passed.
Because tika actually doesn't do a good job, and it doesn't claim it tries
In my opinion, tika does a very good job. There is no perfect thing. If you read further down the docs you posted, you'll see that it not only does name based recognition.
Moreover, you've previously argued that you wouldn't trust the
Content-Type
provided in an email, but you do trust the filename provided?
You wanted the content type to be used as is from the mail, which I objected. Not sure what you want to suggest here.
By the way, it's not only this file. There are several examples of when detection goes wrong, but they all seem fixed when the filename isn't passed.
What do you propose? Not giving any information to tika, no filename no given content type? What is the correct thing to do for every possible file that arrives?
One important point is "seem fixed" for me. It is more important to detect a supported file correctly than to detect an unsupported file correctly. Throwing unsupported files at docspell that are not handled perflectly well, is not such an important issue to me.
I believe that tika has a bug in this case, which is why I filed an issue with Jira. Let's see where that leads.
My point about the content-type from mail vs. filename is that if you also fed the content-type of the attachment to tika, it would do the right thing. But you didn't want to do that, because it can be faked. You are, however, feeding it the filename, and only the filename. This can be faked just as easily.
I guess I am trying to suggest: all or nothing, but doing only the filename and not the content-type does seem a bit arbitrary. We need to get the Jira bug fixed. Meanwhile, I am working around the issue by ignoring the filename. Long-term, I think it'd be okay to just rely on the content analysis and not provide any metadata, or to provide everything we have.
Do you have an example of a "supported file" where you think it's paramount to provide the filename to tika, because it wouldn't be able to classify cased on content?
You are, however, feeding it the filename, and only the filename. This can be faked just as easily.
No, this is not true... not sure why you keep saying this. Tika gets all the information available, the file contents, the given content-type and the filename.
I am sorry @eikek if I am not understanding, and creating extra work for you. I was looking at https://github.com/eikek/docspell/blob/master/modules/files/src/main/scala/docspell/files/TikaMimetype.scala#L46 and saw only HTTPHeaders
, but it occured to me only now that the hint.advertised
property is the source, and if you're saying that this will be application/applefile
in the example up top, the I think tika really has it backwards, because it really should not prefer a filename match over an explicit content-type…
This issue is being carried over from #2376, and is about the situation wherein during email processing is being mis-identified. It's a minor issue in and of itself, but might be symptomatic of a bigger problem.
Below you will find an email that encodes an attachment of a file named
d0101c66_mySQL40.sql
. Docspell currently classifies this as atext/x-sql
file, which it isn't.libmagic
5.45 thinks it is a binary of type "MacBinary III INVALID date" (possibly corrupt), which is close(r), because the file is in fact binary.To recreate the file, take the three Base64-encoded lines from the email and pipe them through a Base64-decoder, such as
base64 -d
.Arguably, Docspell should use
Content-Type: application/applefile
as and if provided. In the absence of an explicitContent-Type
header, any uncertain file should probably becomeapplication/octet-stream
. Binary files should definitely not becometext/*
files.