eikek / docspell

Assist in organizing your piles of documents, resulting from scanners, e-mails and other sources with miminal effort.
https://docspell.org
GNU Affero General Public License v3.0
1.65k stars 127 forks source link

File type detection flaw during email/archive processing #2403

Open madduck opened 1 year ago

madduck commented 1 year ago

This issue is being carried over from #2376, and is about the situation wherein during email processing is being mis-identified. It's a minor issue in and of itself, but might be symptomatic of a bigger problem.

Below you will find an email that encodes an attachment of a file named d0101c66_mySQL40.sql. Docspell currently classifies this as a text/x-sql file, which it isn't. libmagic 5.45 thinks it is a binary of type "MacBinary III INVALID date" (possibly corrupt), which is close(r), because the file is in fact binary.

To recreate the file, take the three Base64-encoded lines from the email and pipe them through a Base64-decoder, such as base64 -d.

Arguably, Docspell should use Content-Type: application/applefile as and if provided. In the absence of an explicit Content-Type header, any uncertain file should probably become application/octet-stream. Binary files should definitely not become text/* files.

Date: Sat, 11 Nov 2023 21:54:36 +0100
From: invalid@example.org
To: invalid@example.org
Subject: Docspell test email
Message-ID: <20231111205436.oefruzlox57xssbt@example.org>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="holmsj2d6wm2cnnr"
Content-Disposition: inline

--holmsj2d6wm2cnnr
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline

Check attachment

--holmsj2d6wm2cnnr
Content-Type: application/applefile
Content-Disposition: attachment; filename="d0101c66_mySQL40.sql"
Content-Transfer-Encoding: base64

ABRkMDEwMWM2Nl9teVNRTDQwLnNxbAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAbUJJTgAA
AAAAAAAAAAAAAAAAAACCgf+/AAA=

--holmsj2d6wm2cnnr--
eikek commented 1 year ago

Arguably, Docspell should use Content-Type: application/applefile as and if provided

I disagree here. It should really never just use anything provided in an e-mail. It can be completely wrong (and often is in my experience). It can also be easily faked.

I think this case is quite special, in that a very weird file is presented to docspell. I don't thnik it is a big problem at all. People should not flood docspell with such files in the first place.

It incorrectly classifies it as a text file and then fails to decode its bytes into characters. This is currently an unexpected failure and the only sensible thing, imho, is to stop (after some general retry, because the processing engine doesn't know better). Then there will be an item created and this weird file is also present.

I don't think it is a good idea to keep going on unexpected errors. Instead we can make exactly this error an expected one (it should be) and skip the attachment when it happens. So decoding bytes into characters can fail and this failure must be obvious and handled in the processing. But I wouldn't like to see every error like this.

madduck commented 1 year ago

Arguably, Docspell should use Content-Type: application/applefile as and if provided

I disagree here. It should really never just use anything provided in an e-mail. It can be completely wrong (and often is in my experience). It can also be easily faked.

What's the attack vector here? What's the problem?

I think this case is quite special, in that a very weird file is presented to docspell. I don't thnik it is a big problem at all. People should not flood docspell with such files in the first place.

When Docspell slurps Emails up via IMAP or receives them with SMTP, it is quite possible that such files appear without the user knowing about it beforehand.

It incorrectly classifies it as a text file and then fails to decode its bytes into characters. This is currently an unexpected failure and the only sensible thing, imho, is to stop (after some general retry, because the processing engine doesn't know better). Then there will be an item created and this weird file is also present.

I don't think it is a good idea to keep going on unexpected errors. Instead we can make exactly this error an expected one (it should be) and skip the attachment when it happens. So decoding bytes into characters can fail and this failure must be obvious and handled in the processing. But I wouldn't like to see every error like this.

This is relevant to #2404 rather than this issue, which is only about the misidentification.

eikek commented 1 year ago

What's the problem?

The problem is, that this information is often wrong. It will then fail just as well.

eikek commented 1 year ago

If this issue is only about the file type detection, it should be filed at tika.

madduck commented 1 year ago

I've registered for an account with Apache Jira. It wasn't clear to me who's doing what in the pipeline. Is there a document that illustrates the processing pipeline?

eikek commented 1 year ago

Not a very detailed one as these get outdated way to quickly. The mime type detection is done via tika as I said previously. It is not only looking at the filename. But things can go wrong here 🤷🏼 I suppose there will be other files in the future that will be not correctly detected, but for me at least this is not a big issue. They can't be processed anyways, so I wouldn't bother too much with raising issues.

madduck commented 1 year ago

I've raised https://issues.apache.org/jira/browse/TIKA-4172

I agree, @eikek, that there will be other files misclassified in the future, This is why I think #2404 is important.

madduck commented 1 year ago

So I spent some time with tika. Version 2.9.1 is running on localhost:9998, and I pinpointed the problem:

% curl -T ~/.tmp/d0101c66_mySQL40.sql http://localhost:9998/meta/Content-Type
Content-Type,application/octet-stream

but:

% curl -T ~/.tmp/d0101c66_mySQL40.sql -H "Content-Disposition: attachment; filename=d0101c66_mySQL40.sql" http://localhost:9998/meta/Content-Type
Content-Type,text/x-sql; charset=IBM424

so indeed, the filename hint is too persuasive to tika. Will experiment…

@eikek why are you passing metadata to tika at all?

madduck commented 1 year ago

If I don't set TikaCoreProperties.RESOURCE_NAME_KEY of the metadata passed to tika to the filename (but I am actually keeping the advertised content type), then this particular, problematic file is not misclassified and instead included as application/octet-stream.

eikek commented 1 year ago

why are you passing metadata to tika at all?

Well, why not :)? I trust tika in doing a good job (which it does!) and give the information I have. Not sure if I would want to change this only because of this file.

madduck commented 1 year ago

Well, why not :)?

Because tika actually doesn't do a good job, and it doesn't claim it tries. Look here at the docs:

Where the name of the file is known, it is sometimes possible to guess the file type from the name or extension. Within the tika-mimetypes.xml file is a list of patterns which are used to identify the type from the filename.

However, because files may be renamed, this method of detection is quick but not always as accurate.

Moreover, you've previously argued that you wouldn't trust the Content-Type provided in an email, but you do trust the filename provided?

By the way, it's not only this file. There are several examples of when detection goes wrong, but they all seem fixed when the filename isn't passed.

eikek commented 1 year ago

Because tika actually doesn't do a good job, and it doesn't claim it tries

In my opinion, tika does a very good job. There is no perfect thing. If you read further down the docs you posted, you'll see that it not only does name based recognition.

Moreover, you've previously argued that you wouldn't trust the Content-Type provided in an email, but you do trust the filename provided?

You wanted the content type to be used as is from the mail, which I objected. Not sure what you want to suggest here.

By the way, it's not only this file. There are several examples of when detection goes wrong, but they all seem fixed when the filename isn't passed.

What do you propose? Not giving any information to tika, no filename no given content type? What is the correct thing to do for every possible file that arrives?

One important point is "seem fixed" for me. It is more important to detect a supported file correctly than to detect an unsupported file correctly. Throwing unsupported files at docspell that are not handled perflectly well, is not such an important issue to me.

madduck commented 1 year ago

I believe that tika has a bug in this case, which is why I filed an issue with Jira. Let's see where that leads.

My point about the content-type from mail vs. filename is that if you also fed the content-type of the attachment to tika, it would do the right thing. But you didn't want to do that, because it can be faked. You are, however, feeding it the filename, and only the filename. This can be faked just as easily.

I guess I am trying to suggest: all or nothing, but doing only the filename and not the content-type does seem a bit arbitrary. We need to get the Jira bug fixed. Meanwhile, I am working around the issue by ignoring the filename. Long-term, I think it'd be okay to just rely on the content analysis and not provide any metadata, or to provide everything we have.

Do you have an example of a "supported file" where you think it's paramount to provide the filename to tika, because it wouldn't be able to classify cased on content?

eikek commented 1 year ago

You are, however, feeding it the filename, and only the filename. This can be faked just as easily.

No, this is not true... not sure why you keep saying this. Tika gets all the information available, the file contents, the given content-type and the filename.

madduck commented 1 year ago

I am sorry @eikek if I am not understanding, and creating extra work for you. I was looking at https://github.com/eikek/docspell/blob/master/modules/files/src/main/scala/docspell/files/TikaMimetype.scala#L46 and saw only HTTPHeaders, but it occured to me only now that the hint.advertised property is the source, and if you're saying that this will be application/applefile in the example up top, the I think tika really has it backwards, because it really should not prefer a filename match over an explicit content-type…