eikek / docspell

Assist in organizing your piles of documents, resulting from scanners, e-mails and other sources with miminal effort.
https://docspell.org
GNU Affero General Public License v3.0
1.59k stars 120 forks source link

Joex imports unnecessary embedded images from HTML messages on "Import only attachments" #1398

Open arittner opened 2 years ago

arittner commented 2 years ago

Hi!

Sometimes I get HTML mails with PDF invoices and embedded images (logs from the companies). Joex imports the image and the PDF. But the image is really not needed. Embedded images in mails usually have a special identifier in the MultiMimePart header (cid:random-number). It would be very useful if these are filtered out during import. In Docspell it is very time-consuming to delete the superfluous images of a mail import and in addition the embedded graphics are always displayed first, but not the interesting attachments.

eikek commented 2 years ago

I agree we could try to detect "inline attachments" when the option "import attachments only" is checked.

arittner commented 2 years ago

This would be very useful, because removing unwanted attachments is really effortful.

I'm checking, if I'm able to contribute PRs, I have to get my dusty Scala knowledge out of my head first, then setup my environment. I rather grew up with Java. :-/

eikek commented 2 years ago

I have a longer java history myself :-) so no worries.The email stuff is handled by this library - maybe the functionality to detect inline attachments could be added there ….

arittner commented 2 years ago

Yeah, I guess https://github.com/eikek/emil/blob/master/modules/javamail/src/main/scala/emil/javamail/conv/BodyDecode.scala should read the inline flag.

Example of a body part:

--------------090303020209010600070908
Content-Type: image/png;
 name="blabla.png"
Content-Transfer-Encoding: base64
Content-ID: <part1.06090408.01060107>
Content-Disposition: inline;
 filename="blabla.png"

An HTML message can refer to this body part with cid:part1.06090408.01060107. And all elements should be related together with Content-Type: multipart/related; boundary="------------090303020209010600070908"

But the related information is not needed. The Content-Disposition: inline; is the important information to exclude the attachment.

The BodyDecode.scala maybe read out this information from the Part and in Attachment a new attribute is needed to store this information. Joex should check it and filter out the flagged attachments with inline=true.

But this is only a very short review in the git repo source code. I may have overlooked something. I'll double-check it in the IDE.

eikek commented 2 years ago

Ah good catch! Yes I think this is a great idea. This information can be carried in the attachment. Only little issue is the other way around. What does it mean to add an "inline" attachment to a newly constructed mail. This is not supported in emil. Maybe we can hide the constructor or something, so that users are not confused.

arittner commented 2 years ago

Allowing inline attachments (in the mail builder) should manage the multipart/related (if you have both in a mail: plain text and html). This seems a bit too complex just to find out if they are inline attachments.

Maybe we can hide the constructor or something, so that users are not confused.

yes, and the access to "inline" should be "read-only"

arittner commented 2 years ago

Hello @eikek

I've been a bit distracted with other things lately, unfortunately, but I haven't forgotten about this. Therefore, now an update:

I changed my mind, and I think we should allow setting the Content-Disposition for an attachment. Maybe this will not include all possible cases you can manage with Content-Disposition, but I found a small bug in the BasicEncoder. The filename for an attachment will not apply without a Content-Disposition. This can be fixed with an implicit set, but then more control for the Emil user makes sense.

So my current implementation parse the Content-Disposition to distinguish between inline and attachment. The information will be stored in Attachment. But I also implemented in BasicEncoder the encoding for Content-Disposition type (inline, attachment) and a proper filename handling (because a filename w/o a disposition type doesn't work). I had to provide Attach with a few utility methods for this as well.

I wrote also some tests to check if this is working. Furthermore, I'm not 100% sure if I implemented everything perfectly Scala-style, I'm still very much in Java after all.

In the next time, I'll create an PR in the Emil project.

eikek commented 2 years ago

I think that sounds good. I remember that I also played around a little and ended up adding a Disposition enum to the Attachment class. I think the improper filename handling is "just a bug", I had hoped that the appropriate header is set (hardcoded to "attachment"). Also strange (as in "not good" :-)) that no test has captured this! Great if you want to open a PR at emil (we can discuss things there) - thank you! And don't worry about scala style, I can comment if you like and/or adjust.

arittner commented 2 years ago

Done, see: https://github.com/eikek/emil/pull/345

arittner commented 2 years ago

Great, @eikek, the PR is merged - thank you for your support.

Now I try to identify where to filter it. If I see it correct, we have a single point docspell.joex.mail.ReadMail. In docspell.joex.mail.ReadMail#mailToEntries Joex extracts the mail entries. I see only one reference to this method: docspell.joex.process.ExtractArchive#extractMail. The ExtractArchive confuses me a bit, but maybe this is a generic uploader from Joex to upload all kind of files to the database (and use ZIP-archives to manage multiple files in a container). Can you confirm this? Or am I missing something very basic?

So if I'm on the right path, the magic has to happen here:

  def mailToEntries[F[_]: Async](
      logger: Logger[F],
      glob: Glob,
      attachmentsOnly: Boolean
  )(mail: Mail[F]): Stream[F, Binary[F]] =
    Stream.eval(
      logger.debug(
        s"E-mail has ${mail.attachments.size} attachments and ${bodyType(mail.body)}"
      )
    ) >>
      (makeBodyEntry(logger, glob, attachmentsOnly)(mail) ++
        Stream
          .eval(TnefExtract.replace(mail))
          .flatMap(m => Stream.emits(m.attachments.all))
          .filter(a => a.filename.exists(glob.matches(caseSensitive = false)))
          .map(a =>
            Binary(a.filename.getOrElse("noname"), a.mimeType.toLocal, a.content)
          ))

The simplest solution is certainly to completely ignore the inline attachments. But I think this will be a surprise for some users. I can't think of many scenarios why anyone would need them, but there may be use cases.

The next option would be to make that available as a configuration value of the environment (without UI) for now. That's inconvenient, but it would be a solution for testing.

At the same time or possibly as an extra implementation, a configuration in the UI would be useful to set the global configuration parameter individually for users.

What do you think?

eikek commented 2 years ago

I think you are on the righ track. It works by passing the flag from the scan mailbox job to the arguments of the file processing job. So you probably traced down attachmentsOnly in the ProcessItemArgs and arrived on the correct place. It is only this one.

I'm also not so sure how to move forward. On the one hand I think it's fine to filter out all inline attachments. But yeah, it might hit some other use case we can't think of right now. I think a UI is not needed for this, because to me it makes sense to not include inline attachments when this already existing flag is true.

I like the approach to put the flag into the config, as you mentioned. Instead of directly using env variables we can put it in the config file. There is already a section user-tasks.scan-mailbox where we could add this (search for Config.ScanMailbox). Then we can just filter them in mailToEntries and pass it the result of &&-ing both flags in ScanMailboxTask. As I don't expect any issues, I would enable it by default and if people run into problems, they can disable it. Should more problems arise from this change we can still think about a UI. Wdyt?

arittner commented 2 years ago

As I don't expect any issues, I would enable it by default and if people run into problems, they can disable it. Should more problems arise from this change we can still think about a UI. Wdyt?

Perfect. With this approach, we have all options to add later a UI config, if needed.