eikek / docspell

Assist in organizing your piles of documents, resulting from scanners, e-mails and other sources, with minimal effort.
https://docspell.org
GNU Affero General Public License v3.0

Handling of problematic attachments needs to be done gracefully #2404

Open madduck opened 11 months ago

madduck commented 11 months ago

This issue is carried over from https://github.com/eikek/docspell/issues/2376. It is about the situation where an attachment of an email, or a component of an archive, has been misclassified (see #2403), causing Docspell to pick an interpreter for the identified file type that is then likely to fail.

If an interpreter/converter fails to interpret/convert a given attachment (with a given content type), there seem to be three possible explanations:

  1. the tool had a temporary failure, e.g. memory exhaustion or cosmic rays;
  2. the tool had a permanent failure, e.g. a bug, or a corrupt input file, or an alien invasion;
  3. the wrong tool was picked to do the job.

The first of these can possibly be countered with a retry or two. The second and third will never yield an acceptable result. This issue is about the best course of action for Docspell in such cases.
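For illustration, a minimal sketch of how these three classes could be modelled; these are hypothetical names, not Docspell's actual types:

```scala
// Hypothetical sketch: model the three failure classes so the pipeline
// can decide between retrying and keeping the raw attachment.
sealed trait ConversionFailure
object ConversionFailure {
  // case 1: may succeed on retry (resource exhaustion, transient errors)
  final case class Temporary(cause: Throwable) extends ConversionFailure
  // case 2: will never succeed (bug, corrupt input)
  final case class Permanent(cause: Throwable) extends ConversionFailure
  // case 3: the file type was misidentified, so the wrong tool ran
  final case class WrongTool(detectedMime: String, cause: Throwable) extends ConversionFailure
}
```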

Currently, Docspell will retry, but if a failure doesn't resolve itself, it skips the remaining attachments, skips post-processing (OCR and preview images), and does not invoke addons or job-done webhooks. The generated item will be obviously incomplete, so the user will take note, though it is questionable what options the user then has. In the case of an email message with an unexpected attachment, it might be too much to expect the user to perform surgery on the MIME tree in an attempt to fix things.

I would like to argue that in all three of the above cases (for case 1 once the temporary failure hasn't resolved within n retries; cases 2 and 3 as they are), it would be better to take the raw original of the problematic attachment and continue processing the pipeline just as if the attachment had been processed fine. The end result will be a series of attachments on an item; any unconverted, raw ones will simply sit there, with the browser unable to display them and only a raw download of the attachment on offer. Addons and webhooks can do whatever they want; they might even expect the attachment and be able to do more with it.

When OCR and preview generation encounter a raw attachment, they should just skip it. And yes, addressing this issue might also involve touching the UI to improve the handling and display of raw attachments.
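One way to picture this fallback; the types and names below are hypothetical stand-ins for Docspell's attachment model, not its actual API:

```scala
import cats.effect.IO

// Hypothetical types standing in for Docspell's attachment model.
final case class Attachment(name: String, bytes: Array[Byte])
sealed trait Processed
final case class Converted(pdf: Array[Byte]) extends Processed
final case class Raw(original: Attachment, error: Throwable) extends Processed

// If conversion fails, keep the raw original so OCR, previews, addons
// and webhooks can still run for everything else.
def convertOrKeepRaw(a: Attachment, convert: Attachment => IO[Array[Byte]]): IO[Processed] =
  convert(a).attempt.map {
    case Right(pdf) => Converted(pdf)
    case Left(err)  => Raw(a, err)
  }
```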

eikek commented 11 months ago

Oh I just commented on #2403 - I'm not sure how this issue is different from #2403?

madduck commented 11 months ago

The two issues are very closely related. I can confirm that if I rename the problematic attachment to e.g. apple.bin, then it works exactly as it should, and that's great.

This issue is about the cases when there's a permanent failure during conversion, and how Docspell should handle that. Such a permanent failure could be due to mis-identification (#2403), or it could be due to a corrupt file, or some other reason. It's unlikely that the user will be able to do anything about it, especially when the file is an attachment to an email.

Currently, an email with a permanent failure in the middle of the processing pipeline will yield an item that has the attachments included (this is good), but no previews or OCR, not even for attachments that came before a failed attachment:

[screenshot: item detail view with attachments present, but no previews or OCR text]

I understand your point about failure being a failure, but what speaks against generating previews and running OCR against the attachments that can be successfully processed?

eikek commented 11 months ago

I'm not against this in general (I wrote as much in the closed issue). I only don't like to do this on any failure; that is always a bit unpleasant, imho. I would prefer to raise this specific error (the exception from decoding bytes into characters) as a known one and skip the corresponding attachment during processing. Catching every error would defeat the stop that is currently done, or it would mean making good decisions about which errors to tolerate, and imho that is not really possible.
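A minimal sketch of this "known error" idea, with a hypothetical exception type; only this exact case is skipped, while every other error still fails the job and keeps the retry behaviour:

```scala
import cats.effect.IO

// Hypothetical marker for the one known failure (the byte-decoding
// error mentioned above); not an actual Docspell class.
final class UnsupportedAttachmentError(msg: String) extends RuntimeException(msg)

def extractOrSkip(extract: IO[String]): IO[Option[String]] =
  extract.map(Option(_)).recover {
    case _: UnsupportedAttachmentError => None // skip just this attachment
  }
```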

madduck commented 11 months ago

Sweet, thanks @eikek. I am almost ready to move on. Can you please let me know (maybe one more time?) why you prefer the current handling? If a single attachment fails, why do you want previews and OCR skipped for everything?

eikek commented 11 months ago

It would then stop being retried entirely, and some errors are temporary. I'd like to keep the retries.

madduck commented 11 months ago

I am not suggesting losing the retries. The way I imagine it, the item stays in the queue until all its parts have been converted, or the number of retries on individual parts has been exceeded.

In the example case, the conversion of parts 1 & 3 would succeed on the first try. Part 2 would fail. The item stays in the queue and gets retried. Upon retry, parts 1 & 3 do not need to be tried again, so joex would just give part 2 another shot. If it fails again, another retry cycle follows, until the maximum number of retries is reached, at which point joex just includes the raw attachment it failed to convert, along with the ones it converted just fine.

When reprocessing an item, the same thing would happen again.

Does this make sense?
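A minimal sketch of the per-part bookkeeping this scheme implies; these are hypothetical types, and as the reply below notes, Docspell does not track per-attachment state between retries today:

```scala
// Hypothetical per-attachment state that would have to survive
// between retries of the same item job.
sealed trait PartState
case object Pending extends PartState
case object Done extends PartState
final case class Failed(tries: Int) extends PartState

// Only parts that are still pending, or failed fewer than maxTries
// times, get another shot; Done parts are kept as they are.
def needsWork(s: PartState, maxTries: Int): Boolean = s match {
  case Pending       => true
  case Failed(tries) => tries < maxTries
  case Done          => false
}
```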

eikek commented 11 months ago

Yes and no. Currently joex sees the processing job of one item as one single thing.

Why is it not good to deal with this problematic file by lifting that decoding error as described here?

v6ak commented 11 months ago

@eikek I think I understand your reasons, but I am not 100% sure. My understanding:

Is my understanding correct?

madduck commented 11 months ago

> Yes and no. Currently joex sees the processing job of one item as one single thing.

Oh, so it doesn't save information between retries, other than the number of retries and the logs? Obviously, it would need to keep the item being worked on around and retry where it left off, not start from scratch every time.

> Why is it not good to deal with this problematic file by lifting that decoding error as described here?

I think it would be fine to have a set of exceptions that are accepted in this sense. I wouldn't ignore the attachment, though; I would include it raw instead.

madduck commented 10 months ago

An email came in today with a Zip file that unzip can process just fine, but Java's Zip-file handling chokes on it:

Mo.. 27. November 2023, 13:43: Extracting zip archive 560-5005.zip.
Mo.. 27. November 2023, 13:43: Filtering zip entries with '*'
Mo.. 27. November 2023, 13:43: Job 8N3PMv9g9.../1/process-item/Low execution failed. Retrying later.: invalid CEN header (bad entry name)

So it's not about whether tika misclassifies or not. We need to generally expect that there will be attachments to items that Docspell cannot properly process.
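For illustration, a sketch (a hypothetical helper, not Docspell code) of how the Java-side zip failure could be surfaced as a value rather than aborting the job; `ZipFile` throws the `ZipException` seen in the log above on a broken central directory:

```scala
import java.util.zip.ZipFile
import scala.jdk.CollectionConverters._
import scala.util.Using

// Returning the failure as a value lets the caller keep the archive
// as one raw attachment instead of failing the whole item job.
def zipEntryNames(file: java.io.File): Either[Throwable, List[String]] =
  Using(new ZipFile(file)) { zf =>
    zf.entries().asScala.map(_.getName).toList
  }.toEither
```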

I thought a bit about how to distinguish temporary from permanent failures, but I fail to come up with a realistic case of a temporary failure that would resolve itself within a few minutes, such that a retry would fix it. Post-processing doesn't use networking, so the only temporary problems I can see are around resource exhaustion. If the filesystem is full, this won't just fix itself sufficiently within a minute; same with RAM, really. And if a subprocess takes too long, then joex remains in control and can handle the retry right there.

Thus, I am wondering if the simplest approach wouldn't be to retry within the same joex invocation, keeping the state for as long as a failure can be deemed temporary and it makes sense to keep trying. But if joex gives up, it should not return the item to the queue.

Ideally, it would process all other attachments; run OCR, addons, hooks, etc.; include the failed attachment raw; and possibly flag the item as failed, maybe leaving a note somehow, so that the user can look into the problem and possibly just submit the item for reprocessing.
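A sketch of that in-process retry, under assumed signatures (not Docspell's actual job runner): retry within the same joex invocation, and when the budget is used up, surface the error as a value instead of returning the item to the queue:

```scala
import cats.effect.IO
import scala.concurrent.duration.FiniteDuration

// Hypothetical helper: retry `task` up to `tries` times in-process,
// then give up and report the error so the raw file can be kept.
def retryInProcess[A](task: IO[A], tries: Int, delay: FiniteDuration): IO[Either[Throwable, A]] =
  task.attempt.flatMap {
    case Right(a)             => IO.pure(Right(a))
    case Left(_) if tries > 1 => IO.sleep(delay) >> retryInProcess(task, tries - 1, delay)
    case Left(err)            => IO.pure(Left(err)) // give up: do not requeue
  }
```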

v6ak commented 9 months ago

This needs to be handled in multiple places (ConvertPdf, TextExtraction, …), as the code iterates over attachments in multiple places.

In my first version, I added stubs, which was easy in ConvertPdf (a failed ra gets (ra, none[RAttachmentMeta])) but annoying in TextExtraction (a dummy TextExtraction.Result). Then I started just discarding the failed attachments. They are still stored as raw, but without further processing. I am not sure which is better. The stubs look somewhat unclean, so I'd avoid them, but if there is a pragmatic reason for them, I can add them.

Also, I need to store information about the exceptions that occurred, so we know that the job has partially failed.

There are some possible further optimizations; e.g., we could fail fast (the current approach) in all tries but the last one, with only the last try being fault-tolerant.
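The rough shape of this "discard but record" approach, with hypothetical names (the actual AttemptUtils in the branch may differ): process every attachment, keep the successes, and collect the errors so the job can be marked as partially failed at the end:

```scala
import cats.effect.IO
import cats.syntax.all._

// Run `f` over every attachment, partitioning errors from successes
// instead of aborting on the first failure.
def attemptAll[A, B](as: List[A])(f: A => IO[B]): IO[(List[Throwable], List[B])] =
  as.traverse(a => f(a).attempt).map(_.partitionMap(identity))
```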

madduck commented 9 months ago

Thanks for the update @v6ak. I think discarding failed attachments and storing them as raw is better than stubs, as stubs wouldn't give access to the raw data. This way, anyone can try to make sense of the data with an addon.

Good thinking on the exceptions and optimizations. Right now I am not sure what to do here, except let's not make things too complicated. Ideally, we'd keep tabs on successes and failures, so that only failures need to be retried, but @eikek said that is a bit more involved of a job.

v6ak commented 9 months ago

I have a branch that handles problematic attachments gracefully:

The code changes aren't huge; it took much more time to understand the code and to design the change properly.

What can be done next:

Addons

Check (and maybe adjust) the running of addons, as I haven't thought much about them. Maybe they should be run on success, or on failure of the last retry. The failure should be indicated somehow.

EDIT: It seems that addons aren't run in this case. Maybe I should add an extra hook that handles a failure. I am not sure whether it should handle all failures, or just a failure when no further attempt is possible.

Optimize the code

Currently, all processing is fault-tolerant, but a failure causes the job to fail at the end. When it is retried, everything is retried. I see two potential optimizations:

a. Easy: all tries except the last one are fail-fast (after the first failure, you know you are going to retry); only the last try is fault-tolerant.
b. More optimal: do not reprocess attachments that have been successfully processed. This might need deeper code adjustments.
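A minimal sketch of option (a); `isLastTry` is a hypothetical flag that would come from the job's retry counter:

```scala
import cats.effect.IO

// Earlier tries abort on the first error; only the final try
// tolerates failures and keeps raw attachments.
def processAttachment[A](convert: IO[A], isLastTry: Boolean): IO[Either[Throwable, A]] =
  if (isLastTry) convert.attempt   // fault-tolerant: record the error, move on
  else convert.map(Right(_))       // fail-fast: any error fails the whole job
```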

Adjust multiple places

So far only ConvertPdf and TextExtraction are adjusted, but it should be easy to adjust the others to use AttemptUtils, too.

Test coverage

I'll have to think about proper testing.

v6ak commented 9 months ago

I've made AttachmentPreview and AttachmentPageCount failsafe, in a slightly different way that is aligned with the current approach: when there is a failure, the other attachments are still processed (they used to be skipped), errors are logged (status quo), and processing continues despite any error (status quo).

The adjusted code is slightly simpler than before, because the error handling of those two was moved to a common method in AttemptUtils.

EDIT: I haven't adjusted the processing in the analysisOnly method, and I am not sure whether it is really important to adjust it.