R0Wi-DEV / workflow_ocr

This is a Nextcloud Workflow App which enables you to process files via OCR on serverside.
GNU Affero General Public License v3.0

OCR workflow should maintain modification date of original file #256

Open ferdiga opened 1 month ago

ferdiga commented 1 month ago

Describe the bug

We plan to load historical PDF files into the database and want to make them searchable using the OCR workflow. The workflow changes each file's modification date, so the important historical context carried by that date is lost, which limits the usability of this great feature.

The ocrmypdf maintainer confirms that ocrmypdf must change the modification date to comply with the standard.

For the OCR workflow I see 2 options:

As a workaround, I have written a small Python script that prepends the original modification date to every PDF file that does not already have a date at its beginning. I want to clarify the situation before I proceed, though.
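A minimal sketch of what such a script might look like, under two assumptions that the issue does not confirm: that "the beginning of the file" means the file name, and that the date is written as YYYY-MM-DD. The names `DATE_PREFIX`, `dated_name` and `prepend_dates` are hypothetical and are not taken from the actual script.

```python
import re
from datetime import datetime
from pathlib import Path

# Assumption: a leading date in the file name looks like YYYY-MM-DD.
DATE_PREFIX = re.compile(r"^\d{4}-\d{2}-\d{2}")


def dated_name(path: Path) -> str:
    """Return the file name with its modification date prepended,
    or the unchanged name if it already starts with a date."""
    if DATE_PREFIX.match(path.name):
        return path.name
    mtime = datetime.fromtimestamp(path.stat().st_mtime)  # local time
    return f"{mtime:%Y-%m-%d}_{path.name}"


def prepend_dates(folder: Path) -> list[Path]:
    """Rename every PDF in folder that lacks a date prefix; return the new paths."""
    renamed = []
    for pdf in sorted(folder.glob("*.pdf")):
        new_name = dated_name(pdf)
        if new_name != pdf.name:
            target = pdf.with_name(new_name)
            pdf.rename(target)  # a rename leaves the file's own mtime untouched
            renamed.append(target)
    return renamed
```

Because the date ends up in the name, it survives any later rewrite of the file contents, which is the point of the workaround.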

System

How to reproduce

Steps to reproduce the behavior: trigger the OCR Workflow

ferdiga commented 1 month ago

Additional remark: I would go for "restore the original modification date after adding the OCR layer." because

R0Wi commented 1 month ago

Hi @ferdiga, thanks for the comprehensive explanation of your use case. As you already described, changing a file (that is, adding the OCR layer) automatically changes its last-modified date, which is the expected behaviour when new content is written to a file on any system. The app itself just uses the NC API to create a new file version here. The file_put_contents call it uses simply writes the file to disk and creates a new file version in Nextcloud, without any option to change file metadata (see here).

A possible way to implement this after the new file version has been written would be to use touch with a second argument (the old timestamp). In the UI we'd need to have an additional parameter like "Maintain original modification date". If set to true, we'd need to store the original modification date before creating the new file version, and write it back after it has been created.
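The store-then-restore idea can be sketched on a plain filesystem as follows. This is only an illustration in Python, not the app's PHP code and not the Nextcloud API: `os.utime` stands in for a touch call with an explicit timestamp, and the helper name `write_preserving_mtime` is hypothetical.

```python
import os


def write_preserving_mtime(path: str, new_content: bytes) -> None:
    """Overwrite path with new_content, then restore its previous timestamps."""
    original = os.stat(path)  # 1. remember the timestamps before writing

    with open(path, "wb") as fh:
        fh.write(new_content)  # 2. this write sets the mtime to "now"

    # 3. write the old access and modification times back (nanosecond precision)
    os.utime(path, ns=(original.st_atime_ns, original.st_mtime_ns))
```

In the app, step 3 would only run when the proposed "Maintain original modification date" option is enabled.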

Possible Workaround

For the time being one could "chain" the Workflow OCR with the Workflow Script:

  1. Create the OCR Workflow and choose "Assign tags after OCR". Choose any tag you want to assign after successful OCR (for example "OCR success")
  2. Create a second Workflow with the Workflow Script. Use the tag assignment for "OCR success" as a trigger for this workflow and implement your modification date magic directly within the triggered script

ferdiga commented 1 month ago

Hi, thanks for looking into this. Option 2: once the file has the new tag, it also has the new timestamp, so the original date is no longer available at that point. IMHO not the way to go.

What I probably will do

Another script is necessary for digitally signed files: print the file rather than copy it, which destroys the signature. The original must be preserved (ocrmypdf will not touch it), but we still want to have a searchable version.