collectiveaccess / providence

Cataloguing and data/media management application
GNU General Public License v3.0

"Image is being processed" after data import with background processing enabled #1522

Closed: jiru closed this issue 8 months ago

jiru commented 8 months ago

The problem

Using providence branch dev/php8 commit 1c0630527a659d8f3b5f3cfda611d920455b9f7f.

After mass-importing 250 objects along with their representation pictures, most of them show "Image is being processed" in place of the image, and they stay that way indefinitely, even though caUtils process-task-queue is configured in crontab and confirmed to run every minute. This happens both with caUtils import-data and with the web interface.

It’s a rather complex issue so I did my own research already. caUtils list-task-queue reveals that the missing images have not been processed by the mediaproc handler:

-----------------------------------------------
Task id: 404
notes: array {
    "errors": [
        "Record ca_object_representations.field = 477 did not exist; queued file was discarded"
    ],
    "notes": [],
    "processing_time": " 0.008"
}
parameters: array {
    "TABLE": "ca_object_representations",
    "FIELD": "media",
    "PK": "representation_id",
    "PK_VAL": 477,
    "INPUT_MIMETYPE": "image/png",
    "FILENAME": "<REDACTED>/media/collectiveaccess/images/4/27159_ca_object_representations_media_477_original.png",
    "VERSIONS": {
        "icon": {
            "VOLUME": "images"
        },
        "iconlarge": {
            "VOLUME": "images"
        },
        "tiny": {
            "VOLUME": "images"
        },
        "thumbnail": {
            "VOLUME": "images"
        },
        "widethumbnail": {
            "VOLUME": "images"
        },
        "small": {
            "VOLUME": "images"
        },
        "preview": {
            "VOLUME": "images"
        },
        "preview170": {
            "VOLUME": "images"
        },
        "widepreview": {
            "VOLUME": "images"
        },
        "medium": {
            "VOLUME": "images"
        },
        "mediumlarge": {
            "VOLUME": "images"
        },
        "large": {
            "VOLUME": "images"
        },
        "page": {
            "VOLUME": "images"
        },
        "tilepic": {
            "VOLUME": "tilepics"
        }
    },
    "OPTIONS": {
        "original_filename": "<REDACTED>"
    },
    "DONT_DELETE_OLD_MEDIA": true
}
handler_name: Background media file processor
by: Called from command line
error_code: 0
created: November 3 2023 at 10:01:55 (1699002115)
completed_on: November 3 2023 at 10:02:02 (1699002122)
Input format: image/png
Input file size: 1.41mb
Data source: ca_object_representations:media:477
Temporary filename: <REDACTED>/media/collectiveaccess/images/4/27159_ca_object_representations_media_477_original.png
Versions output: icon, iconlarge, tiny, thumbnail, widethumbnail, small, preview, preview170, widepreview, medium, mediumlarge, large, page, tilepic
-----------------------------------------------

The error is "Record ca_object_representations.field = 477 did not exist; queued file was discarded", but there is definitely an existing representation with id 477 in the database after the import is finished.

Running the import with the --direct option results in most images showing up, but some of them still show as "Image is being processed".

Running the import with background processing disabled in setup.php solves the issue.

My analysis

I believe the imported data is added to the database from within an SQL transaction, while the mediaproc background task is inserted outside of that transaction. As a result, the cronjob can see new entries in the ca_task_queue table but cannot see the associated data because it is not committed yet.

The fact that the cronjob is executed every minute makes it very likely to happen. Lowering the job frequency would make the problem less likely to occur, but it could still happen if an import happens to be running when the cronjob starts.
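To make the suspected timeline concrete, here is a toy Python model of it. This is only a sketch of the race as I understand it, not CollectiveAccess code: ToyDatabase, ToyTransaction and run_worker are hypothetical names, and the model simply assumes that rows written inside a transaction are invisible to other readers until commit, while the task row is written immediately.

```python
class ToyDatabase:
    """Committed rows are visible to every reader (hypothetical toy model)."""

    def __init__(self):
        self.committed = {"ca_object_representations": {}, "ca_task_queue": {}}

    def begin(self):
        return ToyTransaction(self)

    def insert_autocommit(self, table, pk, row):
        # Written outside any transaction: visible immediately.
        self.committed[table][pk] = row

    def read(self, table, pk):
        return self.committed[table].get(pk)


class ToyTransaction:
    """Buffers writes until commit, so other readers cannot see them."""

    def __init__(self, db):
        self.db = db
        self.pending = []

    def insert(self, table, pk, row):
        self.pending.append((table, pk, row))

    def commit(self):
        for table, pk, row in self.pending:
            self.db.committed[table][pk] = row
        self.pending.clear()


def run_worker(db):
    """Consume every visible task once, like a cron-driven queue run."""
    results = {}
    for task_id, task in list(db.committed["ca_task_queue"].items()):
        if db.read(task["TABLE"], task["PK_VAL"]) is None:
            results[task_id] = "record did not exist; queued file was discarded"
        else:
            results[task_id] = "processed"
        del db.committed["ca_task_queue"][task_id]
    return results


db = ToyDatabase()

# The import writes the representation inside a still-open transaction...
import_tx = db.begin()
import_tx.insert("ca_object_representations", 477, {"media": "original.png"})

# ...but the mediaproc task is queued outside of it.
db.insert_autocommit(
    "ca_task_queue", 404, {"TABLE": "ca_object_representations", "PK_VAL": 477}
)

during_import = run_worker(db)   # cron fires while the import is still running
import_tx.commit()
after_commit = run_worker(db)    # the task is already gone by now
row_after_commit = db.read("ca_object_representations", 477)

print(during_import)
print(after_commit)
print(row_after_commit)
```

In this model the worker discards task 404 during the import, and once the import commits the representation exists but no task is left to process it, which matches the permanent "Image is being processed" state.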

I think the proper fix is to insert the mediaproc task within the same SQL transaction as the imported data.
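A minimal sketch of that idea, using Python's sqlite3 module as a stand-in for the real database. The table and column names here are simplified assumptions and not the actual CollectiveAccess schema. The importer writes the representation row and the task row in one transaction, and a second connection plays the role of the cron worker.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    db_path = os.path.join(tmp_dir, "toy.db")

    # isolation_level=None turns off implicit transactions, so BEGIN and
    # COMMIT below are the only transaction boundaries.
    importer = sqlite3.connect(db_path, isolation_level=None)
    worker = sqlite3.connect(db_path, isolation_level=None)

    # Simplified, hypothetical schema for illustration only.
    importer.execute(
        "CREATE TABLE representations "
        "(representation_id INTEGER PRIMARY KEY, media TEXT)"
    )
    importer.execute(
        "CREATE TABLE task_queue (task_id INTEGER PRIMARY KEY, pk_val INTEGER)"
    )

    def worker_view():
        """Return (visible task count, visible representation count)."""
        rows = worker.execute(
            "SELECT (SELECT COUNT(*) FROM task_queue), "
            "(SELECT COUNT(*) FROM representations)"
        ).fetchall()
        return rows[0]

    # Both the record and its processing task go into the same transaction.
    importer.execute("BEGIN")
    importer.execute("INSERT INTO representations VALUES (477, 'original.png')")
    importer.execute("INSERT INTO task_queue VALUES (404, 477)")

    during_import = worker_view()   # worker polls before the commit
    importer.execute("COMMIT")
    after_commit = worker_view()    # worker polls after the commit

    importer.close()
    worker.close()

print(during_import)
print(after_commit)
```

With both inserts in one transaction, the worker connection should see neither row while the import is open and both rows once it commits, so a task can never become visible before the record it refers to.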

collectiveaccess commented 8 months ago

How did you import this data? Via the data importer or media importer?

collectiveaccess commented 8 months ago

I believe that your analysis is correct. Transaction handling is the cause here. I've just pushed fixes for this in the dev/fix-1522 branch. Can you please try it and let us know if it resolves the issues you are experiencing?

https://github.com/collectiveaccess/providence/tree/dev/fix-1522

jiru commented 8 months ago

It was via the data importer. Thank you very much for the fix, I will get back to you as soon as I have a chance to try it out.

jiru commented 8 months ago

@collectiveaccess I applied your patch 0211f4174581b9a0622e1460f52950552f4da5dd but it looks like it does not solve the problem. New mediaproc tasks are still visible while the import is running.

jiru commented 8 months ago

@collectiveaccess Okay, there was a typo in your patch: you need to apply s/transction/transaction/ to app/lib/BaseModel.php, then it works fine.

jiru commented 8 months ago

I will let you close this issue. Thanks again for your help!

collectiveaccess commented 8 months ago

Yeah, oops. Thanks for testing.