artefactual-sdps / preprocessing-sfa

preprocessing-sfa is an Enduro preprocessing workflow for SFA SIPs

Problem: SIP with a missing content directory errors without information in Enduro UI #32

Closed (sallain closed this issue 3 weeks ago)

sallain commented 2 months ago

Describe the bug

Christa uploaded a package with the following structure:

SIP_20240705_Vecteur_344423
└── header
    ├── content
    │   └── d_0000001
    │       ├── 00000001.jp2
    │       ├── ...
    │       └── Prozess_Digitalisierung_PREMIS.xml
    ├── metadata.xml
    └── xsd
        ├── ablieferung.xsd
        ├── ...
        └── zusatzDaten.xsd

The package is missing the content directory that we would normally expect to find at the top level of the tree. Temporal shows that the SIP type was correctly identified ("Type": "VecteurSIP") and that the missing content folder was detected ([{"Failures":["Content folder is missing"]}]). However, a few activities later, Enduro tried to create premis.xml and save it to the content directory, which failed:

{
  "message": "lstat /var/lib/enduro/preprocessing/enduro1540872538/extract4239907984/SIP_20240705_Vecteur_344423/content: no such file or directory",
  "source": "GoSDK",
  "stackTrace": "",
  "encodedAttributes": null,
  "cause": {
    "message": "no such file or directory",
    "source": "GoSDK",
    "stackTrace": "",
    "encodedAttributes": null,
    "cause": null,
    "applicationFailureInfo": {
      "type": "Errno",
      "nonRetryable": false,
      "details": null
    }
  },
  "applicationFailureInfo": {
    "type": "PathError",
    "nonRetryable": false,
    "details": null
  }
}

And in the Enduro UI, none of the child workflow tasks are displayed:

![image](https://github.com/artefactual-sdps/preprocessing-sfa/assets/4612276/6441026b-adc6-46a9-8750-dc6607b2c4c3)

To Reproduce

Steps to reproduce the behavior:

  1. Run the attached package
  2. Review the Enduro UI and Temporal

Expected behavior

I expect to see the Identify SIP structure job displayed in the Enduro UI with a failure message indicating that the content directory is missing.

fiver-watson commented 1 month ago

A thought on this: if we solve this in the Identify SIP structure activity, that activity starts to duplicate the function of the Validate structure activity. It also sounds like the Validate structure activity did in fact notice that the top-level content directory was missing, but processing continued (to see if there were other errors).

Right now, the Validate structure activity just looks for specific top-level directories. A more robust check would compare the structure described in the SIP's metadata files against the actual SIP structure and validate on that basis. We can discuss the complexity of this internally, and file a separate issue if agreed.

Additionally, it seems like we are going a bit too far before stopping. We should run all validation checks (metadata, structure, file formats) and, if there are any errors, stop before trying to write a PREMIS file, fail the ingest, and report to the user.

aseles13 commented 18 hours ago

Tested, and Enduro is catching empty directories and missing content.