archivematica / Issues

Issues repository for the Archivematica project
GNU Affero General Public License v3.0

False positive starts unpacking task #328

Open lwo opened 5 years ago

lwo commented 5 years ago

Expected behaviour

The ingest should not stop after an error due to a false positive.

Current behaviour

During the SIP phase, this procedure runs: "Determine if transfer still contains packages". It finds a file with a ".001" extension (https://repository-pipeline-2.collections.iisg.org/tasks/0a9ec0cc-1776-4f75-9041-f4f716b9179b/) and reports:

%transferDirectory%objects/diskettes_5-25/14/RPS_GAME.001 is extractable and has not yet been extracted.

This identification is not correct: the file is an ASCII file. This rule is responsible for the misidentification: /fpr/fprule/bdfc3ef8-99a6-48e2-9017-8c39010a622a/

The pipeline then runs "Extract contents from compressed archives" and fails, as the file is not a package but an ASCII file.

Disabling the identification rule gives us a valid SIP.

Steps to reproduce

Add an RPS_GAME.001 file to the transfer. In our case, the RPS_GAME.001 file was inside a zip; it was unpacked and then misidentified.

Your environment (version of Archivematica, OS version, etc)

Ubuntu 16.04 LTS; fork: https://github.com/IISH/archivematica



sallain commented 5 years ago

Hi @lwo - which file identification tool did you use that resulted in the misidentification of the .001 file? Wondering if this should be cross-posted to the Siegfried or Fido projects.

ross-spencer commented 5 years ago

Taking a brief look at this for IISH.

The problem stems from the custom Siegfried signature file we employ, which contains four custom signatures.

The results from the two microservices that cause the problem (identify file formats within the extract packages microservice group, and determine if transfer still contains packages) are as follows:

IDCommand: Identify using Siegfried 1.7.10
IDCommand UUID: 75290b14-2931-455f-bdde-3b4b3f8b7f15
IDTool: Siegfried
IDTool UUID: 454df69d-5cc0-49fc-93e4-6fbb6ac659e7
File: (cc6e01c3-a1cd-4583-8d81-a4bc6d22568e) /var/archivematica/sharedDirectory/watchedDirectories/workFlowDecisions/extractPackagesChoice/fp1-f4ed20aa-df48-45ec-9bf6-60090d24e4f5/objects/compresed-example.zip-2019-05-23T13_20_52.315803_00_00/fake-zip-volume.001
Command output: archivematica-fmt/4
/var/archivematica/sharedDirectory/watchedDirectories/workFlowDecisions/extractPackagesChoice/fp1-f4ed20aa-df48-45ec-9bf6-60090d24e4f5/objects/compresed-example.zip-2019-05-23T13_20_52.315803_00_00/fake-zip-volume.001 identified as a Raw Disk Image

And:

%transferDirectory%objects/compresed-example.zip-2019-05-23T13_20_52.315803_00_00/fake-zip-volume.001 is extractable and has not yet been extracted.

So you can see the archivematica-fmt/4 signature triggering the extraction. For a file that cannot truly be extracted, this will of course fail.
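One possible guard, sketched here under assumptions, is to check how strong the identification was before treating a file as extractable. The sketch assumes input shaped like Siegfried's `sf -json` output, where each match carries a `warning` field; the function names and the sample record are illustrative and not part of Archivematica.

```python
import json

# Warning text Siegfried attaches to matches that rest on the extension alone.
EXTENSION_ONLY_WARNING = "match on extension only"


def is_extension_only_match(match):
    """Return True if a match record rests on the file extension alone."""
    return EXTENSION_ONLY_WARNING in (match.get("warning") or "")


def extension_only_files(sf_json):
    """List filenames for which every reported match is extension-only.

    `sf_json` is text assumed to be shaped like the output of `sf -json`.
    """
    report = json.loads(sf_json)
    flagged = []
    for file_record in report.get("files", []):
        matches = file_record.get("matches", [])
        if matches and all(is_extension_only_match(m) for m in matches):
            flagged.append(file_record["filename"])
    return flagged


# Hypothetical sample modelled on the results reported in this thread.
SAMPLE = json.dumps({
    "files": [
        {"filename": "false-positive-text.001",
         "matches": [{"id": "archivematica-fmt/4",
                      "basis": "extension match 001",
                      "warning": "match on extension only"}]},
        {"filename": "false-positive.txt",
         "matches": [{"id": "x-fmt/111",
                      "basis": "extension match txt; text match ASCII",
                      "warning": ""}]},
    ]
})

print(extension_only_files(SAMPLE))
```

A file flagged this way could be left as an ordinary object rather than queued for extraction, though whether that is the right policy for genuine raw disk images is an open question.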

The result can also be verified in Siegfried by looking at the results on the command line, e.g. by logging into the docker-container in this instance and setting up some sample files:

---
siegfried   : 1.7.10
scandate    : 2019-05-23T13:15:06Z
signature   : archivematica.sig
created     : 2018-09-19T11:32:33+10:00
identifiers : 
  - name    : 'archivematica'
    details : 'fddXML.zip (DROID_SignatureFile_V94.xml, container-signature-20180917.xml); extensions: archivematica-fmt2.xml, archivematica-fmt3.xml, archivematica-fmt4.xml, archivematica-fmt5.xml'
---
filename : 'false-positive.001'
filesize : 0
modified : 2019-05-23T13:12:59Z
errors   : 'empty source'
matches  :
  - ns      : 'archivematica'
    id      : 'archivematica-fmt/4'
    format  : 'Raw disk image'
    version : 
    mime    : 
    basis   : 'extension match 001'
    warning : 'match on extension only'
---
filename : 'false-positive-text.001'
filesize : 45
modified : 2019-05-23T13:15:04Z
errors   : 
matches  :
  - ns      : 'archivematica'
    id      : 'archivematica-fmt/4'
    format  : 'Raw disk image'
    version : 
    mime    : 
    basis   : 'extension match 001'
    warning : 'match on extension only'
---
filename : 'false-positive.txt'
filesize : 21
modified : 2019-05-23T13:12:28Z
errors   : 
matches  :
  - ns      : 'archivematica'
    id      : 'x-fmt/111'
    format  : 'Plain Text File'
    version : 
    mime    : 'text/plain'
    basis   : 'extension match txt; text match ASCII'
    warning : 

One of these files is plain text, another is plain text with a .001 extension, and the third is an empty file with a .001 extension.

Fido produces the same identification result. The command we run looks as follows:

    cmd = ['fido', '-bufsize', '1048576',
           '-loadformats', '/usr/lib/archivematica/archivematicaCommon/externals/fido/archivematica_format_extensions.xml',
           os.path.abspath(file_)]

And the equivalent identification microservice:

IDCommand: Identify using Fido 1.3.12
IDCommand UUID: 213d1589-c255-474f-81ac-f0a618181e40
IDTool: Fido
IDTool UUID: c33c9d4d-121f-4db1-aa31-3d248c705e44
File: (8b3abf06-0581-468c-97a5-cf24eeb8139d) /var/archivematica/sharedDirectory/watchedDirectories/workFlowDecisions/extractPackagesChoice/fp2-22280d44-3b8e-49ca-a55b-6284685bff88/objects/compresed-example.zip-2019-05-23T13_25_54.659000_00_00/fake-zip-volume.001
Command output: archivematica-fmt/4
/var/archivematica/sharedDirectory/watchedDirectories/workFlowDecisions/extractPackagesChoice/fp2-22280d44-3b8e-49ca-a55b-6284685bff88/objects/compresed-example.zip-2019-05-23T13_25_54.659000_00_00/fake-zip-volume.001 identified as a Raw Disk Image

Information about the signature

The roy inspect function does not reveal much about the signature:

roy inspect archivematica-fmt/4
RAW DISK IMAGE (ARCHIVEMATICA-FMT/4)
globs: *.dd, *.dd.001, *.001
superiors: none

The signature can also be viewed in the Siegfried repository.
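To illustrate why any file ending in .001 is caught, here is a minimal sketch that approximates the signature's glob matching with Python's `fnmatch`. This is not Siegfried's own matching code; the helper name `matches_raw_disk_image_glob` is hypothetical, and the glob list is copied from the `roy inspect` output.

```python
from fnmatch import fnmatch

# Globs reported by `roy inspect archivematica-fmt/4`.
RAW_DISK_IMAGE_GLOBS = ["*.dd", "*.dd.001", "*.001"]


def matches_raw_disk_image_glob(filename):
    """Return True if any of the signature's globs matches the filename."""
    return any(fnmatch(filename, pattern) for pattern in RAW_DISK_IMAGE_GLOBS)


# File names taken from this thread.
for name in ["RPS_GAME.001", "fake-zip-volume.001", "false-positive.txt"]:
    print(name, matches_raw_disk_image_glob(name))
```

Because the signature has no byte-level pattern, the file name is the only evidence, so an ASCII file and a genuine disk image segment are indistinguishable to it.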

Way forward

Short of adding a greater number of disk image file formats to PRONOM, perhaps we can question the merits of failing the transfer if the extract fails. The attempt to extract could itself be considered a secondary validation measure. The conditions under which the microservice fails could also be made more nuanced.
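As a rough sketch of what a more nuanced failure condition might look like, the following treats a failed extraction as a skip when the identification rested on the extension alone, and as a real failure otherwise. Everything here is hypothetical: `ExtractionError`, `try_extract`, and the `extension_only` flag are illustrative names and do not reflect the actual Archivematica microservice code.

```python
import logging

logger = logging.getLogger("extract_sketch")


class ExtractionError(Exception):
    """Hypothetical error raised when an extractor cannot unpack a file."""


def try_extract(path, extractor, extension_only):
    """Run `extractor(path)` and classify the outcome.

    Returns "extracted" on success, or "skipped" when extraction fails for a
    file that was identified by extension only (a likely false positive).
    Re-raises ExtractionError for files with a stronger identification.
    """
    try:
        extractor(path)
    except ExtractionError as err:
        if extension_only:
            # Weak identification plus a failed extract: keep the file as an
            # ordinary object instead of failing the whole transfer.
            logger.warning("Not extracting %s: %s", path, err)
            return "skipped"
        raise
    return "extracted"


def failing_extractor(path):
    """Stand-in extractor that always fails, for demonstration only."""
    raise ExtractionError("not an archive")


print(try_extract("RPS_GAME.001", failing_extractor, extension_only=True))
```

Under this policy the failed extract acts as the secondary validation step: it confirms the weak identification was wrong without stopping the ingest.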