exponential-decay / demystify

Engine for analysis of Siegfried export files and DROID CSV. The tool has three purposes: to break the export into its components and store them in a SQLite database; to create additional columns that augment the output where useful; and to query the SQLite database, outputting results in a readable form for analysis by researchers and archivists in digital preservation departments at memory institutions. The tool will find duplicates, unidentified files, blacklisted objects, character encoding issues, and more.
http://www.openplanetsfoundation.org/blogs/2014-06-03-analysis-engine-droid-csv-export
zlib License

YAML may be malformed, or contain unexpected information #95

Closed (ross-spencer closed this 2 years ago)

ross-spencer commented 2 years ago

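Apparently some files can be packaged with a \0d character (carriage return, 0x0D) in the filename, which splits the quoted filename value across two lines in the Siegfried YAML output. E.g.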

---
filename : 'Mac RF Test.zip#Mac RF Test\ProTools AppleSingle\Demo Session\Audio Files\Icon
'
filesize : 0
modified : 1996-11-14T13:29:32Z
errors   : 'empty source'
matches  :
  - ns      : 'pronom'
    id      : 'UNKNOWN'
    format  : 
    version : 
    mime    : 
    basis   : 
    warning : 'no match'
---
filename : 'Mac RF Test.zip#__MACOSX\Mac RF Test\ProTools AppleSingle\Demo Session\Audio Files\._Icon
'
filesize : 2790
modified : 1996-11-14T13:29:32Z
errors   : 
matches  :
  - ns      : 'pronom'
    id      : 'fmt/503'
    format  : 'AppleDouble Resource Fork'
    version : '2'
    mime    : 'multipart/appledouble'
    basis   : 'byte match at 0, 8'
    warning : 
---
filename : 'Mac RF Test.zip#Mac RF Test\ProTools AppleSingle\Demo Session\Audio Files\Icon
.as'
filesize : 2829
modified : 2022-03-03T10:20:10Z
errors   : 
matches  :
  - ns      : 'pronom'
    id      : 'fmt/968'
    format  : 'AppleSingle'
    version : '2'
    mime    : 'application/applefile'
    basis   : 'byte match at 0, 8'
    warning : 
---

Visible in the attached ZIP: Mac RF Test.zip (from Tyler Thorstead's iPRES sample set)

Demystify cannot handle this, so we need to detect the formatting issue and react to it so that a report can still be generated.
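
One possible way to detect and react to the broken filename values is sketched below. This is an illustration only, not the change made in the linked pull requests. It assumes the Siegfried YAML is available as a single string, that filename values are single-quoted as in the sample above, and that a stray line break inside the quotes is the only damage. The function name `repair_filenames` and the `%0D` placeholder are hypothetical choices made for this example.

```python
import re

# Matches the start of a Siegfried "filename" entry with a single-quoted value.
FILENAME_START = re.compile(r"^filename\s*:\s*'")


def _is_closed(body):
    """True if a single-quoted YAML scalar body ends with its closing quote."""
    # '' is an escaped quote inside a single-quoted scalar, so drop those first.
    return body.rstrip().replace("''", "").endswith("'")


def repair_filenames(text, placeholder="%0D"):
    """Rejoin filename scalars that were split by a stray line break.

    Returns (repaired_text, count), where count is the number of
    filename entries that had to be rejoined. The placeholder marks
    where the line break was (hypothetical convention for this sketch).
    """
    # splitlines() treats \r, \n and \r\n as line boundaries, so a bare
    # carriage return inside a filename shows up as an extra line.
    lines = text.splitlines()
    out = []
    repaired = 0
    i = 0
    while i < len(lines):
        line = lines[i]
        match = FILENAME_START.match(line)
        if match and not _is_closed(line[match.end():]):
            # Keep absorbing lines until the closing quote turns up.
            while i + 1 < len(lines) and not _is_closed(line[match.end():]):
                i += 1
                line = line + placeholder + lines[i]
            repaired += 1
        out.append(line)
        i += 1
    return "\n".join(out) + "\n", repaired
```

A non-zero count could be logged or surfaced in the report so the analyst knows which entries were altered before the data reached the database.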

ross-spencer commented 2 years ago

Fixed via https://github.com/exponential-decay/demystify/pull/96 and https://github.com/exponential-decay/sqlitefid/pull/14