exponential-decay / demystify

Engine for analysis of Siegfried export files and DROID CSV. The tool has three purposes: break the export into its components and store them within a SQLite database; create additional columns to augment the output where useful; and query the SQLite database, outputting results in a readable form useful for analysis by researchers and archivists within digital preservation departments in memory institutions. The tool will find duplicates, unidentified files, blacklisted objects, character encoding issues, and more.
http://www.openplanetsfoundation.org/blogs/2014-06-03-analysis-engine-droid-csv-export
zlib License

JSON output #73

Open ross-spencer opened 2 years ago

ross-spencer commented 2 years ago

Using the following pattern starts to get us into the realm of a decent JSON output.

import json
print(json.dumps(analysis_results.__dict__, sort_keys=True, indent=2))

The output below is roughly all the top level fields. There are at least two problems with the below:

  1. Insensitive naming in the context of a collection. Bad File Names represents neither the end-user's collection nor the intention of the code. We want to express something more like "names that need more care", e.g. we've all seen ���� when we don't want to. This is more about making sure data is preserved in end-to-end workflows.

NB. In the output report, this field is Identifying Non-ASCII and System File Names. The naming comes from the member variable within the analysis results object.

  2. Naming conventions. The naming conventions are all over the place. Who knew PEP8 was a thing even in 2014?! (jk) At least snake case these. Golang JSON naming should also be considered. In Go, capitalized fields are exported and can be read implicitly into code: bof_distance here, which is correct, might become BOFDistance, and collectionsize might become CollectionSize. I don't know if these names can be aliased somehow, given that .__dict__ outputs member variables as-is.
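One possible way to alias the names without renaming the member variables is to transform the keys of `.__dict__` just before dumping. The following is only a sketch under assumptions: `AnalysisResults` is a stand-in class holding a handful of the fields from the example output, and `ALIASES`, `to_snake_case`, and `aliased_dict` are hypothetical names, not part of demystify.

```python
import json
import re

# Hypothetical alias table: run-together or insensitive names cannot be
# fixed mechanically, so they are renamed explicitly. The replacement
# names are illustrative only.
ALIASES = {
    "badFileNames": "file_names_needing_care",
    "badDirNames": "dir_names_needing_care",
    "collectionsize": "collection_size",
}


def to_snake_case(name):
    """Convert camelCase to snake_case; snake_case names pass through."""
    # Break between a lowercase letter or digit and an uppercase letter.
    s = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name)
    # Break between an acronym run and a following capitalised word.
    s = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1_\2", s)
    return s.lower()


def aliased_dict(obj):
    """Return the object's member variables with aliased keys."""
    return {ALIASES.get(k, to_snake_case(k)): v for k, v in vars(obj).items()}


class AnalysisResults:
    """Stand-in for the real analysis results object (assumption)."""

    def __init__(self):
        self.badFileNames = []
        self.bof_distance = []
        self.collectionsize = 397567751
        self.directoryCount = 51
        self.distinctXMLIdentifiers = 0


output = json.dumps(aliased_dict(AnalysisResults()), sort_keys=True, indent=2)
print(output)
```

Names that are entirely lowercase (collectionsize) or that mix an acronym with lowercase text (duplicateHASHlisting) will not split cleanly with a regular expression, so they would need entries in the alias table. A second table keyed the same way could produce Go-style names such as BOFDistance if that is ever wanted.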

There's a lot of data output, but JSON tools might be able to use this sensibly. I should consider documenting examples.

Example output:

  "badDirNames": [], 
  "badFileNames": [
  "binaryidentifiers": [
  "bof_distance": [
  "collectionsize": 397567751, 
  "containercount": 13, 
  "containertypeslist": [
  "dateFrequency": [
  "denylist": null, 
  "denylist_directories": [], 
  "denylist_exts": [], 
  "denylist_filenames": [], 
  "denylist_ids": [], 
  "directoryCount": 51, 
  "distinctFilenameIdentifiers": 1, 
  "distinctOtherIdentifiers": 31, 
  "distinctSignaturePuidcount": 51, 
  "distinctTextIdentifiers": 5, 
  "distinctXMLIdentifiers": 0, 
  "distinctextensioncount": 59, 
  "duplicateHASHlisting": [
  "duplicatespathlist": [
  "eof_distance": [
  "errorlist": [
  "extensionIDOnlyCount": 3, 
  "extensionOnlyIDFrequency": [
  "extensionOnlyIDList": [
  "extmismatchCount": 25, 
  "filecount": 324, 
  "filename": "opf-test-corpus-test-output/opf-test-corpus-sf-analysis", 
  "filename_identifiers": [
  "filenameidentifiers": [
  "filenameidfilecount": 2, 
  "filesincontainercount": 0, 
  "frequencyOfAllExtensions": [
  "hashused": true, 
  "identificationgaps": 53, 
  "identifiedPercentage": "83.6", 
  "identifiedfilecount": 271, 
  "idmethodFrequency": [
  "mimetypeFrequency": [
  "multipleidentificationcount": 0, 
  "namespacecount": 3, 
  "namespacedata": null, 
  "nsdatalist": [
  "rogue_all_dirs": null, 
  "rogue_all_paths": null, 
  "rogue_denylist": [], 
  "rogue_dir_name_paths": [], 
  "rogue_duplicates": [
  "rogue_extension_mismatches": [], 
  "rogue_file_name_paths": [], 
  "rogue_identified_all": [
  "rogue_identified_pronom": [], 
  "rogue_multiple_identification_list": [], 
  "rogue_pronom_ns_id": null, 
  "signatureidentifiedfrequency": [
  "signatureidentifiers": [
  "text_identifiers": [
  "textidentifiers": [
  "textidfilecount": 8, 
  "tooltype": "siegfried: 1.5.0", 
  "totalHASHduplicates": 32, 
  "unidentifiedPercentage": "16.4", 
  "unidentifiedfilecount": 53, 
  "uniqueDirectoryNames": 50, 
  "uniqueExtensionsInCollectionList": [
  "uniqueFileNames": 315, 
  "version": 0, 
  "xml_identifiers": [
  "xmlidentifiers": null, 
  "xmlidfilecount": 0, 
  "zerobytecount": 28, 
  "zerobytelist": [
  "zeroidcount": 40
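As a starting point for documented examples, the sketch below shows how a consumer might read a few of the top-level fields above. It assumes the report has been saved as a JSON string; here `sample` is a hand-copied subset of the example output, and `percent` is a hypothetical helper. Note that the percentage fields are strings in the current output, so they need converting before any arithmetic.

```python
import json

# A hand-copied subset of the example output, standing in for a full report.
sample = """
{
  "filecount": 324,
  "identifiedPercentage": "83.6",
  "identifiedfilecount": 271,
  "unidentifiedPercentage": "16.4",
  "unidentifiedfilecount": 53,
  "zerobytecount": 28
}
"""

report = json.loads(sample)


def percent(part, whole):
    """Return part as a percentage of whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Recompute the percentages from the raw counts and compare them with the
# string values carried in the report.
identified = percent(report["identifiedfilecount"], report["filecount"])
unidentified = percent(report["unidentifiedfilecount"], report["filecount"])

print("identified:", identified, "reported:", report["identifiedPercentage"])
print("unidentified:", unidentified, "reported:", report["unidentifiedPercentage"])
```

Emitting the percentages as numbers rather than strings would remove the conversion step for downstream tools, which may be worth considering alongside the naming changes.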