denshoproject / ddr-cmdln

Command-line tools for automating the Densho Digital Repository's various processes.

File import should ignore certain attributes in CSV, depending on external/internal value #145

Closed sarabeckman closed 5 years ago

sarabeckman commented 5 years ago

Today I was reviewing the metadata for the binary files I had imported for ddr-densho-379 and I noticed that several important pieces of data weren't present including the thumb values, sha1, sha256, md5, and size. I've included a sample JSON file. I didn't receive any error messages when importing the files or when I committed or synced the collections. I used my local VM for the import using both master 2.7.1 and 2.8.0.

This problem isn't present in everything I've imported. I noticed the missing data corresponds to imports I completed with CSV files in which the thumb, sha1, sha256, md5, and size columns were present but empty.

I thought the importer would create data for the attribute even with the empty value in the csv.

filemezzanine1.xlsx ddr-densho-379-1-master-dc0ae09755.txt

gjost commented 5 years ago

Doesn't fix this problem, but TODO: make DDR ignore data in the sha1/sha256/md5 columns for existing records.

GeoffFroh commented 5 years ago

Certain values in CSV File import should be treated differently depending on whether it is an internal or external file (i.e., the external attribute is 0 or 1).

Logic flow:

if it is a new file import:
    if the file is stored in the DDR git-annex (i.e., is internal):
        ignore 'sha1', 
               'sha256', 
               'md5', 
               'size', 
               'mimetype' values in csv
        compute those values during ingest of 'basename_orig' binary
    else (the file must be stored outside the DDR; must be external):
        require and use 'sha1', 
                        'sha256', 
                        'md5', 
                        'size',
                        'mimetype' values in csv 
else (the operation must be a batch metadata update):
    ignore 'basename_orig',
           'external',
           'role',
           'sha1',
           'sha256', 
           'md5', 
           'size', 
           'mimetype' values in csv

GeoffFroh commented 5 years ago

TODO: Identify and fix Files with this problem (i.e., missing checksum data) in existing collections.

sarabeckman commented 5 years ago

I've reviewed collections that had files ingested through ddrimport file. The only other collection I've found this issue with is ddr-csujad-13, in objects 1-15, 17, 18, 27, 30-32 -- only master files.

gjost commented 5 years ago

Hey @GeoffFroh in the pseudocode, there's also the case of files that exist in /tmp/ddrshared and for which we want to calculate the hashes, but we don't want to actually ingest. What should those be called?

GeoffFroh commented 5 years ago

There are four cases that ddrimport needs to support: new-internal, new-external-nobin, new-external-bin and update

See: https://docs.google.com/document/d/1d2PeMIh7GRQn-Qt9P61ETpKa7alL_oWH3hiBnnOM-mE/edit?usp=sharing

The update case is not in the chart in the supplemental doc. The current code already detects that case based on whether the csv contains File identifiers (i.e., indicating "update this File") or an Entity identifier (i.e., indicating "attach this new File to the specified Entity").

Logic flow:

if it is a new File import:
    if the File is stored in the DDR git-annex (i.e., is internal):
    ('new-internal')
        ignore 'sha1', 
               'sha256', 
               'md5', 
               'size', 
               'mimetype' values in csv
        compute those values during ingest of 'basename_orig' binary

    else (the File must be stored outside the DDR; must be external):

        if checksum, size and mimetype vals present in csv:
        (the File binary is not available locally for checksumming, etc.)
        ('new-external-nobin')
            require and use 'sha1', 
                            'sha256', 
                            'md5', 
                            'size',
                            'mimetype' values in csv

        else (the File is locally available for checksumming):
        ('new-external-bin')
            compute 'sha1', 'sha256', 'md5', 'size', 'mimetype' from binary

else (the operation must be a metadata-only update):
('update')
    ignore 'basename_orig',
           'external',
           'role',
           'sha1',
           'sha256', 
           'md5', 
           'size', 
           'mimetype' values in csv
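
The logic flow above can be sketched in Python roughly as follows. This is a minimal illustration, not the actual ddr-cmdln code: the names `classify_import`, `has_all_hash_values`, and `fields_to_ignore` are hypothetical, `is_update` stands in for the existing File-vs-Entity identifier check, and treating `external` as the strings "0"/"1" is an assumption about how the CSV cell arrives.

```python
HASH_FIELDS = ("sha1", "sha256", "md5", "size", "mimetype")
UPDATE_IGNORED = ("basename_orig", "external", "role") + HASH_FIELDS


def has_all_hash_values(row):
    """True only if every hash-related column is present AND non-empty."""
    return all(str(row.get(field) or "").strip() for field in HASH_FIELDS)


def classify_import(row, is_update):
    """Return one of the four case names used in the logic flow."""
    if is_update:
        return "update"
    # Assumption: a missing or blank 'external' cell means internal.
    external = str(row.get("external") or "0").strip() == "1"
    if not external:
        return "new-internal"
    if has_all_hash_values(row):
        # Binary is not available locally; the CSV values must be used.
        return "new-external-nobin"
    # Values missing or blank: the binary is expected to be available.
    return "new-external-bin"


def fields_to_ignore(case):
    """CSV columns whose values should be discarded for a given case."""
    if case == "update":
        return UPDATE_IGNORED
    if case in ("new-internal", "new-external-bin"):
        return HASH_FIELDS  # recomputed from the binary instead
    return ()  # new-external-nobin: CSV values are required and used
```

Note that in this sketch a row with only some of the hash values filled in falls through to 'new-external-bin'; whether that should instead be an error is a design choice the pseudocode leaves open.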

sarabeckman commented 5 years ago

  1. ddrimport file with internal files and no sha1/sha256/md5/size columns present in CSV -- runs successfully.
  2. ddrimport file with internal files and empty sha1/sha256/md5/size columns -- does not run successfully: it does not write data to those fields in the file JSON.
  3. ddrimport file with external files with data in sha1/sha256/md5/size -- runs successfully.
  4. ddrimport file with external files and empty sha1/sha256/md5/size columns -- errors out. Traceback attached as a text doc: ddrimportfile_external_traceback.txt

I'll email @gjost the CSV I used to import files to ddr-testing-40274

Update: attached ddrimportfilecsvs.zip
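
The results above differ only in whether the hash columns are absent or present-but-empty. One way to make the two behave the same is to drop blank hash cells before the row reaches the importer. The sketch below is an illustration under assumptions, not the fix that was committed: `drop_empty_hash_cells` and `read_rows` are hypothetical names, and the sample row values are made up.

```python
import csv
import io

HASH_FIELDS = ("sha1", "sha256", "md5", "size")


def drop_empty_hash_cells(row):
    """Remove hash columns whose cell is blank, so an empty column
    behaves exactly like a column that is absent from the CSV."""
    return {
        key: value
        for key, value in row.items()
        if not (key in HASH_FIELDS and not (value or "").strip())
    }


def read_rows(csv_text):
    """Parse CSV text and normalize each row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [drop_empty_hash_cells(row) for row in reader]


# Hypothetical row: sha1 is blank, md5 is filled in.
rows = read_rows("id,basename_orig,sha1,md5\nx,a.jpg,,abc\n")
print(sorted(rows[0]))  # the blank 'sha1' key is gone, 'md5' is kept
```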

gjost commented 5 years ago

Test cases

gjost commented 5 years ago

Fixed (hopefully) in ddr-cmdln commit 9ac1bfa for package ddrlocal-develop~2.8.7-4.

gjost commented 5 years ago

@sarabeckman Which cases are still not working for you?

gjost commented 5 years ago

Regarding the test CSVs, I also need the binary files you tested with, and it would really help if you could include the commands you used to test import everything. Otherwise I'm not really seeing the same thing as you.

Please have a look at the test CSVs in /opt/ddr-local/ddr-cmdln/ddr/test/ddrimport/. Here are the test CSVs in the order that the unit tests load them:

Order  Filename                                         Notes
1      ddrimport-entity-new.csv                         Creates some entities.
2      ddrimport-entity-update.csv                      Updates the entities.
3      ddrimport-files-import-external.csv              Imports external files with hashes.
4      ddrimport-files-import-external-emptyhashes.csv  Tries to import external files with empty hashes (currently fails).
5      ddrimport-files-import-external-nohashes.csv     Tries to import external files with no hash columns (currently fails).
6      ddrimport-files-import-internal.csv              Imports internal files. At least one row in CSV has blank/empty hash columns.
7      ddrimport-files-import-internal-nohashes.csv     Imports internal files with no hash columns.
8      ddrimport-file-update.csv                        Updates hash files

sarabeckman commented 5 years ago

The binary files for the CSVs I used for my testing can be found here: /media/qnfs/kinkura/working/csujadimport/processed/oh/ddr-csujad-29/csufccop_jaoh/csufccop_jaoh_preservation/

sarabeckman commented 5 years ago

I took the binary files and CSVs that @gjost uses in his test suite and used them in a test in my own environment. I created ddr-testing-40276. I imported ddrimport-entity-new.csv using ddrimport entity on my Dev VM. I then imported ddrimport-files-import-internal.csv. The file with empty hash fields still had empty hash fields once imported. No errors were thrown during the import.

This is the only case that is currently reported to work in @gjost's tests but doesn't work in my dev VM.

GeoffFroh commented 5 years ago

@sarabeckman retested with three csv files from @gjost.

  1. External with hash vals present in csv. No binaries. Worked as expected.
  2. External with no hash vals present in csv (but empty cols exist). Binaries are present. Did not work. Copied bins with new ddr name, but did not insert vals into ddr entity.json files.
  3. External with no hash cols or vals. Binaries are present. Worked as expected. (Binaries are hashed; hash vals are inserted into file json; binaries are copied to new binaries with new ddr names.)
  4. Internal with no hash vals present in csv (but empty cols exist). Did not work. Same as 2 above.
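
For reference, the "binaries are hashed" step in the working case could look something like the sketch below, which computes the checksum fields named in this issue in a single pass over the file. This is an assumption-laden illustration using only the Python standard library, not the importer's actual code: `describe_binary` is a hypothetical name, and guessing the mimetype from the file extension is a simplification.

```python
import hashlib
import mimetypes
import os


def describe_binary(path, chunk_size=1024 * 1024):
    """Compute sha1, sha256, md5, size and a guessed mimetype for a file."""
    hashers = {name: hashlib.new(name) for name in ("sha1", "sha256", "md5")}
    with open(path, "rb") as handle:
        # Read in chunks so large binaries are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    values = {name: hasher.hexdigest() for name, hasher in hashers.items()}
    values["size"] = os.path.getsize(path)
    # Assumption: the extension is a good enough hint for the mimetype.
    values["mimetype"] = mimetypes.guess_type(path)[0] or ""
    return values
```

The returned dict uses the same keys as the CSV columns, so its values could be merged into a row whose hash cells were missing or blank.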

sarabeckman commented 5 years ago

Tested develop_2.8.7-7 @gjost

  1. external with no hash vals present in csv (but empty cols exist). Binaries are present. -- Import failed with traceback "Exception: No file ingest action for 'nobin, external, noattrs'" -- Personally this outcome is fine with me

  2. internal with no hash vals present in csv (but empty cols exist) -- Did not work. Did not insert vals into file.json, and did not fail as in the external test above.

Traceback from test 1 is attached.

ExternalEmptyHash_traceback.txt

External with hashes present in CSV and External with no hash columns in CSV both worked successfully as they did yesterday.

pkikawa commented 5 years ago

Working as intended. Tested with both internal and external with no hashes. 2.8.7-8