Closed: sarabeckman closed this issue 5 years ago
Doesn't fix this problem, but TODO: make DDR ignore the sha1/sha256/md5 data columns for existing records.
Certain values in a CSV `File` import should be treated differently depending on whether it is an internal or external file (i.e., whether the `external` attribute is 0 or 1).
Logic flow:

    if it is a new file import:
        if the file is stored in the DDR git-annex (i.e., is internal):
            ignore 'sha1', 'sha256', 'md5', 'size', 'mimetype' values in csv
            compute those values during ingest of 'basename_orig' binary
        else (the file must be stored outside the DDR; must be external):
            require and use 'sha1', 'sha256', 'md5', 'size', 'mimetype' values in csv
    else (the operation must be a batch metadata update):
        ignore 'basename_orig', 'external', 'role', 'sha1', 'sha256', 'md5', 'size', 'mimetype' values in csv
TODO: Identify and fix `File`s with this problem (i.e., missing checksum data) in existing collections.
I've reviewed collections that had files ingested through `ddrimport file`. The only other collection I've found this issue with is ddr-csujad-13, in objects 1-15, 17, 18, 27, and 30-32 -- only master files.
Hey @GeoffFroh, in the pseudocode there's also the case of files that exist in /tmp/ddrshared and for which we want to calculate the hashes, but we don't want to actually ingest. What should those be called?
There are four cases that `ddrimport` needs to support: `new-internal`, `new-external-nobin`, `new-external-bin`, and `update`.
See: https://docs.google.com/document/d/1d2PeMIh7GRQn-Qt9P61ETpKa7alL_oWH3hiBnnOM-mE/edit?usp=sharing
The `update` case is not in the chart in the supplemental doc. The current code already detects that case based on whether the CSV contains `File` identifiers (i.e., indicating "update this File") or an `Entity` identifier (i.e., indicating "attach this new File to the specified Entity").
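The new-vs-update detection described above can be sketched as an identifier check. This is a hypothetical sketch, not the real ddr-cmdln `Identifier` logic; the regex patterns below assume identifiers shaped like the examples in this thread (e.g. `ddr-densho-379-1` for an Entity, `ddr-densho-379-1-master-dc0ae09755` for a File).

```python
import re

# Hypothetical patterns; the real detection lives in ddr-cmdln's Identifier module.
FILE_ID = re.compile(r'^ddr-[a-z]+-\d+-\d+-(master|mezzanine)-[0-9a-f]+$')
ENTITY_ID = re.compile(r'^ddr-[a-z]+-\d+-\d+$')

def operation_for(csv_id):
    """Decide whether a CSV row updates an existing File or attaches a new one."""
    if FILE_ID.match(csv_id):
        return 'update'   # row names a File: metadata-only update
    if ENTITY_ID.match(csv_id):
        return 'new'      # row names an Entity: attach a new File to it
    raise ValueError('unrecognized identifier: %s' % csv_id)
```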
Logic flow:

    if it is a new File import:
        if the File is stored in the DDR git-annex (i.e., is internal):
            ('new-internal')
            ignore 'sha1', 'sha256', 'md5', 'size', 'mimetype' values in csv
            compute those values during ingest of 'basename_orig' binary
        else (the File must be stored outside the DDR; must be external):
            if checksum, size and mimetype vals present in csv:
                (the File binary is not available locally for checksumming, etc.)
                ('new-external-nobin')
                require and use 'sha1', 'sha256', 'md5', 'size', 'mimetype' values in csv
            else (the File is locally available for checksumming):
                ('new-external-bin')
                compute 'sha1', 'sha256', 'md5', 'size', 'mimetype' from binary
    else (the operation must be a metadata-only update):
        ('update')
        ignore 'basename_orig', 'external', 'role', 'sha1', 'sha256', 'md5', 'size', 'mimetype' values in csv
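The four-way branch above can be sketched as a small classifier. This is a minimal sketch under stated assumptions, not the ddr-cmdln implementation: it assumes the CSV row is a plain dict of column name to string value, and that the new-vs-update decision and local binary availability have already been determined elsewhere.

```python
HASH_COLS = ('sha1', 'sha256', 'md5', 'size', 'mimetype')

def classify(row, is_new_import, binary_available):
    """Pick one of the four ddrimport cases for a CSV row.

    row: dict mapping CSV column -> value ('' for empty cells,
         key absent when the column is missing entirely).
    is_new_import: True when the row names an Entity (attach new File),
                   False when it names an existing File (update).
    binary_available: True when the binary can be read locally.
    """
    if not is_new_import:
        # Metadata-only update: basename_orig, external, role, and
        # all hash/size/mimetype columns are ignored.
        return 'update'
    if row.get('external') in ('0', '', None):
        # Internal: hashes are computed during ingest, not read from CSV.
        return 'new-internal'
    if all(row.get(col) for col in HASH_COLS):
        # External, binary not available locally: trust the CSV values.
        return 'new-external-nobin'
    if binary_available:
        # External, binary available locally: compute the values from it.
        return 'new-external-bin'
    raise ValueError('external file with no hashes and no local binary')
```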
- `ddrimport file` with internal files and no sha1/sha256/md5/size columns present in CSV: runs successfully.
- `ddrimport file` with internal files and empty sha1/sha256/md5/size columns: does not run successfully; does not write data to the fields in the file JSON.
- `ddrimport file` with external files with data in sha1/sha256/md5/size: runs successfully.
- `ddrimport file` with external files and empty sha1/sha256/md5/size columns: errors out; traceback attached as a text doc.
ddrimportfile_external_traceback.txt
I'll email @gjost the CSV I used to import files to ddr-testing-40274
Update: attached ddrimportfilecsvs.zip
Test cases
Fixed (hopefully) in ddr-cmdln commit 9ac1bfa for package ddrlocal-develop~2.8.7-4.
@sarabeckman Which cases are still not working for you?
Regarding the test CSVs, I also need the binary files you tested with, and it would really help if you could include the commands you used to test import everything. Otherwise I'm not really seeing the same thing as you.
Please have a look at the test CSVs in /opt/ddr-local/ddr-cmdln/ddr/test/ddrimport/.
Here are the test CSVs in the order that the unit tests load them:
Order | Filename | Notes |
---|---|---|
1 | ddrimport-entity-new.csv | Creates some entities. |
2 | ddrimport-entity-update.csv | Updates the entities. |
3 | ddrimport-files-import-external.csv | Imports external files with hashes. |
4 | ddrimport-files-import-external-emptyhashes.csv | Tries to import external files with empty hashes (currently fails). |
5 | ddrimport-files-import-external-nohashes.csv | Tries to import external files with no hash columns (currently fails). |
6 | ddrimport-files-import-internal.csv | Imports internal files. At least one row in CSV has blank/empty hash columns. |
7 | ddrimport-files-import-internal-nohashes.csv | Imports internal files with no hash columns. |
8 | ddrimport-file-update.csv | Updates the files. |
The binary files for the CSVs I used for my testing can be found here: /media/qnfs/kinkura/working/csujadimport/processed/oh/ddr-csujad-29/csufccop_jaoh/csufccop_jaoh_preservation/
I took the binary files and CSVs that @gjost uses in his test suite and used them in a test in my own environment. I created ddr-testing-40276. I imported ddrimport-entity-new.csv using `ddrimport entity` on my Dev VM. I then imported ddrimport-files-import-internal.csv. The file with empty hash fields still had empty hash fields once imported. No errors were thrown during the import.
This is the only case that is currently reported to work in @gjost's tests but doesn't work in my dev VM.
@sarabeckman retested with CSV files from @gjost:
1. External with hash vals present in CSV; no binaries. Worked as expected.
2. External with no hash vals present in CSV (but empty cols exist); binaries are present. Did not work: copied the binaries with new DDR names, but did not insert vals into the DDR entity JSON files.
3. External with no hash cols or vals; binaries are present. Worked as expected (binaries are hashed, hash vals are inserted into the file JSON, and binaries are copied to new binaries with new DDR names).
4. Internal with no hash vals present in CSV (but empty cols exist). Did not work; same as 2 above.
Tested develop_2.8.7-7 @gjost
1. External with no hash vals present in CSV (but empty cols exist); binaries are present. Import failed with traceback "Exception: No file ingest action for 'nobin, external, noattrs'". Personally this outcome is fine with me.
2. Internal with no hash vals present in CSV (but empty cols exist). Did not work: did not insert vals into file.json, and did not fail as in the external test above.
Traceback from test 1 is attached.
ExternalEmptyHash_traceback.txt
External with hashes present in CSV and external with no hash columns in CSV both worked successfully, as they did yesterday.
Working as intended. Tested with both internal and external files with no hashes. 2.8.7-8
Today I was reviewing the metadata for the binary files I had imported for ddr-densho-379 and I noticed that several important pieces of data weren't present, including the thumb values, sha1, sha256, md5, and size. I've included a sample JSON file. I didn't receive any error messages when importing the files or when I committed or synced the collections. I used my local VM for the import, using both master 2.7.1 and 2.8.0. This problem isn't present in everything I've imported; the missing data corresponds to imports I completed with CSV files in which the thumb, sha1, sha256, md5, and size columns were present but empty.
I thought the importer would create data for the attribute even with an empty value in the CSV.
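One plausible reason empty columns behave differently from absent columns (an observation about Python's `csv` module, not a claim about what ddr-cmdln actually does): an empty cell parses as an empty string, not as a missing key, so a check for the column's *presence* passes even though there is no usable value.

```python
import csv
import io

# Hypothetical two-column example; the identifier is illustrative only.
text = "id,sha1,size\nddr-densho-379-1,,\n"
row = next(csv.DictReader(io.StringIO(text)))

present = 'sha1' in row          # True: the column exists in the header
empty = row['sha1'] == ''        # True: but the cell holds an empty string
usable = bool(row['sha1'].strip())  # False: a truthiness check is needed
print(present, empty, usable)    # True True False
```

This would explain why "no hash columns at all" takes the compute-from-binary path while "columns present but empty" slips through as if values had been supplied.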
filemezzanine1.xlsx ddr-densho-379-1-master-dc0ae09755.txt