denshoproject / ddr-cmdln

Command-line tools for automating the Densho Digital Repository's various processes.

ddrimport file update #226

Closed sarabeckman closed 1 year ago

sarabeckman commented 1 year ago

Part of the oral history workflow is using the ddrimport file update feature to include access images for the external files. I export the file CSV, add an "access_path" column for the signature image, then import the CSV.
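
For reference, the CSV edit described above could be scripted roughly as follows. This is only a sketch: the "id" column name, the sample file ID, and the helper names are assumptions for illustration, not the actual export format or anything provided by ddr-cmdln.

```python
import csv
import io


def add_access_path(rows, access_paths):
    """Return copies of the rows with an "access_path" value added.

    access_paths maps a file ID to the path of its access image.
    The "id" column name is an assumption about the exported CSV.
    """
    out = []
    for row in rows:
        new_row = dict(row)
        new_row["access_path"] = access_paths.get(row["id"], "")
        out.append(new_row)
    return out


def rewrite_csv(text, access_paths):
    """Read CSV text and return new CSV text with an access_path column."""
    reader = csv.DictReader(io.StringIO(text))
    rows = add_access_path(list(reader), access_paths)
    fieldnames = list(reader.fieldnames or [])
    if "access_path" not in fieldnames:
        fieldnames.append("access_path")
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Rows with no entry in the mapping get an empty access_path, so the column is always present in the output.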

Once again, the new validation step was triggered when the original files weren't in the repository.

I also got a new error message once I tried to run the import command with the files in the repository.

ddr-ajah-8-accessfiles.csv ddrimportfileupdate

gjost commented 1 year ago

I believe the fix for #225 also fixed this one, but I fixed a couple typos as well. Fixed in ddr-cmdln master branch commit 74981bb and pushed.

sarabeckman commented 1 year ago

Tested on kyuzo. Files are located at /media/qnfs/kinkura/working/ddr-ajah-8/image

(cmdln) ddr@kyuzo:/media/qnfs/kinkura/working/ddr-ajah-8/image$ ddrimport file ./ddr-ajah-8-testaccessfiles.csv /media/qnfs/kinkura/gold/ddr-ajah-8
2023-02-08 16:31:47,299 DEBUG    <DDR.identifier.Identifier collection:ddr-ajah-8>
2023-02-08 16:31:47,300 INFO     Checking CSV file
2023-02-08 16:31:47,300 INFO     12 rows
2023-02-08 16:31:47,302 DEBUG    Starting new HTTPS connection (1): partner.densho.org:443
2023-02-08 16:31:47,530 DEBUG    https://partner.densho.org:443 "GET /vocab/api/0.2/index.json HTTP/1.1" 200 None
2023-02-08 16:31:47,548 DEBUG    getting vocab: https://partner.densho.org/vocab/api/0.2/genre.json
2023-02-08 16:31:47,549 DEBUG    getting vocab: https://partner.densho.org/vocab/api/0.2/language.json
2023-02-08 16:31:47,550 DEBUG    Starting new HTTPS connection (1): partner.densho.org:443
2023-02-08 16:31:47,551 DEBUG    getting vocab: https://partner.densho.org/vocab/api/0.2/facility.json
2023-02-08 16:31:47,551 DEBUG    Starting new HTTPS connection (1): partner.densho.org:443
2023-02-08 16:31:47,553 DEBUG    getting vocab: https://partner.densho.org/vocab/api/0.2/format.json
2023-02-08 16:31:47,553 DEBUG    Starting new HTTPS connection (1): partner.densho.org:443
2023-02-08 16:31:47,554 DEBUG    Starting new HTTPS connection (1): partner.densho.org:443
2023-02-08 16:31:47,555 DEBUG    getting vocab: https://partner.densho.org/vocab/api/0.2/rights.json
2023-02-08 16:31:47,557 DEBUG    Starting new HTTPS connection (1): partner.densho.org:443
2023-02-08 16:31:47,558 DEBUG    getting vocab: https://partner.densho.org/vocab/api/0.2/public.json
2023-02-08 16:31:47,560 DEBUG    Starting new HTTPS connection (1): partner.densho.org:443
2023-02-08 16:31:47,560 DEBUG    getting vocab: https://partner.densho.org/vocab/api/0.2/status.json
2023-02-08 16:31:47,562 DEBUG    Starting new HTTPS connection (1): partner.densho.org:443
2023-02-08 16:31:47,577 DEBUG    getting vocab: https://partner.densho.org/vocab/api/0.2/topics.json
2023-02-08 16:31:47,579 DEBUG    Starting new HTTPS connection (1): partner.densho.org:443
2023-02-08 16:31:47,600 DEBUG    https://partner.densho.org:443 "GET /vocab/api/0.2/language.json HTTP/1.1" 200 None
2023-02-08 16:31:47,608 DEBUG    https://partner.densho.org:443 "GET /vocab/api/0.2/format.json HTTP/1.1" 200 None
2023-02-08 16:31:47,617 DEBUG    https://partner.densho.org:443 "GET /vocab/api/0.2/public.json HTTP/1.1" 200 None
2023-02-08 16:31:47,619 DEBUG    https://partner.densho.org:443 "GET /vocab/api/0.2/rights.json HTTP/1.1" 200 None
2023-02-08 16:31:47,624 DEBUG    https://partner.densho.org:443 "GET /vocab/api/0.2/genre.json HTTP/1.1" 200 None
2023-02-08 16:31:47,626 DEBUG    https://partner.densho.org:443 "GET /vocab/api/0.2/facility.json HTTP/1.1" 200 None
2023-02-08 16:31:47,638 DEBUG    https://partner.densho.org:443 "GET /vocab/api/0.2/status.json HTTP/1.1" 200 None
2023-02-08 16:31:47,642 DEBUG    https://partner.densho.org:443 "GET /vocab/api/0.2/topics.json HTTP/1.1" 200 None
2023-02-08 16:31:47,683 INFO     Validating headers
2023-02-08 16:31:47,683 INFO     Validating rows
2023-02-08 16:31:47,687 INFO     Validating file imports
2023-02-08 16:31:47,687 INFO     Checking repository
2023-02-08 16:31:47,695 INFO     <git.repo.base.Repo '/media/qnfs/kinkura/gold/ddr-ajah-8/.git'>
2023-02-08 16:31:47,695 DEBUG    Popen(['git', 'diff', '--cached', '--name-only'], cwd=/media/qnfs/kinkura/gold/ddr-ajah-8, universal_newlines=False, shell=None, istream=None)
Traceback (most recent call last):
  File "/opt/ddr-cmdln/venv/cmdln/bin/ddrimport", line 33, in <module>
    sys.exit(load_entry_point('ddr-cmdln==5.6.1', 'console_scripts', 'ddrimport')())
  File "/opt/ddr-cmdln/venv/cmdln/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/opt/ddr-cmdln/venv/cmdln/lib/python3.9/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/opt/ddr-cmdln/venv/cmdln/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/ddr-cmdln/venv/cmdln/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/ddr-cmdln/venv/cmdln/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/opt/ddr-cmdln/venv/cmdln/lib/python3.9/site-packages/ddr_cmdln-5.6.1-py3.9.egg/DDR/cli/ddrimport.py", line 207, in file
    run_checks(
  File "/opt/ddr-cmdln/venv/cmdln/lib/python3.9/site-packages/ddr_cmdln-5.6.1-py3.9.egg/DDR/cli/ddrimport.py", line 304, in run_checks
    staged,modified = batch.Checker.check_repository(ci)
  File "/opt/ddr-cmdln/venv/cmdln/lib/python3.9/site-packages/ddr_cmdln-5.6.1-py3.9.egg/DDR/batch.py", line 182, in check_repository
    return dvcs.list_staged(repo), dvcs.list_modified(repo)
  File "/opt/ddr-cmdln/venv/cmdln/lib/python3.9/site-packages/ddr_cmdln-5.6.1-py3.9.egg/DDR/dvcs.py", line 282, in list_staged
    stdout = repo.git.diff('--cached', '--name-only')
  File "/opt/ddr-cmdln/venv/cmdln/lib/python3.9/site-packages/git/cmd.py", line 741, in <lambda>
    return lambda *args, **kwargs: self._call_process(name, *args, **kwargs)
  File "/opt/ddr-cmdln/venv/cmdln/lib/python3.9/site-packages/git/cmd.py", line 1315, in _call_process
    return self.execute(call, **exec_kwargs)
  File "/opt/ddr-cmdln/venv/cmdln/lib/python3.9/site-packages/git/cmd.py", line 1109, in execute
    raise GitCommandError(redacted_command, status, stderr_value, stdout_value)
git.exc.GitCommandError: Cmd('git') failed due to: exit code(129)
  cmdline: git diff --cached --name-only
  stderr: 'error: unknown option `cached'
usage: git diff --no-index [<options>] <path> <path>

Diff output format options
    -p, --patch           generate patch
    -s, --no-patch        suppress diff output
    -u                    generate patch
    -U, --unified[=<n>]   generate diffs with <n> lines context
    -W, --function-context
                          generate diffs with <n> lines context
    --raw                 generate the diff in raw format
    --patch-with-raw      synonym for '-p --raw'
    --patch-with-stat     synonym for '-p --stat'
    --numstat             machine friendly --stat
    --shortstat           output only the last line of --stat
    -X, --dirstat[=<param1,param2>...]
                          output the distribution of relative amount of changes for each sub-directory
    --cumulative          synonym for --dirstat=cumulative
    --dirstat-by-file[=<param1,param2>...]
                          synonym for --dirstat=files,param1,param2...
    --check               warn if changes introduce conflict markers or whitespace errors
    --summary             condensed summary such as creations, renames and mode changes
    --name-only           show only names of changed files
    --name-status         show only names and status of changed files
    --stat[=<width>[,<name-width>[,<count>]]]
                          generate diffstat
    --stat-width <width>  generate diffstat with a given width
    --stat-name-width <width>
                          generate diffstat with a given name width
    --stat-graph-width <width>
                          generate diffstat with a given graph width
    --stat-count <count>  generate diffstat with limited lines
    --compact-summary     generate compact summary in diffstat
    --binary              output a binary diff that can be applied
    --full-index          show full pre- and post-image object names on the "index" lines
    --color[=<when>]      show colored diff
    --ws-error-highlight <kind>
                          highlight whitespace errors in the 'context', 'old' or 'new' lines in the diff
    -z                    do not munge pathnames and use NULs as output field terminators in --raw or --numstat
    --abbrev[=<n>]        use <n> digits to display object names
    --src-prefix <prefix>
                          show the given source prefix instead of "a/"
    --dst-prefix <prefix>
                          show the given destination prefix instead of "b/"
    --line-prefix <prefix>
                          prepend an additional prefix to every line of output
    --no-prefix           do not show any source or destination prefix
    --inter-hunk-context <n>
                          show context between diff hunks up to the specified number of lines
    --output-indicator-new <char>
                          specify the character to indicate a new line instead of '+'
    --output-indicator-old <char>
                          specify the character to indicate an old line instead of '-'
    --output-indicator-context <char>
                          specify the character to indicate a context instead of ' '

Diff rename options
    -B, --break-rewrites[=<n>[/<m>]]
                          break complete rewrite changes into pairs of delete and create
    -M, --find-renames[=<n>]
                          detect renames
    -D, --irreversible-delete
                          omit the preimage for deletes
    -C, --find-copies[=<n>]
                          detect copies
    --find-copies-harder  use unmodified files as source to find copies
    --no-renames          disable rename detection
    --rename-empty        use empty blobs as rename source
    --follow              continue listing the history of a file beyond renames
    -l <n>                prevent rename/copy detection if the number of rename/copy targets exceeds given limit

Diff algorithm options
    --minimal             produce the smallest possible diff
    -w, --ignore-all-space
                          ignore whitespace when comparing lines
    -b, --ignore-space-change
                          ignore changes in amount of whitespace
    --ignore-space-at-eol
                          ignore changes in whitespace at EOL
    --ignore-cr-at-eol    ignore carrier-return at the end of line
    --ignore-blank-lines  ignore changes whose lines are all blank
    -I, --ignore-matching-lines <regex>
                          ignore changes whose all lines match <regex>
    --indent-heuristic    heuristic to shift diff hunk boundaries for easy reading
    --patience            generate diff using the "patience diff" algorithm
    --histogram           generate diff using the "histogram diff" algorithm
    --diff-algorithm <algorithm>
                          choose a diff algorithm
    --anchored <text>     generate diff using the "anchored diff" algorithm
    --word-diff[=<mode>]  show word diff, using <mode> to delimit changed words
    --word-diff-regex <regex>
                          use <regex> to decide what a word is
    --color-words[=<regex>]
                          equivalent to --word-diff=color --word-diff-regex=<regex>
    --color-moved[=<mode>]
                          moved lines of code are colored differently
    --color-moved-ws <mode>
                          how white spaces are ignored in --color-moved

Other diff options
    --relative[=<prefix>]
                          when run from subdir, exclude changes outside and show relative paths
    -a, --text            treat all files as text
    -R                    swap two inputs, reverse the diff
    --exit-code           exit with 1 if there were differences, 0 otherwise
    --quiet               disable all output of the program
    --ext-diff            allow an external diff helper to be executed
    --textconv            run external text conversion filters when comparing binary files
    --ignore-submodules[=<when>]
                          ignore changes to submodules in the diff generation
    --submodule[=<format>]
                          specify how differences in submodules are shown
    --ita-invisible-in-index
                          hide 'git add -N' entries from the index
    --ita-visible-in-index
                          treat 'git add -N' entries as real in the index
    -S <string>           look for differences that change the number of occurrences of the specified string
    -G <regex>            look for differences that change the number of occurrences of the specified regex
    --pickaxe-all         show all changes in the changeset with -S or -G
    --pickaxe-regex       treat <string> in -S as extended POSIX regular expression
    -O <file>             control the order in which files appear in the output
    --find-object <object-id>
                          look for differences that change the number of occurrences of the specified object
    --diff-filter [(A|C|D|M|R|T|U|X|B)...[*]]
                          select files by diff type
    --output <file>       Output to a specific file

gjost commented 1 year ago

That's an interesting one. Guess it's not fixed after all...

gjost commented 1 year ago

The git.exc.GitCommandError is because /media/qnfs/kinkura/gold was owned by ansible.ansible for some reason. The error above doesn't say anything about permissions but it goes away when you chown -R ddr.ddr the repo.
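
Since the git error gives no hint about ownership, a pre-flight check along these lines could catch the problem before any git command runs. This is a sketch only: find_foreign_owned is a hypothetical helper, not part of ddr-cmdln, and it assumes a POSIX system where os.getuid() is available.

```python
import os


def find_foreign_owned(repo_path):
    """Return paths under repo_path that are not owned by the current user.

    Hypothetical helper for illustration; walks the whole tree, so it
    may be slow on a large repository.
    """
    uid = os.getuid()
    foreign = []
    for dirpath, dirnames, filenames in os.walk(repo_path):
        # Check the directory itself, then each file inside it.
        candidates = [dirpath] + [os.path.join(dirpath, f) for f in filenames]
        for path in candidates:
            # lstat so that symlinks are checked themselves, not their targets.
            if os.lstat(path).st_uid != uid:
                foreign.append(path)
    return foreign
```

An empty list means everything under the repository belongs to the user running the import; anything else is a candidate for the chown mentioned above.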

gjost commented 1 year ago

This is fixed on the ddr-cmdln develop branch as of commit a403f7ac8c.

When loading objects from CSV, the Identifier.basepath is not set. I believe my thinking was that the objects are in a working directory and not yet in their final location, i.e. the repository path. DDR.models.common.load_csv compares field values in rowd objects (objects built from a row in the CSV) with existing ones in order to mark field values that have been modified. Identifier objects are considered non-equal if their path_abs() values differ.

In this case, DDR.identifier.MissingBasepathException is triggered because Identifier.path_abs() requires a basepath value, which has not been set on the rowd object's Identifier. The fix is to modify DDR.models.common.load_csv to set a temporary basepath just before doing this comparison. (I also added a note to the DDR.identifier.Identifier.__eq__ documentation explaining what DDR.models.common.load_csv is doing.)
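
The failure and the workaround can be illustrated with a toy model. The classes below are simplified stand-ins, not the real DDR.identifier code. identifiers_match is a hypothetical name, and restoring the original basepath after the comparison is an assumption about how a "temporary" basepath might be handled.

```python
import os


class MissingBasepathException(Exception):
    """Stand-in for DDR.identifier.MissingBasepathException."""


class Identifier:
    """Toy stand-in for DDR.identifier.Identifier."""

    def __init__(self, id, basepath=None):
        self.id = id
        self.basepath = basepath

    def path_abs(self):
        # An absolute path cannot be built without a basepath.
        if not self.basepath:
            raise MissingBasepathException(self.id)
        return os.path.join(self.basepath, self.id)

    def __eq__(self, other):
        # Equality is defined in terms of absolute paths.
        return self.path_abs() == other.path_abs()


def identifiers_match(existing, from_csv):
    """Compare two identifiers, lending the CSV one a temporary basepath."""
    original = from_csv.basepath
    if not from_csv.basepath:
        from_csv.basepath = existing.basepath
    try:
        return existing == from_csv
    finally:
        # Put the CSV identifier back the way it was.
        from_csv.basepath = original
```

Comparing Identifier("ddr-test-1", "/tmp/repo") with Identifier("ddr-test-1") directly raises MissingBasepathException in this model, while identifiers_match returns True for the same pair and leaves the second object's basepath unset.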

A better fix would be to modify DDR.identifier.Identifier.__eq__ to accept an ignore_basepath argument, but that did not work in testing.