Closed GeoffFroh closed 4 years ago
See attached notes on affected repo ddr-densho-390
and remediation measures.
I added functions which determine whether a given file belongs to Git or to git-annex: DDR.dvcs.file_in_git_objects
and DDR.dvcs.file_in_git_annex
. I added a function that gets called at the end of each test in ddr-cmdln/ddr/tests/test_ddrimport.py
. As expected, this test failed and reported binaries in the .git/objects/
dir.
I've identified the source of the problem.
The new DDR.ingest.add_files
function attempts to enact a bunch of new ddrimport
behaviors, and replaces the old add_local_file
and add_external_file
functions.
DDR.ingest.add_files
builds two lists: git_files
and annex_files
.
These are passed to the DDR.ingest.stage_files
function, which git add
s textual metadata files and git annex add
s binary files.
In the previous DDR.ingest.add_local_file
and DDR.ingest.add_external_file
functions the git_files
(text) and annex_files
(binary) lists were assembled manually.
In the new DDR.ingest.add_file
file, annex_files
is still assembled manually, but git_files
now comes from DDR.models.files.File.save
, which returns a list of modified files along with a status bit and an ok/error message.
File.save
includes all the files associated with a File
object, including the binaries.
This is the source of the problem.
I've added code to the DDR.ingest.stage_files
function that removes files in the annex_files
list from the git_files
list. It may also work to simply stage the annex files first, but I haven't explored that yet.
I believe the modification to DDR.ingest.stage_files
solves the immediate problem, but that function is only used in DDR.ingest
. That's probably OK because the application does not (should not) add binary files in any other location. It would be good for all file staging in the application to happen in a single function in DDR.dvcs
which would examine each file before staging (perhaps using mimetypes) and puts everything in its right place.
I have a combination of fixes:
DDR.ingest.stage_files
: Remove files from git_files
list if they're present in annex_files
.DDR.ingest.stage_files
: Stage annex files before staging regular git files.DDR.models.files.File.save
: Do not add binary files to the updated_files
list returned. This list is used in DDR.ingest.add_file
as the source for git_files
.
These changes are redundant: staging binary files to annex first takes care of the problem in the absence of the other changes.I've also added functions to the DDR.dvcs
module for checking whether a given file is an annex file or a regular Git file and added checks to ddrimport
unit tests.
git annex lookupkey
.git hash-object
then uses the hash to construct a path in .git/objects
. This method works when files are first committed, but breaks once Git garbage collection puts them in a packfile. I'm currently researching a better method for this.Fixed in ddr-cmdln commit 888d7c6 for package ddrlocal-develop 2.8.8-1~deb9.
ddr-cmdln
on the 2.8.8 master package improperly adds binary files directly to the collection repo, rather than the annex.This behavior is a regression from the previous master package.