denshoproject / ddr-cmdln

Command-line tools for automating the Densho Digital Repository's various processes.
Other
0 stars 2 forks source link

DDR.ingest.add_file adds files to repo rather than annex #159

Closed GeoffFroh closed 4 years ago

GeoffFroh commented 4 years ago

ddr-cmdln on the 2.8.8 master package improperly adds binary files directly to the collection repo, rather than the annex.

This behavior is a regression from the previous master package.

GeoffFroh commented 4 years ago

See attached notes on affected repo ddr-densho-390 and remediation measures.

201907-ddr-densho-390.txt

gjost commented 4 years ago

I added functions which determine whether a given file belongs to Git or to git-annex: DDR.dvcs.file_in_git_objects and DDR.dvcs.file_in_git_annex. I added a function that gets called at the end of each test in ddr-cmdln/ddr/tests/test_ddrimport.py. As expected, this test failed and reported binaries in the .git/objects/ dir.

gjost commented 4 years ago

I've identified the source of the problem.

The new DDR.ingest.add_files function attempts to enact a bunch of new ddrimport behaviors, and replaces the old add_local_file and add_external_file functions.

DDR.ingest.add_files builds two lists: git_files and annex_files. These are passed to the DDR.ingest.stage_files function, which git adds textual metadata files and git annex adds binary files.

In the previous DDR.ingest.add_local_file and DDR.ingest.add_external_file functions the git_files (text) and annex_files (binary) lists were assembled manually. In the new DDR.ingest.add_file file, annex_files is still assembled manually, but git_files now comes from DDR.models.files.File.save, which returns a list of modified files along with a status bit and an ok/error message. File.save includes all the files associated with a File object, including the binaries. This is the source of the problem.

I've added code to the DDR.ingest.stage_files function that removes files in the annex_files list from the git_files list. It may also work to simply stage the annex files first, but I haven't explored that yet.

gjost commented 4 years ago

I believe the modification to DDR.ingest.stage_files solves the immediate problem, but that function is only used in DDR.ingest. That's probably OK because the application does not (should not) add binary files in any other location. It would be good for all file staging in the application to happen in a single function in DDR.dvcs which would examine each file before staging (perhaps using mimetypes) and puts everything in its right place.

gjost commented 4 years ago

I have a combination of fixes:

I've also added functions to the DDR.dvcs module for checking whether a given file is an annex file or a regular Git file and added checks to ddrimport unit tests.

gjost commented 4 years ago

Fixed in ddr-cmdln commit 888d7c6 for package ddrlocal-develop 2.8.8-1~deb9.