artefactual / automation-tools

Tools to aid automation of Archivematica and AtoM.
GNU Affero General Public License v3.0
46 stars 33 forks source link

Fix premis:originalName parsing, refs #128 #129

Closed djjuhasz closed 2 years ago

djjuhasz commented 2 years ago

Use os.path.basename() to get file names from the <premis:originalName> path.

The previously assumed path of %transferDirectory%objects is not valid for BagIt transfers, which lead to truncating the first three characters of the filename in the DIP objects zip file.

tw4l commented 2 years ago

@djjuhasz Based on the paths expected in the tests, it looks like we don't always want just the basename for files in an AIP, but the full relative path from the root directory (which we're currently assuming is always %transferDirectory%objects, but may actually be %transferDirectory%data if I'm understanding correctly).

It might be worth looking at os.path.relpath or check explicitly for the full strings we want to remove and if either is found, splice them from the beginning of the string.

djjuhasz commented 2 years ago

@tw4l I've updated the code to return the relative paths from the root directory.

djjuhasz commented 2 years ago

@tw4l third time is the charm? 😆

djjuhasz commented 2 years ago

@tw4l nevermind, the unit tests failed in Python 2.7 😢

tw4l commented 2 years ago

@djjuhasz It looks likeunittest.TestCase.assertLogs is new in Python 3.4, which would explain the failing test for 2.7: https://docs.python.org/3/library/unittest.html?highlight=unittest#unittest.TestCase.assertLogs. You could change the test case to verify that the function returns None for an invalid path?

djjuhasz commented 2 years ago

@tw4l four to go? 😜

tw4l commented 2 years ago

@djjuhasz LGTM!

No need to change this, but looking at the tests again it occurred to me that you might be interested in pytest's parametrize functionality.

It enables combining tests like these and testing different combinations of inputs and outputs, so that these three tests:

 def test_get_original_relpath_objects_dir(self):
    path = "%transferDirectory%objects/folder1/file5.txt"

    assert create_dip.get_original_relpath(path) == "folder1/file5.txt"

def test_get_original_relpath_data_dir(self):
    path = "%transferDirectory%data/folder1/file5.txt"

    assert create_dip.get_original_relpath(path) == "folder1/file5.txt"

def test_get_original_relpath_warn_invalid_prefix(self):
    path = "%transferDirectory%datas/folder1/file5.txt"

    assert create_dip.get_original_relpath(path) is None

can become, e.g.:

import pytest

...

@pytest.mark.parametrize(
    "path,expected_return",
    [
        # %transferDirectory%objects prefix
        (
            "%transferDirectory%objects/folder1/file5.txt",
            "folder1/file5.txt"
        ),
        # %transferDirectory%data prefix
        (
            "%transferDirectory%data/folder1/file5.txt",
            "folder1/file5.txt"
        ),
        # Invalid prefix
        (
            "%transferDirectory%datas/folder1/file5.txt",
            None
        ),
    ],
)
def test_get_original_relpath(path, expected_return):
    assert create_dip.get_original_relpath(path) == expected_return
sevein commented 2 years ago

2ee73fa9cc43ab0b6d6395bc4aa8173215935c36 and e5a9f8077dfa19c87897d2f41610636d4faaf9b8