datalad / datalad-ukbiobank

Resources for working with UKBiobank as a DataLad dataset
MIT License
6 stars 12 forks source link

BF: introduce helper to ensure newline characters in .gitattributes #85

Closed adswa closed 2 years ago

adswa commented 2 years ago

Hopefully a fix for #82

Within datalad core, https://github.com/datalad/datalad/commit/fa96e6a introduced a newline to .gitattributes largefile configurations written by datalad at repo initialization. This change has been a part of the code base since version 0.15, if I am not mistaken. The presence of the newline causes a mismatch between what should be identical .gitattributes files across two branches, ultimately causing a merge conflict and resulting in https://github.com/datalad/datalad-ukbiobank/issues/82. This is because the .gitattributes file of one branch is written using pathlibs write_text method with input from a cat-file -p command on the main branch. As write_text() does not support addition of a newline character until Python 3.10, the most feasible way to ensure the presence of the newline is to append it if it isn't there via string operations. In this change, this is done with the introduction of a helper function. As this code base does not have any utils.py and the helper is small, I added it directly into init.py - if there are more helpers introduced, it would probably make sense to create a utils.py for them eventually.

Because the helper causes the same merge conflict that is currently happening with datalad versions 0.15 and higher for datalad versions lower than 0.15, I'm (lazily) bumping the minimal version requirement in setup.cfg to 0.15. I guess there would be ways to cleverly sense which DataLad version we're dealing with, but I think its reasonable and cheaper to expect recent-ish datalad versions.

adswa commented 2 years ago

Closing, since this fix is wrong. The failures reported in #82 and the wrong fix and subsequent failure in this PR revealed that the output of any repo.call_git() call is stripped of a last lines trailing slash. This seems to have been introduced by you in https://github.com/datalad/datalad/pull/6278 (ping @christian-monch waving for attention :wave: ) - I'm not familiar enough with this code base to touch it confidently, but this patch in core fixes it for me and it PRed as WIP in datalad/datalad#6754:

diff --git a/datalad/dataset/gitrepo.py b/datalad/dataset/gitrepo.py
index a1e32b450..d7b831bca 100644
--- a/datalad/dataset/gitrepo.py
+++ b/datalad/dataset/gitrepo.py
@@ -429,12 +429,12 @@ class GitRepo(RepoInterface, metaclass=PathBasedFlyweight):
         ------
         CommandError if the call exits with a non-zero status.
         """
-        return "\n".join(
-            self.call_git_items_(args,
-                                 files,
-                                 expect_stderr=expect_stderr,
-                                 expect_fail=expect_fail,
-                                 read_only=read_only))
+
+        return self._call_git(args,
+                 files,
+                 expect_stderr=expect_stderr,
+                 expect_fail=expect_fail,
+                 read_only=read_only)[0]

     def call_git_items_(self,
                         args,

Still, having someone knowledgable look over it would be cool. There were also a few comments by @yarikoptic in the chat:

BTW _call_git docstring needs fixing:

        The parameters, return value, and raised exceptions match those
        documented for `call_git`.

and

that is odd -- worth checking if call_git atm just passes stderr to screen?

yarikoptic commented 2 years ago

FWIW my 1c: I would not call this fix "wrong" -- it just ensures our assumption that .gitattributes files should have trailing new line. Waiting for a fix in datalad core might take a bit, and when it is introduced, it should not change this assumption etc. The only cons is "duplicating ensuring assumptions" so if in the future we decide that there must be no trailing new line -- we have a problem. But how likely that to happen in this case? very unlikely imho thus I would have just did this workaround fix here.