core-unit-bioinformatics / reference-container

Build repository for reference container
MIT License
0 stars 0 forks source link

git labels: replicate in template #8

Closed ptrebert closed 1 year ago

ptrebert commented 1 year ago

@svenwillger As just discussed, move this functionality to the snakemake template, and add a function to find the correct name of the remote if multiple remotes are configured (github takes priority over githhu). If only one remote exists, take that one of course.

https://github.com/core-unit-bioinformatics/reference-container/blob/c1b9964c82cfc8dcdee3d999b1bc10f1cf2e850c/workflow/rules/functions.smk#L530

svenwillger commented 1 year ago

I created the new branch git_labels in the repo template-snakemake. I moved the function _collect_git_labels() into the module pyutils.smkand added another function _collect_git_remotes() to collect the url of the remote in the priority order github, git.hhu, origin (If no origin is present an error message "This is not a git repository" will be printed into the "git-url" field. If we want to also extract the clear name (like "origin") of the remote I'll have to adjust the code but that information is user specific anyway. I also included a test rule in snaketest.smk to create a text file that contains the output of the _collect_git_labels function.

ptrebert commented 1 year ago

This looks quite close to something that could be merged, but there a few remaining problems:

    wd = WORKDIR

    remotes = sp.check_output(["git remote -v"], shell=True, cwd=wd).decode().split()
  1. the top-level import of subprocess as sp breaks with the function _rsync, and short hands at the top-level of a Snakefile (!) should be avoided to not clutter the namespace
  2. this will throw an unhandled exception if the git executable is not available, which may be the case in an HPC environment. Maybe this should be checked first to decide whether or not to run the rest of the code. If git is not available, the pipeline can still run, so this should just be issued as a warning.
  3. the reference to workdir seems wrong, because the workdir is (see constants) the snakemake working directory, and not the directory of the checked out pipeline code (= the repo directory)
  4. style: packaging the command string in a single-entry list seems unnecessary, see function _rsync for reference or just use the plain string

Lines 116-125 could be simplified by counting on the fact that the remote name is always the first item in the line after a .split(). The error message in line 127 is misleading, though, because it actually just means that the remote name is something quite unexpected (no rule against naming your remote, say, pizza).

What is also missing is adding this info (just like timestamp and username) to the config dump.

svenwillger commented 1 year ago

In repsonse to the points:

  1. I took out the "as sp" and only work with "subprocess" and not the abriviation in the code.
  2. I added a check if git is an executable and raise a warning if that is not the case.
  3. I changed it to DIR_SNAKEFILE since this should be the directory where the pipeline code should be in
  4. I changed it to use the plain string.

Initially, I had it the way you suggested but then realized that everybody can name the remote, say pizza and maybe use other locations other then git where the remote is stored and then it doesn't work properly or at all to just select the first item after a .split() The way I wrote it now is to not use the remote name but directly the url and look for "github" or "git.hhu" within the url and return the first occurrence and if that doesn't result in a match use the remote name (which by default should be "origin") and use that url. I agree that it looks messy with multiple "try-except" back-to-back but that was my best solution I could come up with.

I changed the text of the error message.

I also added the info into the config dump.

ptrebert commented 1 year ago

I think this can be marked as done, the template contains the git label functionality.