CABLE-LSM / benchcab

Tool for evaluation of CABLE land surface model
https://benchcab.readthedocs.io/en/latest/
Apache License 2.0
2 stars 4 forks source link

`path` set incorrectly in `Repo` classes #243

Closed ccarouge closed 6 months ago

ccarouge commented 9 months ago

Description

If I give a name to a realisation (eg. toto) checking out the main branch of the CABLE repository, the path for the repo should be src/toto instead it is set to src/toto/main. That is using branch 234-clean-parsing-optional.

Then this crashes in add_revision which is about the only place that is using the repo.path variable.

Solution

If we don't want to change too many things, the simpler is in GitRepo.__init__() and SVNRepo.__init(), replace:

        self.path = path / branch if path.is_dir() else path

with

        self.path = path / branch if path == internal.SRC_DIR else path

But considering create_repo is called with:

                repo = create_repo(
                    spec=sub_config.pop("repo"),
                    path=internal.SRC_DIR / name if name else internal.SRC_DIR,
                )

this seems heavy...

Should we instead pass name to repo._init__() and then have:

        self.path = internal.SRC_DIR / name if name else internal.SRC_DIR / branch

How to reproduce

In integration.sh use:

cat > config.yaml << EOL
project: $PROJECT

realisations:
  - repo:
      git:
        branch: main
    name: toto

modules: [
  intel-compiler/2021.1.1,
  netcdf/4.7.4,
  openmpi/4.1.0
]

fluxsite:
  experiment: AU-Tum
  pbs:
    storage:
      - scratch/$PROJECT
EOL

Output

In the PBS log file:

multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/scratch/tm70/ccc561/conda/envs/benchcab-dev/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/scratch/tm70/ccc561/conda/envs/benchcab-dev/lib/python3.9/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/home/561/ccc561/benchcab/benchcab/fluxsite.py", line 275, in run
    self.add_provenance_info(verbose=verbose)
  File "/home/561/ccc561/benchcab/benchcab/fluxsite.py", line 327, in add_provenance_info
    "svn_revision_number": self.model.repo.get_revision(),
  File "/home/561/ccc561/benchcab/benchcab/utils/repo.py", line 117, in get_revision
    repo = git.Repo(self.path)
  File "/scratch/tm70/ccc561/conda/envs/benchcab-dev/lib/python3.9/site-packages/git/repo/base.py", line 224, in __init__
    raise NoSuchPathError(epath)
git.exc.NoSuchPathError: /scratch/tm70/ccc561/benchcab/integration/src/toto/main
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/561/ccc561/.local/bin/benchcab", line 33, in <module>
    sys.exit(load_entry_point('benchcab', 'console_scripts', 'benchcab')())
  File "/home/561/ccc561/benchcab/benchcab/main.py", line 30, in main
    parse_and_dispatch(parser)
  File "/home/561/ccc561/benchcab/benchcab/main.py", line 20, in parse_and_dispatch
    func(**args)
  File "/home/561/ccc561/benchcab/benchcab/benchcab.py", line 251, in fluxsite_run_tasks
    run_tasks_in_parallel(tasks, n_processes=ncpus, verbose=verbose)
  File "/home/561/ccc561/benchcab/benchcab/fluxsite.py", line 366, in run_tasks_in_parallel
    pool.map(run_task, tasks, chunksize=1)
  File "/scratch/tm70/ccc561/conda/envs/benchcab-dev/lib/python3.9/multiprocessing/pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/scratch/tm70/ccc561/conda/envs/benchcab-dev/lib/python3.9/multiprocessing/pool.py", line 771, in get
    raise self._value
git.exc.NoSuchPathError: /scratch/tm70/ccc561/benchcab/integration/src/toto/main
ccarouge commented 9 months ago

@SeanBryan51 any views on this? I found it using the branch for #234 but it is possible it is also occurring with the main branch (not tested). So we need to add a case to our integration test...

SeanBryan51 commented 9 months ago

If I give a name to a realisation (eg. toto) checking out the main branch of the CABLE repository, the path for the repo should be src/toto instead it is set to src/toto/main.

That is strange, is the toto directory is being created before the git clone? That would cause it to clone it in the subdirectory src/toto/main. See here: https://github.com/CABLE-LSM/benchcab/blob/af43dad820811bc0b5e56161957ca374aeeb704e/benchcab/utils/repo.py#L74-L77

I've tested your integration test case for the main branch and it seems to be working as expected:

$ ls $SCRATCH/benchcab/integration/src/toto
documentation  License.md  README.md  requirements.txt  src  src_pop
ccarouge commented 9 months ago

@SeanBryan51 The checkout and compilation work absolutely fine and are using src/toto, but that's because src/toto directory doesn't exist yet.

The only problem I had was in the PBS log file, when it is trying to get the revision/version "number" to write in the netcdf files. It's when calling add_provenance_info from fluxsite.py and get_revision from repo.py. That's the only place where I see a failure. It's because at this point we call _get_models again and the GitRepo() gets recreated. But now, the src/toto directory exists, so it adds the branch name to it when defining path.

SeanBryan51 commented 9 months ago

Yep, I've reproduced the error in the PBS logs. The logic in the constructor is definitely broken for the case you described. I guess one way we could solve this is to check if the path points to an existing git repository, and set the path correctly in that case?

ccarouge commented 9 months ago

So if path is git repo -> path = path, else path = path / branch ?

ccarouge commented 9 months ago

How do you propose to check if it's a git repo? Is there anything in gitpython? Or just checking for a .git/ subfolder?

SeanBryan51 commented 9 months ago

This does also bring up an issue I have with the design of the Repo object which is that the state of the object does not necessarily reflect the state of the cloned repository on the file system (this issue is an example of this). Perhaps a better approach is to do what gitpython does in the sense that a git.Repo object can only be instantiated if you give it a path to an existing git repository on the file system, or you create one via git.Repo.clone_from() which clones the repository, that way the state of the object always reflects the state of the git repository on the file system.

SeanBryan51 commented 9 months ago

So if path is git repo -> path = path, else path = path / branch ?

Something like that yep.

SeanBryan51 commented 9 months ago

Maybe relevant: https://github.com/gitpython-developers/GitPython/issues/715