ACCESS-NRI / accessdev-Trac-archive

Archive accessdev Trac contents as issues

Move prebuilds out of ~access #167

Open penguian opened 9 years ago

penguian commented 9 years ago

keyword_TIWG | by saw562@nci.org.au


The ~access directory is currently managed using a subversion repository. This makes it difficult to add new prebuilds.

Look into moving prebuilds to a different location (/g/data1/access?)


Issue migrated from trac:167 at 2024-01-31 18:10:32 +1100

penguian commented 9 years ago

@scott.wales@bom.gov.au changed status from new to accepted

penguian commented 9 years ago

@scott.wales@bom.gov.au changed owner from `` to `saw562`

penguian commented 9 years ago

@scott.wales@bom.gov.au commented


Blocked by #166

penguian commented 9 years ago

@scott.wales@bom.gov.au commented


Now that /g/data1/access is mounted I think that's a good location for the prebuilds, since it can be seen from both machines. Probably with a directory structure like

/g/data1/access/prebuilds/vn10.1/vn10.1_safe_noomp

I'll run some tests first to check that this works.

penguian commented 9 years ago

@scott.wales@bom.gov.au commented


Related UM ticket https://code.metoffice.gov.uk/trac/um/ticket/479

penguian commented 9 years ago

@martin.dix@anu.edu.au commented


See https://accessdev.nci.org.au/trac/wiki/access/RoseSuitePrebuilds for some progress and issues.

Creating prebuilds from a suite is going to use the normal $HOME/cylc-run/SUITE/share structure, though perhaps the suite could override this. Moving a prebuild directory is tricky at the moment; see https://github.com/metomi/fcm/issues/185.

Also, we'll likely need separate directories for the extract and build steps, otherwise fcm clobbers its own state; see https://github.com/metomi/fcm/issues/126.

penguian commented 9 years ago

@scott.wales@bom.gov.au commented


FCM lets you configure the build directory using the --directory flag, which can be added to the ROSE_TASK_OPTIONS environment variable in the Cylc task.

Trying this out, however, I get errors like

[FAIL] /g/data/access/prebuild/vn10.1/safe_noomp/.fcm-make/cache/extract/jules/0: cannot create
[FAIL] {'e' => 'svn: E000116: Can\'t create temporary file from template \'/g/data/access/prebuild/vn10.1/safe_noomp/.fcm-make/cache/extract/jules/0/vm/svn-XXXXXX\': Stale file handle

It seems that Subversion doesn't like running on the NFS mount.

The failing command is

svn export https://130.56.244.76/svn/jules/main/trunk@692 /g/data/access/prebuild/vn10.1/safe_noomp/.fcm-make/cache/extract/jules/0

penguian commented 9 years ago

@martin.dix@anu.edu.au commented


Trying to check out rose-meta to /g/data/access also failed in the same way:

% fcm checkout fcm:um.xm/trunk/rose-meta         
A    rose-meta/um-fcm-make
.....
A    rose-meta/um-fcm-make/versions.py
svn: E000116: Can't create temporary file from template '/g/data1/access/rose-meta/.svn/tmp/svn-XXXXXX': Stale file handle
[FAIL] svn checkout https://130.56.244.76/svn/um/main/trunk/rose-meta # rc=1

penguian commented 9 years ago

@chris.allen@anu.edu.au commented


I tried:

svn checkout svn+ssh://accessdev.nci.org.au/home/access-svn/roses_test_svn test

It fails in random places, either with the same error or with other problems related to filesystem metadata. For example:

21532 11:43:07 open("/g/data1/access.dev/tmp/cma900/test3/.svn/tmp/svn-oUeTOX", O_RDWR|O_CREAT|O_EXCL, 0600 <unfinished ...>
21532 11:43:07 <... open resumed> )     = -1 ESTALE (Stale file handle)

And this is a sign that things aren't well on /g/data:

$ mkdir tmp
$ cd tmp
bash: cd: tmp: Permission denied
(a few seconds later)
$ cd tmp
$

Interestingly, so far I've got the checkout to work every time if I do it in a directory tree which doesn't have any ACLs.

Anyway, there do seem to be some problems with /g/data at the moment, so until those are sorted out I wouldn't rely on anything you're seeing. I'll report what I've observed to the storage team.

penguian commented 9 years ago

@scott.wales@bom.gov.au changed keywords from `` to `TIWG`

penguian commented 9 years ago

@scott.wales@bom.gov.au commented


It looks like today's maintenance hasn't fixed this issue; I'm still getting stale file handle errors.

penguian commented 9 years ago

@chris.allen@anu.edu.au commented


OK, I'll chase it up.

penguian commented 9 years ago

@chris.allen@anu.edu.au commented


I managed to reproduce the problem with a simple Python script (it turns out that we also see other errors without ACLs). The storage team believe they have located a potential bug and are now waiting on some Lustre patches from upstream.
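
The script itself was not attached to the ticket. As a rough illustration only, a probe along the following lines exercises the two operations that failed earlier in this thread: the exclusive-create open() seen in the strace output, and making a directory and entering it straight away. The function names and the number of attempts are hypothetical, and this is a sketch of the idea rather than the script that was actually used.

```python
import errno
import os


def _error_name(exc):
    """Return a symbolic errno name such as 'ESTALE' for an OSError."""
    return errno.errorcode.get(exc.errno, str(exc.errno))


def probe_exclusive_create(directory, attempts=100):
    """Create and delete files using the flags seen in the strace output.

    Returns a list of (attempt, errno name) pairs for every failure.
    """
    failures = []
    for i in range(attempts):
        path = os.path.join(directory, "probe-%d-%d" % (os.getpid(), i))
        try:
            # Same flags and mode as svn's temporary-file creation.
            fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_EXCL, 0o600)
            os.close(fd)
            os.unlink(path)
        except OSError as exc:
            failures.append((i, _error_name(exc)))
    return failures


def probe_mkdir_then_enter(directory, attempts=100):
    """Make a directory and immediately chdir into it, as in 'mkdir tmp; cd tmp'.

    Returns a list of (attempt, errno name) pairs for every failure.
    """
    failures = []
    start = os.getcwd()
    for i in range(attempts):
        path = os.path.join(directory, "probe-dir-%d-%d" % (os.getpid(), i))
        try:
            os.mkdir(path)
            os.chdir(path)
            os.chdir(start)
            os.rmdir(path)
        except OSError as exc:
            failures.append((i, _error_name(exc)))
            os.chdir(start)  # make sure later attempts start from a known place
    return failures
```

On a healthy filesystem both functions should return an empty list. On an affected mount one would expect entries such as ESTALE or EACCES, although how often they appear is likely to vary from run to run.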

penguian commented 9 years ago

@scott.wales@bom.gov.au commented


Great, thanks for the update

penguian commented 9 years ago

@chris.allen@anu.edu.au commented


I'm still seeing errors after today's maintenance - reported to storage team.

penguian commented 9 years ago

@martin.dix@anu.edu.au commented


FCM 2015.05.0 adds named builds so that the extract and build steps can run in the same directory without interfering with each other.

There are now two possible ways of setting up the prebuilds in /g/data/access.

1 With the new fcm, the prebuild creation suite.rc has

{% set prebuild_path = '/g/data/access/prebuilds/vn10.2/fieldcalc' %}

    [[fcm_make]]
        [[[environment]]]
            ROSE_TASK_OPTIONS = --directory={{prebuild_path}} mirror.target={{prebuild_path}}

    [[fcm_make2]]
        [[[environment]]]
            ROSE_TASK_OPTIONS = --directory={{prebuild_path}} --name=2

and app/fcm_make/file/fcm_make.cfg has

mirror.prop{config-file.name} = 2

The job using the prebuild has

    [[fcm_make]]
        [[[environment]]]
            PREBUILD = {{prebuild_path}}

    [[fcm_make2]]
        [[[environment]]]
            PREBUILD = {{prebuild_path}}
            ROSE_TASK_OPTIONS = --name=2

Suites au-aa360 and au-aa361 are an example that builds just the UM fieldcalc utility.

2 Alternatively, use separate sub-directories. The creation suite uses

{% set prebuild_path = '/g/data/access/prebuilds/vn10.2/fieldcalc_alt' %}
{% set extract_prebuild = prebuild_path + '/extract' %}
{% set build_prebuild   = prebuild_path + '/build'   %}

    [[fcm_make]]
        [[[environment]]]
            ROSE_TASK_OPTIONS = --directory={{extract_prebuild}} mirror.target={{build_prebuild}}

    [[fcm_make2]]
        [[[environment]]]
            ROSE_TASK_OPTIONS = --directory={{build_prebuild}}

and the job using the prebuild has

    [[fcm_make]]
        [[[environment]]]
            PREBUILD = {{extract_prebuild}}

    [[fcm_make2]]
        [[[environment]]]
            PREBUILD = {{build_prebuild}}

Suites au-aa362 and au-aa363 are an example of this.

I don't have a strong preference for which style we use. In each case the job using the prebuild has to set something different in the two fcm_make tasks.

PS I had a couple of errors doing the extraction to /g/data/access, but it seems better than it was a few months ago.
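
To compare the two layouts side by side, the ROSE_TASK_OPTIONS values from the creation suites above can be written as a small helper. This is only a sketch for illustration: the function name and the layout labels "named" and "split" are made up here, and the option strings are copied from the two suite.rc fragments in this comment.

```python
def rose_task_options(prebuild_path, layout="named"):
    """Return ROSE_TASK_OPTIONS for the fcm_make and fcm_make2 creation tasks.

    layout="named": option 1, both steps share one directory and the
                    second step is given --name=2.
    layout="split": option 2, extract and build use separate sub-directories.
    """
    if layout == "named":
        return {
            "fcm_make": "--directory=%s mirror.target=%s"
            % (prebuild_path, prebuild_path),
            "fcm_make2": "--directory=%s --name=2" % prebuild_path,
        }
    if layout == "split":
        extract = prebuild_path + "/extract"
        build = prebuild_path + "/build"
        return {
            "fcm_make": "--directory=%s mirror.target=%s" % (extract, build),
            "fcm_make2": "--directory=%s" % build,
        }
    raise ValueError("unknown layout: %r" % (layout,))


if __name__ == "__main__":
    base = "/g/data/access/prebuilds/vn10.2/fieldcalc"
    for layout in ("named", "split"):
        for task, options in sorted(rose_task_options(base, layout).items()):
            print(layout, task, options)
```

Either way the two tasks end up with different settings, which is the point made above. The difference is whether the distinction lives in a --name option or in the directory path.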

penguian commented 9 years ago

@martin.dix@anu.edu.au changed _comment0, which was not transferred by tractive

penguian commented 8 years ago

@jin.lee@bom.gov.au commented


The problem reported earlier (see comment:7) regarding checkout to /g/data/ seems not to be completely fixed:

accessdev:/g/data/dp9/jtl548/source/ops> fcm co fcm:ops.x/branches/dev/jinlee/r269_ops32.0.0_nci
...
...
A    r269_ops32.0.0_nci/src/public/OpsMod_SignalHandler/OpsMod_SignalHandler.f90
A    r269_ops32.0.0_nci/src/public/OpsMod_SignalHandler/sigtrap.c
svn: E155009: Failed to run the WC DB work queue associated with '/g/data1/dp9/jtl548/source/ops/r269_ops32.0.0_nci/src/public/OpsMod_SignalHandler', work item 3559 (file-install src/public/OpsMod_SignalHandler/OpsMod_SignalHandler.f90 1 0 1 1)
svn: E000013: Can't move '/g/data1/dp9/jtl548/source/ops/r269_ops32.0.0_nci/.svn/tmp/svn-DWuBzq' to '/g/data1/dp9/jtl548/source/ops/r269_ops32.0.0_nci/src/public/OpsMod_SignalHandler/OpsMod_SignalHandler.f90': Permission denied
[FAIL] svn checkout https://code.metoffice.gov.uk/svn/ops/main/branches/dev/jinlee/r269_ops32.0.0_nci # rc=1

Is someone able to revisit this problem?

penguian commented 8 years ago

@chris.allen@anu.edu.au commented


Storage team notified that we're still running into these filesystem errors.

penguian commented 8 years ago

@scott.wales@bom.gov.au commented


According to Jin, checkouts to /g/data work on raijin but not on accessdev.

penguian commented 8 years ago

@chris.allen@anu.edu.au commented


At the moment we're being advised that the upstream vendor expects to have a patch available before the end of this month.

penguian commented 8 years ago

@chris.allen@anu.edu.au commented


It appears that the /g/data1 filesystem errors seen from VMs should be resolved now. Could you please try again?

penguian commented 8 years ago

@martin.dix@anu.edu.au commented


Since 2015.05.0, Rose automatically sets --name=2 for fcm_make2 tasks (https://github.com/metomi/rose/pull/1604). Option 1 above is now clearly preferable because the job using the prebuild doesn't have to do anything special.

See access/RoseSuitePrebuilds for instructions on creating prebuilds from a rose stem job.

penguian commented 8 years ago

@martin.dix@anu.edu.au changed _comment0, which was not transferred by tractive

penguian commented 8 years ago

@martin.dix@anu.edu.au commented


An fcm issue meant that regular access group members couldn't use the prebuilds. See https://github.com/metomi/fcm/issues/226.

fcm/2016.02.0 has been patched on raijin to work around this.

It's fixed in fcm/2016.03.0.

penguian commented 8 years ago

@martin.dix@anu.edu.au changed _comment0, which was not transferred by tractive