Relative path names for fragments

davidhassell commented 2 years ago

CFA needs a way of indicating that the path names of the fragments are relative to "somewhere", as opposed to absolute path names.

This "somewhere" could be, amongst other things, the path of the directory containing the CFA file. This would satisfy the use case of a CFA file residing in the same directory as all of its fragments. When that directory is moved or renamed, the CFA file does not break if it is understood that the paths are relative to where the CFA file is.

It may also be desirable to store a common path for the fragments in another attribute to save space in the file variable.

This was possible in CFA-0.4 (see the base option on page 22 of https://github.com/NCAS-CMS/cfa-conventions/blob/v0.4/cfa.pdf), but is not currently available in CFA-0.6.1.

JonathanGregory commented 2 years ago

Dear David

In what situation would you need to store the fragments in more than one directory, or in a different directory from the CFA file?

Best wishes

Jonathan

davidhassell commented 2 years ago

Dear Jonathan,

A couple of use cases come to mind:

Perhaps if you are creating a CFA file for data that lives in locations for which you don't have write permission;
or perhaps you are making a single CFA file from outputs from multiple models which reside in different source directories.

In any event, I think that the user needs some control over the relationship between the path of the CFA file and the paths of the fragments, recognising that at some point in some workflows the absolute paths of some or all of these files (the fragments and the CFA file) may change. Such a change does not need to result in broken CFA files if the rules governing relative path interpretation are made clear.

Hope that makes sense, all the best, David

JonathanGregory commented 2 years ago

Dear David

Yes, those are good use-cases. With pph files (similar in principle) in cases like those, I have proceeded by making a new single directory with symlinks to all the relevant data files, and then compiling the pph file (like cfa) in that directory using its apparent contents. This avoids the need for directory names in the paths to data files.

Best wishes

Jonathan

davidhassell commented 2 years ago

Thanks, Jonathan. I see that creating a new directory with symlinks would certainly be possible with CFA. This example brings to mind another issue - if the CFA file is in the same directory as its data then a read function like cf.read reading the whole directory would, by default, read two copies of the data - one from the original netCDF and one from the CFA file.

>>> import cf
>>> f = cf.read('~/test/')
>>> f 
[<CF Field: air_temperature(time(12), latitude(64), longitude(128)) K>]
>>> cf.write(f, '~/test/test.nca', fmt='CFA4')
>>> f = cf.read('~/test/')
>>> f
[<CF Field: air_temperature(time(12), latitude(64), longitude(128)) K>,
 <CF Field: air_temperature(time(12), latitude(64), longitude(128)) K>]

You can filter by type, though:

>>> f = cf.read('~/test/', fmt='CFA')
>>> f
[<CF Field: air_temperature(time(12), latitude(64), longitude(128)) K>]

This is not necessarily wrong, but something to bear in mind.

JonathanGregory commented 2 years ago

Dear David

That's a good point. If there is a cfa file in the directory, perhaps it would be logical to read only that file?

Best wishes

Jonathan

davidhassell commented 2 years ago

@JonathanGregory wrote:

That's a good point. If there is a cfa file in the directory, perhaps it would be logical to read only that file?

I'll raise this as an issue that can be discussed over at https://github.com/NCAS-CMS/cf-python

davidhassell commented 2 years ago

Having discussed this here with @JonathanGregory and off-line with @sadielbartholomew, I'd like to propose some possible ways that this issue can be solved.

A fragment URI could be interpreted in various ways:

1. an absolute path
2. relative to the directory containing the CFA file
3. relative to the $CWD of the application reading the CFA file
4. any more?

I think there are identifiable use cases for 1. (for when the fragments don't move, but the CFA file does) and 2. (for when the fragments and the CFA file all move together). For case 2., I'm not discounting the fragments being in a different location to the CFA file, as long as after they've moved they're still in the same relative locations. E.g. the CFA file could be in the directory above the fragments' directory.

I can't think of a use case for 3. but included it for completeness, and suggest that it is not allowed by CFA, as only bad things can come of it.

So, how to discern when case 1. or case 2. is in use? I currently think that leaving this to inspection of the fragment URIs is too fragile, and so it must be indicated by a netCDF attribute. I propose putting this attribute on the file variable that actually contains the URIs:

     string aggregation_file(f_time, f_level, f_latitude, f_longitude) ;
          aggregation_file:path_type = "<absolute/relative>"  // placeholder attribute name/value!!!!

data:
      aggregation_file = "January-June.nc", "July-December.nc" ;

I would suggest that the attribute is optional and defaults to absolute paths.

These are just some initial thoughts - please say if you can see concerns that I've missed, or you think other approaches could be better.

Thanks, David

JonathanGregory commented 2 years ago

Dear David

Is it fragile to assume that a string starting with / is an absolute path, and anything else a relative path?

Best wishes

Jonathan

davidhassell commented 2 years ago

Dear Jonathan,

Apologies for the half year hiatus in this conversation. I've started using @nmassey001's CFA-Python and been thinking about CFA-0.6 in cf-python, so this has come back to my attention.

I'm not sure about your suggestion of the presence, or otherwise, of a leading /, as isn't that OS dependent? E.g. that wouldn't give a meaningful path on a Windows machine.

Could portability be an issue? What about when a CFA-netCDF file is created on a Windows machine and then shipped to a *nix machine, assuming that the fragment files are available to both?

Here's an idea that may solve all of these problems (?!): How about fragment file names are assumed to be relative to the location of the parent CFA-netCDF file unless they are given with a URL or a file URI, in which case they are necessarily absolute?

Thanks, David

bnlawrence commented 2 years ago

I'd like to return to: A fragment URI could be interpreted in various ways:

1. an absolute path
2. relative to the directory containing the CFA file
3. relative to the $CWD of the application reading the CFA file
4. any more?

I think we have a consensus that (2) is the default and that (1) is acceptable (but I agree with David that we should require a full URI rather than make OS-specific guesses), and, though no one else has commented, I agree with the assertion that (3) is dangerous and shouldn't be allowed.

I would like to add a (4), to support life cycle management. My use case is this: I have just run a model simulation, and generated the aggregation file. I would now like to keep the aggregation file, but move the data to tape. I do not want to update every fragment location. Further, I might want to now bring some of the fragments back from tape, but not all of them.

I propose that we support the following "bash-like" substitution pattern as well:

If a location is ${BASE}/directory/file, a valid CFA file must have an attribute BASE, and the value of that must be the beginning of a fully qualified URL, such that the expanded location is a URL. This would allow easy updating of the aggregation file when directory is moved to tape, and if, say, (half) the aggregation files are brought back, they could then have two locations, and the updated location could simply be a copy of the location, but with a different BASE, e.g. ${BASE2}/directory/file would be a replicant in a different location. This would considerably simplify aggregation file updates and data management in general.

davidhassell commented 2 years ago

Thanks, @bnlawrence. To summarise, at least as I now see it:

Case 1: Fragment file names are absolute paths. We know this because the file names are fully qualified URIs.

     string aggregation_file(f_time, f_level, f_latitude, f_longitude) ;
data:
     aggregation_file = "file:/a/path/January-June.nc", "file:/another/path/July-December.nc" ;

Case 2: Fragment file names are relative to the path of the parent CFA file. We know this because the file names are not fully qualified URIs:

     string aggregation_file(f_time, f_level, f_latitude, f_longitude) ;
data:
     aggregation_file = "January-June.nc", "July-December.nc" ;
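
As an illustration only, the distinction between cases 1 and 2 could be sketched in Python as follows. The function name `resolve_fragment` and the example paths are hypothetical and are not part of CFA or of any library; the sketch assumes that "fully qualified URI" can be tested by looking for a URI scheme with the standard library's `urllib.parse.urlsplit`.

```python
import posixpath
from urllib.parse import urlsplit


def resolve_fragment(name, cfa_file):
    """Return the location of a fragment (hypothetical helper).

    A name with a URI scheme such as "file:" or "https:" is treated
    as absolute (case 1) and returned unchanged. Any other name is
    taken to be relative to the directory containing the CFA file
    (case 2).
    """
    if urlsplit(name).scheme:
        return name
    return posixpath.join(posixpath.dirname(cfa_file), name)


# Case 2: the result follows the CFA file when its directory is moved
before = resolve_fragment("January-June.nc", "/archive/run1/cfa.nc")
after = resolve_fragment("January-June.nc", "/tape/run1/cfa.nc")

# Case 1: a fully qualified URI is independent of the CFA file location
fixed = resolve_fragment("file:/a/path/January-June.nc", "/tape/run1/cfa.nc")
```

With this rule no attribute is needed to tell the two cases apart, since each name is classified on its own.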

Case 3: Never to be spoken of again :)

Case 4: This case is basically the same as case 1 or 2 but with the possibility of having zero or more parts of the file names replaced by string substitutions defined by the substitutions attribute, which uses the typical "name: value" approach that is common throughout CF. The substitutions are indicated via the shell syntax of ${...}. We won't know if we have a case 1 or 2 situation until the substitutions have been applied, at which point we can see if it is a fully qualified URI, or not.

     string aggregation_file(f_time, f_level, f_latitude, f_longitude) ;
         aggregation_file:substitutions = "BASE: file:/a/path/ BASE2: file:/another/path/" ; 
data:
     aggregation_file = "${BASE}January-June.nc", "${BASE2}July-December.nc" ;  // absolute

     string aggregation_file(f_time, f_level, f_latitude, f_longitude) ;
         aggregation_file:substitutions = "SUBDIR: subdir/" ; 
data:
     aggregation_file = "${SUBDIR}January-June.nc", "${SUBDIR}July-December.nc" ;  // relative

These three approaches can be intermingled, with no ambiguity, as each file name is interpreted independently from the others:

     string aggregation_file(f_time, f_level, f_latitude, f_longitude) ;
         aggregation_file:substitutions = "BASE: file:/a/path/ BASE2: file:/another/path/"; 
data:
     aggregation_file = "${BASE}January-June.nc", "file:/another/path/July-December.nc", "file.nc" ;
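
A minimal Python sketch of how a reader might handle the intermingled example above is given here. It is not taken from any CFA implementation: the helper names `parse_substitutions` and `expand` are hypothetical, and the parsing assumes that substitution values contain no whitespace, which the proposal does not state.

```python
import re
from urllib.parse import urlsplit

# Matches ${NAME} and captures NAME
_PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def parse_substitutions(attr):
    """Parse "BASE: file:/a/path/ BASE2: ..." into a dictionary.

    Assumes whitespace-separated tokens that alternate between
    "NAME:" and a value with no embedded whitespace.
    """
    tokens = attr.split()
    return {key.rstrip(":"): value
            for key, value in zip(tokens[0::2], tokens[1::2])}


def expand(name, subs):
    """Replace each ${NAME} defined in subs, in a single pass.

    Undefined placeholders are left untouched.
    """
    return _PLACEHOLDER.sub(
        lambda m: subs.get(m.group(1), m.group(0)), name)


subs = parse_substitutions("BASE: file:/a/path/ BASE2: file:/another/path/")
names = ["${BASE}January-June.nc",
         "file:/another/path/July-December.nc",
         "file.nc"]

# Each name is expanded and then classified independently of the others
expanded = [expand(n, subs) for n in names]
is_absolute = [bool(urlsplit(n).scheme) for n in expanded]
```

Here the first two names end up as fully qualified URIs (case 1) and the third remains relative to the CFA file (case 2), which is only known once the substitutions have been applied.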

davidhassell commented 1 year ago

Hi @JonathanGregory and @bnlawrence,

I'm revisiting the open CFA issues, and wondered if you had any comments on the latest suggestion here (https://github.com/NCAS-CMS/cfa-conventions/issues/36#issuecomment-1287084491)?

It still looks good to me, in that the default when you don't specify any substitutions is sensible, and the flexibility for fancy data management is catered for.

Thanks, David

JonathanGregory commented 1 year ago

Dear @davidhassell

Yes, I think that would be fine.

Can one use the shell syntax without {} as well, which is fine and normal if followed by punctuation e.g. $BASE/January-June.nc?
Is parameter expansion done recursively? Probably not, but if not it might be helpful to say it's not.
If $NAME is found to be undefined by the substitutions attribute, can we try expanding it as a shell env var? That would allow an extra flexibility. One could use the same relative hierarchy for datasets but with a different base directory on various hosts without any change to the file.

Best wishes

Jonathan

davidhassell commented 1 year ago

Thanks, Jonathan. I shall think about these interesting points you raise ...

davidhassell commented 1 year ago

Dear @JonathanGregory

Can one use the shell syntax without {} as well, which is fine and normal if followed by punctuation e.g. $BASE/January-June.nc?

I don't think so, as it means we would have to be explicit about directory separators in any OS. For example, a . would probably not count as punctuation.

Is parameter expansion done recursively? Probably not, but if not it might be helpful to say it's not.

I would agree that recursive expansion should not be allowed.

If $NAME is found to be undefined by the substitutions attribute, can we try expanding it as a shell env var? That would allow an extra flexibility. One could use the same relative hierarchy for datasets but with a different base directory on various hosts without any change to the file.

I quite like this idea, but am a bit worried about inappropriate expansions happening without the user's knowledge. @bnlawrence - what do you think?
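
To make the last two points concrete, here is a hedged Python sketch of single-pass (non-recursive) expansion with an environment variable fallback that is off by default. Nothing here is agreed CFA behaviour: the function `expand_once` and its `allow_env` and `environ` parameters are hypothetical, and the opt-in flag is just one possible way of avoiding expansions that happen without the user's knowledge.

```python
import os
import re

# Matches ${NAME} and captures NAME
_PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def expand_once(name, subs, allow_env=False, environ=None):
    """Expand ${NAME} placeholders in a single pass.

    Replacement text is never rescanned, so a substitution value
    that itself contains ${...} is left as it is (no recursion).
    The environment is only consulted if allow_env is True.
    """
    env = os.environ if environ is None else environ

    def replace(match):
        key = match.group(1)
        if key in subs:
            return subs[key]
        if allow_env and key in env:
            return env[key]
        return match.group(0)  # undefined: leave the placeholder alone

    return _PLACEHOLDER.sub(replace, name)


# A value containing a placeholder is not expanded a second time
nested = expand_once("${A}f.nc", {"A": "${B}x/", "B": "file:/b/"})

# The environment fallback only applies when explicitly requested
fake_env = {"HOST_BASE": "/data/"}
without_env = expand_once("${HOST_BASE}f.nc", {}, environ=fake_env)
with_env = expand_once("${HOST_BASE}f.nc", {}, allow_env=True, environ=fake_env)
```

Passing the environment in as a dictionary, as done with `fake_env`, keeps the example independent of the variables defined on any particular host.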

davidhassell commented 1 year ago

To move things on, I have created PR #44 to address this. This is to focus our minds, not a done deal, of course.

Notes:

The new example is "1c", acknowledging the new example pending in #42
I opted for the definition of the substitution to also include the ${...} syntax. I felt that this made it easier to describe, and it also makes the file more readable by humans - the substitution environment variable (for want of a better word) appears in the URI exactly as it is defined by the substitutions attribute.

davidhassell commented 1 year ago

The new text says

The use of substitutions can save space in the file and, if the
fragment files were to be moved, provides a means of updating the
CFA-netCDF file without having to access the `file` variable's data.

The updating bit is true, but another use case is to read the file without applying substitutions that are known to be wrong, and then get your software to apply new, correct substitutions.

For example, if the CFA-netCDF file contained:

     string aggregation_file(f_latitude, f_longitude) ;
         aggregation_file:substitutions = "${BASE}: file:///a/path/";

     data:
         aggregation_file = "${BASE}January-June.nc" ;

but we knew that the data had been moved from file:///a/path/ then we might want to read the file name as a literal "${BASE}January-June.nc" and then apply a new substitution (e.g. ${BASE} = "file:///totally/different/path/") in memory.
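
The in-memory override could look something like the following Python sketch. It does not use any real cf-python or CFA-Python API: `parse_substitutions` and `apply_substitutions` are hypothetical helpers, the attribute is parsed in the form where the key includes the ${...} syntax, and values are assumed to contain no whitespace.

```python
import re

# Matches a whole ${NAME} placeholder, including the ${ and }
_PLACEHOLDER = re.compile(r"\$\{[^}]+\}")


def parse_substitutions(attr):
    """Parse "${BASE}: file:///a/path/" into {"${BASE}": "file:///a/path/"}."""
    tokens = attr.split()
    return {key[:-1] if key.endswith(":") else key: value
            for key, value in zip(tokens[0::2], tokens[1::2])}


def apply_substitutions(name, subs):
    """Replace each defined placeholder in a single pass."""
    return _PLACEHOLDER.sub(
        lambda m: subs.get(m.group(0), m.group(0)), name)


# What is stored in the CFA-netCDF file
stored = parse_substitutions("${BASE}: file:///a/path/")
literal = "${BASE}January-June.nc"

# The data are known to have moved, so override ${BASE} in memory only
override = {**stored, "${BASE}": "file:///totally/different/path/"}

original_location = apply_substitutions(literal, stored)
new_location = apply_substitutions(literal, override)
```

The file on disk is untouched in this sketch; only the dictionary used at read time changes.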

To reflect this I propose replacing the sentence given at the start of this comment with:

The use of substitutions can save space in the file and, in the event
that the fragment files have moved from their original locations, the
creation of the aggregated data can be facilitated by changing the
substitutions rather than the URI strings given by the `file`
variable.
