Include schematic diagrams to help explain the text

davidhassell commented 3 years ago

@sadielbartholomew, whilst reviewing PR #20, had the excellent idea of including one or more diagrams that help explain the text:

From https://github.com/NCAS-CMS/cfa-conventions/pull/20#pullrequestreview-691015428:

One general comment in the meantime is that it would be nice to have at least one schematic in this standalone canonical document, to provide a visual overview to accompany the text. Of course there shouldn't be one just for the sake of it, but now I think is a good time to look over the text and check if there is anything that might work well for an outline or otherwise useful diagram.

Perhaps a simple high-level diagram showing the relationships and/or roles of the main concepts defined in the 'Aggregation variables' section (the variables themselves, aggregated data, aggregated dimensions, fragments and their dimensions, etc.)? And/or something that emphasises that CFA sits alongside CF and the relation to netCDF data and metadata in each case?

This was discussed briefly in PR #20 but then moved to this issue so that it didn't hold up that PR being accepted.

davidhassell commented 3 years ago

Initial, not-too-deeply-thought-about UML suggestions:

1)

2)

nmassey001 commented 3 years ago

[Image: cfa_class_diagram]

I think we can just use numpy.dtype for the Datatype, can't we? Or do we need something more?

davidhassell commented 2 years ago

Hi,

Coming at it from an encoding-independent view point, how about (source code at end):

[Image: cfa_data_model_2021-08-10]

I think something along these lines is what we want for this document, and then Neil's UML would be the data model of the implementation, rather than the pared-down logical connections.

Not 100% convinced about my arrow heads and tails, as ever!

Thanks, David

# ====================================================================
# Source code. Create with:
#
# $ dot -T png file.gv -o file.png
# ====================================================================

digraph {splines=ortho nodesep="+0.25"

node [
     style="filled,bold"
     shape=rectangle
     fillcolor="#FFA533"
     width=1.5
     height=0.7
     fontname="Arial"
     ]

# --------------------------------------------------------------------
# CF data model constructs
# --------------------------------------------------------------------
AggregationVariable [
       label="AggregationVariable"
       ]

AggregationInstructions [                     
      label="AggregationInstructions"
      ]

AggregatedDimension [
       label="AggregatedDimension"
       ]
FragmentDimension [
       label="FragmentDimension"
       ]
Fragment [
        label="Fragment"
        ]

AggregatedData [
        label="AggregatedData"
        ]   

edge [dir=both
      arrowsize=1.0
      fontname="Arial"
      labelfontsize=11.0
      ]

AggregationVariable -> AggregationInstructions [arrowhead=diamond arrowtail=vee]

AggregationVariable -> AggregatedDimension [arrowhead=odiamond arrowtail=vee taillabel="0..*   "]

{rank=same; AggregationInstructions, AggregatedDimension}

{rank=same; FragmentDimension, Fragment}

AggregationInstructions -> Fragment [arrowhead=odiamond arrowtail=vee taillabel="0..*  "]

FragmentDimension -> Fragment [arrowhead=none arrowtail=vee taillabel="  0..*  " ]

FragmentDimension -> AggregatedDimension [arrowhead=vee arrowtail=none]

AggregatedData -> AggregationInstructions [arrowhead=odiamond arrowtail=vee]

AggregationVariable -> AggregatedData [arrowhead=diamond arrowtail=vee]

}

davidhassell commented 2 years ago

... also haven't worked out yet if "AggregationInstructions" is a logical entity, or not ...

nmassey001 commented 2 years ago

I’ve compared this to my UML, and it does seem like a distillation of what I have, plus the “Aggregation Instructions”. Which is good, as it shows we have a similar idea as to what the classes should be! :)

I think we can try to make the pared-down logical connections and the data model as close as possible. I'd like the data model to be a superset of the pared-down model, rather than distinct from it. I think, from your diagram, that we can work toward that.

We can try to think through what the "AggregationInstructions" mean, and what form they should take. I'll have a look through the document again and have a think.

nmassey001 commented 2 years ago

Okay, it took a bit of thinking (although I am slow in my between-holidays week!), but I'm happy with this: [Image: cfa_class_diagram_new]

I think it works nicely separating the AggregationInstructions and AggregatedData, from a parsing point of view.

sadielbartholomew commented 2 years ago

I've been quiet here whilst the diagrams were being formulated, but thought I would jump in at this point to say that both the data model UML (from Neil) and the pared-down logical connection schematic (from David) are looking like very useful and clear condensations of the concepts and to agree with Neil:

I’ve compared this to my UML, and it does seem like a distillation of what I have, plus the “Aggregation Instructions”

the two (current) diagrams seem consistent, also, as far as I can tell.

So it would be great to get both diagrams included in the document as soon as you are both happy with them and Bryan has looked over them and is also satisfied. Great stuff!

davidhassell commented 2 years ago

Hi Neil,

Thanks. This is getting interesting! I don't think we're quite there, yet, though ....

I think we can try to make the pared-down logical connections and the data model as close as possible.

This is where we differ. I think the pared-down logical connection view is the CFA data model. The data model should be the starting point of any software implementation, and allow for different encoding of CFA datasets.

Be assured, I'm absolutely not claiming my UML is already all there! I found it useful to compare the difference between the two views (all this is written in good spirits and reflects my current thinking, which is certainly plastic!):

There is a fundamental difference between the two: in the logical model the Fragment can exist without reference to the AggregationVariable, but that is not the case in the implementation model. The fact that a fragment exists without requiring anything associated with the AggregationVariable is, I think, key to these conventions. These differences manifest themselves in the implementation data model as:

The address, format, etc. components are placed in Fragment (I think they are only components of the AggregationInstructions)
A Fragment is an aggregation of FragmentDimension. (I think it is merely associated.)
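One way to picture the two points above is the following minimal Python sketch of the logical view. It is only an illustration under stated assumptions, not part of the conventions or of any implementation: the `shape` attribute, the `entries` mapping, and the file name `jan-jun.nc` are hypothetical. The fragment carries no locating details and no reference back to any aggregation; the file, format, and address components live only in the aggregation instructions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """An independent array that knows nothing about any aggregation."""

    shape: tuple  # hypothetical attribute, for illustration only


@dataclass
class AggregationInstructions:
    """Holds the details needed to locate each fragment."""

    # Hypothetical layout: one (file, format, address) entry per fragment.
    entries: dict


# The fragment can be created before, and without, any aggregation variable.
frag = Fragment(shape=(6, 4))

# Only the instructions record where the fragment's data is to be found.
instructions = AggregationInstructions({frag: ("jan-jun.nc", "nc", "tas")})
```

In this sketch, deleting `instructions` leaves `frag` untouched, which mirrors the idea that a fragment exists without requiring anything associated with the AggregationVariable.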

Also:

Surely the AggregationVariable needs to be composed of the AggregationInstructions, as well?
I'm not sure that the AggregatedData can exist independently of the AggregationVariable. In a normal netCDF variable, the variable is composed of its data, so when the AggregatedData is created it seems right that the same connection should apply in our model.
Similarly, perhaps DataType is only a feature of the AggregationVariable and a Fragment, and not the AggregatedData

More generally, I don't think we should replicate elements of the netCDF data model in ours, such as the "name" and "size" of a dimension. It is good to say, for instance, that in the netCDF encoding a FragmentDimension corresponds to a netCDF Dimension, but we don't need to (and shouldn't) hard-wire the netCDF encoding into the data model.

Cheers, David

nmassey001 commented 2 years ago

This is where we differ. I think the pared-down logical connection view is the CFA data model. The data model should be the starting point of any software implementation, and allow for different encoding of CFA datasets.

I think we agree, and I just worded it badly! :) The implementation model should be a specialisation of the data model.

I'm still getting my head around composition vs aggregation. Can I think of it as: in composition the object contains the other object (in a list or as a variable, for example) and in aggregation, the object contains a reference to the other object?
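For what it is worth, the usual UML reading is about ownership and lifetime rather than about reference versus containment: a composed part belongs to exactly one whole and does not outlive it, whereas an aggregated part can exist independently and be shared. The Python sketch below illustrates that reading under stated assumptions. The `size` attribute and the constructor arguments are hypothetical, and because Python holds every object by reference, the distinction here is one of convention rather than something the language enforces.

```python
from dataclasses import dataclass, field


@dataclass
class AggregatedDimension:
    size: int  # hypothetical attribute, for illustration only


@dataclass
class AggregatedData:
    units: str


@dataclass
class AggregationVariable:
    name: str
    units: str
    # Aggregation: the dimensions are created elsewhere and only referenced
    # here, so they can be shared between variables and can outlive them.
    dimensions: list
    # Composition: the data is created by, and belongs only to, this variable.
    data: AggregatedData = field(init=False)

    def __post_init__(self):
        self.data = AggregatedData(units=self.units)


time = AggregatedDimension(size=12)
tas = AggregationVariable("tas", "K", [time])
pr = AggregationVariable("pr", "kg m-2 s-1", [time])
```

Here `tas` and `pr` share the single `time` object (aggregation), while each has its own `data` object that nothing else refers to (composition).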

Surely the AggregationVariable needs to be composed of the AggregationInstructions, as well?

Returning to this today: absolutely!

I'm not sure that the AggregatedData can exist independently of the AggregationVariable. In a normal netCDF variable, the variable is composed of its data, so when the AggregatedData is created it seems right that the same connection should apply in our model.

Yes, confusion about composition and aggregation.

Similarly, perhaps DataType is only a feature of the AggregationVariable and a Fragment, and not the AggregatedData

I think it could be either, but I'm happy to move it.

nmassey001 commented 2 years ago

[Image: cfa_class_diagram_new]

nmassey001 commented 2 years ago

PlantUML source:

@startuml

class DataType {
}

class Fragment {
    +int location
    +string file
    +string format
    +string address
    +string units
}

class AggregatedData {
    +string units
}

class AggregationInstructions {
    +string location
    +string file
    +string format
    +string address
}

class AggregatedDimension {
}

class FragmentDimension {
}

class AggregationVariable {
    +string name
}

AggregationVariable "1" o--> "0..*" AggregatedDimension
AggregationVariable "1" *--> "1" AggregatedData
AggregatedData "1" *--> "0..*" Fragment
Fragment "1" o--> "0..*" FragmentDimension
AggregatedDimension "1" o--o "1" FragmentDimension : ordered
AggregationVariable "1" *--> "1" AggregationInstructions
AggregationVariable "1" o--> "1" DataType
Fragment "1" o--> "1" DataType

@enduml

bnlawrence commented 2 years ago

I think some of the confusion between the views and the composition/aggregation is around the difference between the Fragment as a variable in the CFA definition, which defines something about a Fragment, and the Fragment which is a file containing that data. Since in most cases the fragment (file) contains only fragment (data) which is pointed to by the fragment (variable in the CFA master file) ... we can and do get lazy about which is which. Can we come up with a clearer nomenclature for these three usages?

davidhassell commented 2 years ago

Hi Bryan,

Could you elaborate on what you mean by "a variable in the CFA definition"?

The CFA Fragment is "An independent, possibly self-describing, array that defines a contiguous part of the aggregated data. The aggregated data is composed from a multi-dimensional orthogonal array of fragments." (https://github.com/NCAS-CMS/cfa-conventions/blob/master/source/cfa.md#Terminology). Whether a Fragment is a variable in the CFA-netCDF file, or is somehow stored in another file (with or without other data), is neither here nor there.
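To make "a multi-dimensional orthogonal array of fragments" concrete, here is a minimal Python sketch that works out which contiguous part of the aggregated data each fragment covers. It is an illustration only: the function name `fragment_slices` and its input format (a list of fragment sizes for each aggregated dimension) are assumptions made for this example, not the encoding defined by the conventions.

```python
from itertools import accumulate, product


def fragment_slices(sizes_per_dim):
    """Map each fragment's index in the fragment array to the slices of
    the aggregated data that it covers.

    sizes_per_dim holds, for each aggregated dimension, the sizes of the
    fragments along that dimension (a hypothetical input format).
    """
    per_dim = []
    for sizes in sizes_per_dim:
        ends = list(accumulate(sizes))  # running end position of each fragment
        starts = [0] + ends[:-1]        # each fragment starts where the last ended
        per_dim.append([slice(a, b) for a, b in zip(starts, ends)])

    # "Orthogonal" means every combination of per-dimension positions
    # identifies exactly one fragment.
    index_ranges = [range(len(slices)) for slices in per_dim]
    return {
        index: tuple(per_dim[d][i] for d, i in enumerate(index))
        for index in product(*index_ranges)
    }


# Aggregated data of shape (12, 4): two fragments along the first dimension
# (sizes 6 and 6) and one along the second (size 4).
layout = fragment_slices([[6, 6], [4]])
```

Nothing in this calculation depends on where a fragment is stored, which is consistent with the storage question being "neither here nor there" for the definition.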

Apologies if I've not sensed the point of your post!

bnlawrence commented 2 years ago

I think the sense of my point is that whether something is composed or aggregated depends on whether one is thinking about it as "an array" and "part of something described inside the current scope" (e.g. the CFA Fragment usage) or the thing that is pointed to in the content (attributes) of that array. So we have CFA Fragments and Fragments ... the former is composed and the latter is aggregated ... I think.

So from a UML point of view, is the UML describing the information model held in the file, or is it the information model for the things described by the file?

davidhassell commented 2 years ago

So from a UML point of view, is the UML describing the information model held in the file, or is it the information model for the things described by the file?

For me, it should be the latter.

sadielbartholomew commented 1 year ago

Hi @davidhassell, @nmassey001, @bnlawrence: please can we revive this? At this point we have a stand-alone v0.6 Conventions document with several examples outlined, giving a comprehensive overview, but ultimately the entire document is still pure text, which is quite intimidating. For that reason, and because we have some definite ideas fleshed out here already that may be ready for use or near enough, we should try to add in a schematic or two soon, I think.

Both Neil's and David's ideas, as covered above, look really useful. From my reading of the above thread, the latest formulations of concrete ideas are David's diagram, as covered in https://github.com/NCAS-CMS/cfa-conventions/issues/21#issuecomment-895910161, and Neil's diagram, as covered in https://github.com/NCAS-CMS/cfa-conventions/issues/21#issuecomment-897732276 (image) and https://github.com/NCAS-CMS/cfa-conventions/issues/21#issuecomment-897735752 (source). We are at the following state of agreement/review with regard to each: both diagrams are generally found to be consistent, with perhaps a few requested tweaks to one or both. But, perhaps as a side issue, Bryan (again as I understand it, though I could be misinterpreting) wants (see https://github.com/NCAS-CMS/cfa-conventions/issues/21#issuecomment-901012400) the terminology to be made clearer with regard to fragments (potentially in the text, not just the diagrams?):

Since in most cases the fragment (file) contains only fragment (data) which is pointed to by the fragment (variable in the CFA master file) ... we can and do get lazy about which is which. Can we come up with a clearer nomenclature for these three usages?

So as far as I can see, to go forwards we need to:

agree on how to address Bryan's concerns, as above;
agree upon the final versions of the two diagrams, taking the former into account;
check, since this issue went quiet about 1.5 years ago, that nothing has changed with regard to CFA that might need to be reflected in the diagrams.

And then we can put in a PR (or two, if the first issue is a wider aspect rather than a sub-issue of this relating to the diagrams, which isn't clear to me from the above comments).
