ecmwf / multio

MultIO is a runtime-configurable multiplexer for Weather Model output of GRIB data
Apache License 2.0

local-to-global reordering before raw dumping #9

Closed dsidoren closed 6 months ago

dsidoren commented 1 year ago

https://github.com/ecmwf/multio/blob/d3d8781861debb35f8df19a8970475f7cc8ca61e/src/multio/api/multio_api.f90#L44

We wrote a Fortran example for writing a field in the unstructured representation (just a vector) to a raw file. For this we use N cores and define the local-to-global index mapping as:

    !each processor contains 100 values of a global field
    buffer=[(i+myrank*100, i=1,100)]
    cerr = mio%write_domain(md, buffer, size(buffer))

Then we write a test vector field into a binary file:

    values = [(i+myrank*100-1, i=1,100)]
    cerr = mio%write_field(md, values, size(values))

It turns out that the global indexing order is not taken into account when the raw file is dumped. Namely, the processors that call "write_field" first write the first chunks of data. This obviously changes the order of the raw file content from run to run.

We were expecting the local-to-global reordering to be applied before the data is dumped. Is there any option to force mio to do so?

Thank You!

suvarchal commented 1 year ago

Elaborating more on the above issue and the rationale, in the hope of gathering more attention ;).

We are trying to use multio in our unstructured-grid ocean model, FESOM, as a backend for IO, in particular to write GRIB.

The API test examples like multio-replay are focused on the structured grid representation, and we failed to adapt them easily to our grid. So we needed a simple example that we can build upon for our unstructured model output. In the simplest sense the global fields are 1d, so we made a simple prototype to shed light on how to use multio in our model: chunk a global field to emulate parallel computation and write the output via multio as a single file. As we are also not experts in GRIB, it was logical to simplify further and just write raw binary.

Here is a snippet of our working prototype for 7 clients and 1 server that dumps raw binary, either as a single file or one per server:


       cerr = cc%mpi_client_id("oce")
       cerr = multio_initialise()
       cerr = mio%new(cc)

       cerr = mio%open_connections()    
       cerr = md%new()

       cerr = md%set_string_value("name", "Fesom Grid")
       !cerr = md%set_string_value("name", "any") ! also works
       cerr = md%set_string_value("category", "ocean-domain-map")
       cerr = md%set_string_value("representation", "unstructured")
       cerr = md%set_int_value("globalSize", 700 )
       cerr = md%set_bool_value("toAllServers", .TRUE._1)

       buffer=[(myrank*100+i-1, i=1,100)]

       cerr = mio%write_domain(md, buffer, size(buffer))

       !Writing fields

       cerr = md%new()

       cerr = md%set_string_value("category", "ocean-2d")

       cerr = md%set_int_value("globalSize", 700)
       cerr = md%set_int_value("level", 0)

       cerr = md%set_bool_value("toAllServers", .FALSE._1)

       cerr = md%set_string_value("name", "sst")
       cerr = md%set_string_value("param", "sst")
       cerr = md%set_int_value("step", 1)
       cerr = md%set_string_value("gridSubType", "Fesom Grid") ! also works without
       cerr = md%set_string_value("domain", "Fesom Grid") ! also works without
       cerr = md%set_string_value("typeOfLevel", "oceanSurface") ! also works without
       cerr = md%set_int_value("globalSize", 700 )
       values = [(myrank*100+i-1, i=1,100)]
       cerr = mio%write_field(md, values, size(values))

and the corresponding configuration (client and server plans):

client:
  on-error: abort-all-transports
  plans:

    - name : ocean-replay-grid-info-stream
      actions :
        - type : select
          match :
           - category : [ocean-domain-map, ocean-mask]

        #- type: print # used to debug on screen

        - type : transport
          target : server

    - name : ocean-replay-test-stream1
      actions :
        - type : select
          match :
           - name: [ sst, ssv ]

        - type : transport
          target : server

server:
  transport : mpi
  #group: world
  plans :
    - name : ocean-fields
      actions :
        - type : select
          match :
           - category : [ocean-2d]

        - type : encode
          format : raw

        - type : sink
          sinks :
            - type : file
              #append : true
              #per-server : true
              path : ocean-output-field.bin # read in python as np.fromfile("..ocean-output-field.bin", np.double)

The expected output for 7 clients is 0-699 in that order, but the output data is in a random order on each run, as if the server dumps data to the file as it receives it from the different processes. In general, I would imagine there is a need to keep an association between a rank's local indices and the global indices they map to, and I can't see how this happens with the metadata and data we send in the example. What are we missing?

I can imagine many things that may have gone wrong in the example above compared to the working structured-data example: not using communicators as intended; gathering working only for GRIB and not raw binary because write_domain takes more data in the structured case; or, say, lat-lon coordinates being used internally to sort the gathered data in the structured case...

While we explore more, @dsarmany @simondsmart, it would be very helpful if you could shed some light on this, at least on how the multio server knows how to map local data to global indices. Are we missing any metadata for the domain or the fields?

dsarmany commented 1 year ago

Yes, you are right. With that configuration, the 'server dumps data to a file as it receives from different processes'.

If what you'd like to do is to create a global, aggregated field from the partial fields (sent from each client to a server), you will need to add an 'aggregate' action to your configuration file. That will use the local-to-global mapping information you communicate with cerr = mio%write_domain(md, buffer, size(buffer)) and create a global field where the values are placed according to that mapping.
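As a sketch, the server plan from your configuration above could gain that step between 'select' and 'encode' (the action type is spelled 'aggregation'; the placement shown here is for illustration):

server:
  transport : mpi
  plans :
    - name : ocean-fields
      actions :
        - type : select
          match :
           - category : [ocean-2d]

        - type : aggregation

        - type : encode
          format : raw

        - type : sink
          sinks :
            - type : file
              path : ocean-output-field.bin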

If the test_multio_hammer_mpi test passes, it suggests that the aggregation action with unstructured mapping representation behaves as expected. Here is the server-side configuration for that test: https://github.com/ecmwf/multio/blob/develop/tests/server/config/test-configurations.yaml#L5 .

I appreciate that currently there is no fortran test that replicates this behaviour. It is something we may add.

Looking at your fortran code, I think it should work once you add the 'aggregate' action. The only comment is that you probably want to call md%delete() after both cerr = mio%write_domain(md, buffer, size(buffer)) and cerr = mio%write_field(md, values, size(values)).
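For illustration, a minimal sketch of where those calls would go (assuming md%delete() returns an error code like the other metadata calls):

       ! domain part: metadata is created, used for write_domain, then freed
       cerr = md%new()
       ! ... set the domain metadata as in your snippet ...
       cerr = mio%write_domain(md, buffer, size(buffer))
       cerr = md%delete()

       ! field part: a fresh metadata object, used for write_field, then freed
       cerr = md%new()
       ! ... set the field metadata as in your snippet ...
       cerr = mio%write_field(md, values, size(values))
       cerr = md%delete()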

suvarchal commented 1 year ago

@dsarmany thanks a lot! that was helpful.

After using the aggregation action, here comes an issue that I am actually feeling good about :D, since it matches my hunch that some domain-mapping info is missing in my example:

Exception: Cannot find domainMaps for Fesom grid
Exception: FailureAware<Action> with behaviour "propagate" on Server for context: [
Aggregation(for 1 fields = [
  --->  {"category":"ocean-2d","globalSize":700,"level":0,"toAllServers":false,"name":"sst","param":"sst","step":1,"domain":"Fesom grid","typeOfLevel":"oceanSurface"}]) with Message: Message(version=1, tag=Field, source=Peer(group=multio,id=3), destination=Peer(group=multio,id=7), metadata={"category":"ocean-2d","globalSize":700,"level":0,"toAllServers":false,"name":"sst","param":"sst","step":1,"domain":"Fesom grid","typeOfLevel":"oceanSurface"}, payload-size=800)
]

So how can we create a domain map for an unstructured grid? In the case of the replay example using NEMO data, it seems to me that the domain mapping is done automatically by writing the fields lat_T, lon_T (for, say, domain T grid), or is it in the data passed to mio%write_domain? Can you please confirm that? Now I am wondering, in the case of an unstructured grid, what fields are needed for the domain; indices seem the most obvious thing, but then what is the name to use for that?

On

I appreciate that currently there is no fortran test that replicates this behaviour. It is something we may add.

Just a comment: thanks, any test, even in C/C++ using the multio API, is appreciated, because it is the API that we intend to use in the end. In the case of an example like hammer, it uses the multio library directly and I find it hard to map all the direct library calls to the corresponding API calls. For instance, in https://github.com/ecmwf/multio/blob/45ff382cc798e6b185806485dbf60dc9b189ea01/src/multio/tools/multio-hammer.cc#L385 there is an index map for the unstructured grid and I am not sure where that is used or how it is communicated to the server.

dsarmany commented 1 year ago

The exception tells you that the domain named 'Fesom Grid' has not been communicated to the server before a field associated with that domain has arrived for aggregation. Looking at the code you posted before, it is not immediately clear why that is happening. Did you change that to call delete on the metadata once it is passed to multio?

Apart from that, your code snippet seems correct at a glance. You need to call write_domain before you call write_field and make sure that the 'name' key of the domain has the same value as the 'domain' key of the field.
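In terms of your snippet, that means using the identical string on both sides; a condensed sketch of the relevant lines:

       ! domain message: the 'name' key identifies the domain map
       cerr = md%set_string_value("name", "Fesom Grid")
       cerr = md%set_string_value("category", "ocean-domain-map")
       cerr = mio%write_domain(md, buffer, size(buffer))

       ! field message: 'domain' must carry the same value as 'name' above
       cerr = md%set_string_value("domain", "Fesom Grid")
       cerr = mio%write_field(md, values, size(values))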

There is no need to worry about lon_T and lat_T. They are not related to the domain information. They are the coordinates output as fields and we need them in NEMOv4 for an unrelated reason. I don't think you will need them.

So I think your best bet at this stage is to create a small example (with far fewer than 700 points) that reproduces this behaviour. Then you can try running it with MULTIO_DEBUG=1. That will give you a lot more output, but hopefully not so much as to swamp you. Then we may be able to understand why the domain is not recognised.

dsidoren commented 1 year ago

Thank you, it works now with "- type : aggregation" and other modifications in the yaml. We also tried setting "step" and "level" in order to mimic temporal 3D output. It works for "step", but the levels are reordered in the written binary file.

  do j=1, 10
       .....
       cerr = md%set_int_value("level", j)
       cerr = mio%write_field(md, values, size(values)) ! we write each level individually
       .....           
  end do

As a result, the first written level is level 4, then 5, 1, 7, etc. Is there any easy way to dump them in ascending/descending order?

Thank you!

dsarmany commented 1 year ago

The short answer is that no, unfortunately not, or at least not easily.

By default, MultIO treats every step and every level as a conceptually different field, so there is no attempt to preserve any order among steps and levels. That is mainly because the output is usually in GRIB format and the order of GRIB messages in a given file makes no difference.

That said, MultIO also supports the computation of temporal statistics, in which case different steps of the same physical quantity need to arrive in order. That can be achieved, and it is also not a strict requirement, as there is usually a natural lag between the computation of subsequent steps.

Currently there is no such mechanism for levels, but it is certainly something we can consider adding. What about raising a separate ticket for that as it does not seem to be related to local-to-global mappings anymore?

dsarmany commented 1 year ago

Currently there is no such mechanism for levels, but it is certainly something we can consider adding. What about raising a separate ticket for that as it does not seem to be related to local-to-global mappings anymore?

Actually, could you try modifying the transport action in your configuration file like this:

    - name : ocean-replay-test-stream1
      actions :
        - type : select
          match :
           - name: [ sst, ssv ]

        - type : transport
          target : server
          hash-keys: [category, name]

This may just do the trick. Fields with different levels should then arrive in the order in which the model calls mio%write_field on them.

dsidoren commented 1 year ago

Thank you, that solved the problem. We finally managed to integrate multio into FESOM and it seems to work. We noticed, however, that it does not accept overlapping partitionings. As an example, when running on N processes with a global domain size S, multio requires that the sum of all subdomain sizes over the N processes equals S. If this condition is not met, the code runs without any error message but the resulting file is just empty. Changing the logic inside FESOM (passing only "shrunk" arrays to multio such that sum(s_i) = S) solved the problem. The question is whether you plan to support overlapping subdomains in the future.
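For reference, the "shrinking" amounts to passing only the locally owned part of each array; a sketch with a hypothetical n_owned count (halo/overlap points excluded):

       ! hypothetical sketch: n_owned points are owned by this rank (no halo),
       ! so the owned counts over all N processes add up to the global size S
       cerr = mio%write_domain(md, buffer(1:n_owned), n_owned)
       cerr = mio%write_field(md, values(1:n_owned), n_owned)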

dsarmany commented 1 year ago

If you are using 'unstructured' domain indexing, then yes, MultIO currently assumes that there are no duplicate indices into the global domain (no overlapping partitioning). With overlapping subdomains, there will be duplicate values for some global indices, in which case MultIO will either have to know which is correct or have to assume it does not matter because they are all equal.

For 'structured' domain indexing, MultIO does in fact support overlapping subdomains, as long as it follows the convention described in Sections 5.2.2 and 5.2.3 from here: https://forge.ipsl.jussieu.fr/ioserver#:~:text=Documentation-,XIOS%20user%20guide,-XIOS%20reference%20guide .

geier1993 commented 7 months ago

Can this be closed? It seems like FESOM is running now. I don't know if the last comment about non-overlapping subdomains for unstructured grids is still relevant.

Could be a feature request, if that is required in the long run.