A convention for complex numbers?

shoyer commented 5 years ago

This may not come up often for Climate and Forecast use-cases, but many physical scientists are interested in storing arrays of complex numbers. This has come up with some regularity for xarray users, e.g., see https://github.com/pydata/xarray/issues/3297 for the most recent issue.

There does not appear to be a standard for how to store complex numbers in netCDF files, so scientists are currently presented with two poor options:

Use their own de-facto convention, unsupported by software, or
Rely on non-standard extensions of netCDF, e.g., invalid_netcdf=True with h5netcdf: https://github.com/shoyer/h5netcdf#invalid-netcdf-files

I would love to resolve this with a standard way to store complex values in netCDF file, so xarray doesn't have to invent its own standard or encourage writing HDF5 files that aren't valid in netCDF.

Some questions:

Does it make sense to put this in CF-Conventions (the de-facto standard for how to write netCDF files), even though there do not appear to be climate/forecast use-cases for this?
What should this look like? I can think of three reasonable approaches: a. Use compound data types in a single array, like h5py and HDF5.jl (see https://github.com/JuliaIO/HDF5.jl/pull/558). Advantages: this is an existing standard already in use. Disadvantages: requires the HDF5 data model; netCDF-C lacks support for reading some types compound data types (https://github.com/Unidata/netcdf-c/issues/267), including the convention used by h5py. b. Store real and imaginary parts in separate arrays, with some sort of metadata convention for indicating that they are the same logical array. Advantages: easy to interpret merely by inspection; easy to access real or imaginary parts separately. Disadvantages: a new convention; reading data into complex dtypes in memory will likely require an additional copy to combine real/imaginary parts. c. Store real and imaginary parts in a single array with an extra dimension of length 2 at the end for real and imaginary parts. Advantages: still backwards compatible with old netCDF formats; can be memory mapped without a copy. Disadvantages: slightly less self-explanatory (user needs to understand the dimension mapping); adds an extra dangling dimension of length two in the dataset.

DocOtak commented 5 years ago

Re question 1: I think CF is a good place for this, however... the inertia you will encounter for getting any of these added is probably going to be large. CF is very slowly adopting netcdf4 features, e.g. strings were added just today.

When looking though the mailing list, proposals for standard names take the form "imaginary_part_of_fourier_transform_of_air_pressure_wrt_time" (and corresponding "real_part" names). None of these have made it into the actual standard name list. These would be using option 2b above.

The netcdf4-python docs use option 2a above as an example (compound types): https://unidata.github.io/netcdf4-python/netCDF4/index.html#section10 This method seems to be the way that the library authors think you should store data like this.

Option 2b seemed to be the one favored in a mailing list 2010 discussion, but there was no conclusion. This was about 2 years after netcdf4 was introduced and option 2a was "unavailable" or at least too new. http://mailman.cgd.ucar.edu/pipermail/cf-metadata/2010/003517.html

Option 2c is what the CFRadial folks have proposed with an attribute which indicates that the last dim is two reals representing a complex. https://cf-trac.llnl.gov/trac/ticket/169

shoyer commented 5 years ago

It's nice to see that somebody has tried each of these approaches before!

With a compound dtype, I can use netCDF4-Python to create complex values that can be read by h5py:

import netCDF4
import numpy
import h5py

size = 3
datac = numpy.exp(1j*(1.+numpy.linspace(0, numpy.pi, size)))

with netCDF4.Dataset("complex.nc","w") as f:
  complex128 = numpy.dtype([("r",numpy.float64), ("i",numpy.float64)])
  complex128_t = f.createCompoundType(complex128, "complex128")
  x_dim = f.createDimension("x_dim", None)
  v = f.createVariable("var", complex128_t, "x_dim")
  data = numpy.empty(size,complex128)
  data["r"] = datac.real
  data["i"] = datac.imag
  v[:] = data

with h5py.File('complex.nc') as f:
  print(f['var'][...])

But unfortunately it does not seem to be possible to do it the other way around -- netCDF-C can't read the sort of custom dtypes that h5py creates. This could presumably be fixed either in netCDF-C or in h5py.

I think my favored solution would be option 2c -- the extra dimensions for real/imag parts feels like the most efficient/compact solution. It's backwards compatible with netCDF3 (still useful for many purposes!) and also with newer format like zarr (though zarr also supports complex values directly). From the perspective of a tool like xarray it avoids the issue of needing to combine attributes from two different variables into a single variable.

JimBiardCics commented 5 years ago

I agree that option 2c is likely the best to maximize compatibility across different version of netCDF, etc. Splitting an imaginary number into two variables is (to me) counterintuitive and counterproductive. An attribute, such as the CF-Radial is_complex attribute, is probably needed to make it clear that a variable contains complex numbers. The CF-Radial proposal also points out the possibility that the complex number could take a radial/exponential form. I think this could be best handled by specifying the form in the is_complex attribute (or whatever name was chosen). The value could be one of 'cartesian' or 'radial'. We would also need to resolve the units issue for the radial/exponential case, either by specifying the units for the angular argument explicitly in the convention, by adopting the CF-Radial proposal for a comma-separated pair of units, or by some other method.

shoyer commented 5 years ago

What are the use cases for storing complex numbers in polar form? I guess it allows you to load half the data from disk, if you only need one of the terms?

From a simplicity perspective, splitting along the last dimension into real and imaginary components feels the cleanest to me -- these map directly into numeric types.

This avoids the issue of specifying multiple units in a single string, which feels a little strange to me. If you need to keep track of different units for the magnitude and phase, then I think multiple variables would be appropriate.

DocOtak commented 5 years ago

With the caveat that none of my work involves complex numbers: Is it valid to say the these two options, either cartesian or radial ('eulerian'?), are actually "encodings" of a "single" complex number?

I ask this because if this is an "encoding" then perhaps the units wouldn't really apply until after you do the "decode" step. The representation of the values "at rest" aren't as connected as the original CF Radial proposal would imply. Similar to how add_offset and scale_factor work.

JonathanGregory commented 5 years ago

This issue is certainly of interest in the CF community and has been discussed before, most recently by CF-Radial, as mentioned. I favour option 2b, because

It resembles how we deal with physical vectors, like velocities, in which each component is a separate variable. This is a common and familiar practice, not only in CF data but in other formats for climate and forecast data. Of course it has the drawback that the association between the components is not immediately obvious, and there have also been discussions about adding metadata to record the association. On the other hand, it's also an advantage that you can easily access the components separately, and put them in separate files without changing anything.
It works in the netCDF classic model.
It avoids the problem of having quantities with different units sharing a netCDF variable, which arises with the radial expression of vectors, and could arise with Cartesian components too if they were of different orders of magnitude. I think that changing the units attribute to something which is associated with a dimension does not fit well with netCDF or CF conventions. The existing CF convention is that a quantity must be in a different variable if it has a different unit, because it must have a different standard_name as well in that case.

Whether we should add complex numbers to CF in some way also depends on their being a use-case which is strong enough for CF to adopt, since we don't add things to CF solely because we can imagine the possible need for them.

shoyer commented 5 years ago

As a maintainer of domain agnostic software that interprets a subset of CF conventions, I would like to have a standard way to store arrays of complex values in a single netCDF file, that doesn't require adopting all of CF to be useful.

In xarray's data model, we can already represent multi-dimensional labeled arrays of complex values as a single variable, and these are what users want to store to netCDF. Most of these users don't particularly care about CF conventions (e.g., they are physicists not climate scientists), but I'd like them to be able to store "more standard" data to disk, so it has a better chance of being readable by other software.

So if we use multiple variables for complex values, I'd like to standards for:

How the separate components should be named, and what the combined variable should be named, e.g., imaginary_part_of_X and real_part_of_X correspond to the single logical variable X. This should not require adopting full standard names like imaginary_part_of_fourier_transform_of_air_pressure_wrt_time.
How attributes can be merged between different variables. How should software handle the inevitable situation where the real and imaginary part have different attributes, e.g., conflicting units? I don't see how domain agnostic software like xarray could handle scenarios like this.

The advantage of approach 2(c) is that it side-steps all of these issues, because it's impossible to use different metadata for the different components.

I'm sure there are valid use cases for using different units for real/imaginary components or storing data in polar form. I agree that for these use cases, it makes sense to store data in separate arrays, and I would encourage creating domain specific conventions for how to store particular sets of variables. But these feel like a different use case: data that could be represented as complex values, not data that always is complex valued.

dopplershift commented 5 years ago

Another point in favor of 2c, IMO, is that one can readily read in the data from a netCDF file and create a view of this data using a client language's native support for complex values with no copies.

shoyer commented 5 years ago

Ping @mike-dixon and @piyushrpt, who were involved in the discussion on the CF-Trac website.

piyushrpt commented 5 years ago

Just wanted to chime in. While waiting for resolution on this, the following features have been added to GDAL primarily with SAR geospatial datasets in mind:

NETCDF4 https://github.com/OSGeo/gdal/blob/master/gdal/frmts/netcdf/netcdfdataset.cpp#L287-L341

HDF5 https://github.com/OSGeo/gdal/blob/master/gdal/frmts/hdf5/hdf5dataset.cpp#L208-L263

These implement approach 2a. It allows us to work well with native numpy types in python as well as with complex values support in C++. We have been focusing more on using CF conventions with HDF5 as it allows us to float16 for our applications.

bnlawrence commented 1 year ago

I just had my attention drawn to this. By way of trying to move this along, a bit of a summary and restatement.

The original suggestion from @shoyer was that there were three options: a) compound data types, b) real and imaginary parts in separate arrays, and c) real and imaginary parts in a multi-dimensional array.

This was effectively because he had correctly rejected underlying library support (netCDF doesn't do it, effectively because HDF doesn't do it, yet).

A couple of quick comments

Obviously it would be better to rely on underlying library support, but while it looks like Unidata has decided this is an HDF problem (for very good reasons), and HDF has a recent feature request covering the same ground, I suspect I'll be retired before that happens (I was amused by Dave Allured's comment on the relevant unidata netcdf discussion that HDF has been struggling with this for decades ("we definitely do not want to wait until 2012").
So clearly it's not too late for us to do something.

Jonathan has said we need a use case, and cited cf-radial. Maybe we need to revisit that? But in any case, this feels like a real chicken and egg issue: we don't have a use case for complex numbers because we don't have support for them ...

So, revisiting the options,

option a) would probably be the most graceful, but involves a new data type, which is precisely what netCDF decided not to do themselves. One could construct an argument that we shouldn't for the same reasons, or conversely, that we should do it, precisely because we can. This option was originally rejected because netCDF didn't have support for compound data types - I think it does now (though someone more expert than me might comment).
option (b) involves new metadata and possible performance issues. It has been suggested that this would be similar to how we handle vector components (e.g. winds), but personally I think that's a stretch, we often use wind components, it would be less often that one woudl use part of a complex number without wanting the rest of it.
option (c) has considerable support, not only here but in the native netcdf discussion, it's "metadata clean" and likely more performant than (b).

My take on the discussion so far is that there is agreeement we should do this, and more support for (c) than the other options. We have a CF meeting in a couple of weeks, can we knock this one off there?

ZedThree commented 1 year ago

I'm a research software engineer from the plasma science community where we regularly use complex numbers, and I've seen all three conventions in use in production research software. I've also been trying to push some kind of solution forward in the last few months.

I strongly support option a, using a compound datatype. This is supported in the netCDF4 API, released in 2008, so should be widely available! Option b, split variables, and option c, a new dimension, are very unlikely to make it into either netcdf or HDF5 at the library level. The h5netcdf implementation, which uses h5py to write netcdf-compliant HDF5 files directly, also uses a compound datatype $^1$. The major advantage of using a compound datatype over a dimension is that this maps more easily to language-native representations without having to do (potentially complicated) type-casting to get an array with an extra dimension -- this is particularly evident in Fortran where it can't be done without copying or going through C.

However, the downside of picking any convention at the library level is that there will still be plenty of existing files (and software) using the others, as well as variations on the chosen format (mostly different names for the real/imaginary components).

This has been bothering me so much I've started work on a (proof-of-concept for now) drop-in extension to netCDF with C, C++, Fortran, and Python APIs. The idea is to handle all possible conventions for reading (and eventually appending) so that users and developers don't have to care about which one a particular file is using. It also writes native complex numbers to compound datatypes, checking if the type already exists in the file and creating it if not. The plan is to support reading as many existing representations as possible in order to make it completely painless for an application to switch convention.

h5py writes complex numbers using a compound datatype but doesn't commit this to the file, so the datatype lives in the variable itself. I recently added a feature to netCDF to be able to read these unnamed datatypes, and so from the next release there should be better compatibility between h5netcdf and the main netCDF implementation.

JonathanGregory commented 1 year ago

A compound data type would be OK for CF metadata if it was real and imaginary parts, but not if it was polar, because the two components have different units, and would need different standard names.

bnlawrence commented 1 year ago

A compound data type would be OK for CF metadata if it was real and imaginary parts, but not if it was polar, because the two components have different units, and would need different standard names.

Yes, I had meant to reference the original trac ticket where this part of the problem was originally discussed.

bnlawrence commented 1 year ago

I strongly support option a, using a compound datatype. This is supported in the netCDF4 API, released in 2008, so should be widely available! Option b, split variables, and option c, a new dimension, are very unlikely to make it into either netcdf or HDF5 at the library level. The h5netcdf implementation, which uses h5py to write netcdf-compliant HDF5 files directly, also uses a compound datatype 1. The major advantage of using a compound datatype over a dimension is that this maps more easily to language-native representations without having to do (potentially complicated) type-casting to get an array with an extra dimension -- this is particularly evident in Fortran where it can't be done without copying or going through C.

However, the downside of picking any convention at the library level is that there will still be plenty of existing files (and software) using the others, as well as variations on the chosen format (mostly different names for the real/imaginary components).

I am not sure we can get a nirvana here, the past is the past, and given we had no convention, I don't feel obliged to be backwards compatible, existing software will work with existing files, and new software can surely cope. I'm also not sure that we can hope to predict what HDFx will do, and when it will do it.

That said, I don't have a preference for (a) or (c) at the moment, I was just trying to reflect what seemed like a preference of (c) thus far. However, that preference may have been biased by the fact that when this ticket was open (a) would have been much harder than it is now.

davidhassell commented 1 year ago

I was wondering what compound types actually look like, and found this CDL example, which I've copied below. In particular, it shows a convention for how different units can be applied to different variable components.

netcdf odnc4 { // obs data in netCDF-4, inspired by example from Tom Kunicki
types:
  compound observation_t {
    int time ;
    float temperature ;
    float pressure ;
  }; // observation_t

  compound observation_att_t {
    string time ;
    string temperature ;
    string pressure ;
  }; // observation_att_t

dimensions:                // these two unlimited dimensions are independent
    station = UNLIMITED ;
    observation = UNLIMITED ;

variables:
    int station_id(station) ;
        station_id:standard_name = "station_id";
    float latitude(station) ;
        latitude:units = "degrees_north" ;
    float longitude(station) ;
        longitude:units = "degrees_east" ;
    float elevation(station) ;
        elevation:units = "feet" ;
        elevation:positive = "up" ;
    observation_t  observations(station, observation) ;
        observations:coordinates = 
                "observations.time longitude latitude elevation" ;
        observation_att_t  observations:units = 
                {"days since 1929-1-1 0:0:0", "degF", "hectopascal"} ;

// global attributes:
    :Conventions = "CF-2.0" ;  // or whenever netCDF-4 conventions approved
    :CF\:featureType = "stationTimeSeries" ;
}

JonathanGregory commented 1 year ago

Although I see the appeal, I think that units = {"days since 1929-1-1 0:0:0", "degF", "hectopascal"} would be problematic for CF to adopt, because it's not a UDUNITS-compatible string, so it would break existing software. What does it mean, actually? Is it a vector of strings assigned to the attribute?

bnlawrence commented 1 year ago

Would it really break existing software? If we are going to introduce compound types for our variable types, it's not obvious to me that folks would be trying to interpret the compound type units (as opposed to the component types) with udunits?

piyushrpt commented 1 year ago

This is just a summary of current implementation in GDAL to demonstrate how this can be done.

Here is the example HDF5 file used to test complex compound datatypes for Float{16,32,64}. The file can be found here: https://github.com/OSGeo/gdal/blob/c095015224753fc1f082b50c2d36cf7f1a74a4bd/autotest/gdrivers/data/hdf5/complex.h5

The relevant code for determining data types can be found here: https://github.com/OSGeo/gdal/blob/c095015224753fc1f082b50c2d36cf7f1a74a4bd/frmts/hdf5/hdf5dataset.cpp#L218

Example netcdf file: https://github.com/OSGeo/gdal/blob/c095015224753fc1f082b50c2d36cf7f1a74a4bd/autotest/gdrivers/data/netcdf/complex.nc

This is the relevant code in the netcdf driver, which interprets CF metadata while recognizing the data type correctly:

https://github.com/OSGeo/gdal/blob/c095015224753fc1f082b50c2d36cf7f1a74a4bd/frmts/netcdf/netcdfdataset.cpp#L354

JonathanGregory commented 1 year ago

It could be a problem for any software that scanned the variables in the file to find one with particular units, for instance, or which was harvesting metadata and expected units attributes to be scalar strings.

piyushrpt commented 1 year ago

My 2 cents having worked with a lot of SAR data in both HDF5 / netcdf formats ...

Complex numbers are a fundamental data type nowadays in scientific computing. Support for general compound datatypes is challenging and probably a harder challenge to reach consensus on. But, w.r.t complex numbers:

Compound data types (2 elements of same type) are useful - as shown in the examples above from GDAL
A single UDUNITS-compatible string attribute on the compound data type variable should be sufficient to describe complex numbers. This ensures compatibility with older software and also keeps implementation for data providers simple.

It might also become important to support certain specific datatypes - Float16, N-bit integers etc going forward considering the volume of SAR data that is going to be generated by new missions and increased desire from this user community to use xarray.

bnlawrence commented 1 year ago

@piyushrpt I'm not sure whether your point (2) above is arguing that the string has to be a single valued units string or not?

I've just dived into a rabbit hole about whether or not a complex number is a scalar, suffice to say people can earnestly argue that it is, and that it isn't. However, it certainly appears that if we are going to use complex numbers as scalars, then we can't have multiple units. So I think part of this discussion comes down to deciding whether for us, a complex number is a scalar. Recognising that in practice, communities we care about, may not treat it as one (i.e. they have different units for the different components).

Of course, CF does not have to support their conventions, but we should not get in the way of them being able to build on top of CF.

I guess my point is we should see what CFradial2 is doing now.

piyushrpt commented 1 year ago

I fully agree with what you say @bnlawrence and apologies for not clearly stating it.

Yes, I personally believe complex numbers are scalars in their original (real, imag) form - readily usable with spectral operations etc. Any other form (amp/ intensity, phase) is a derived form and can already be represented using scalar data types already, as in done in CFRadial. End users have to transform these other forms of complex numbers into (real, imag) for spectral operations anyway. Including support for fundamental std::complex<> types, in my opinion, is important and makes more sense than the other forms.

JonathanGregory commented 1 year ago

I too agree that a variable containing a complex number would be OK for CF if the whole variable could be described with CF metadata just as for any other kind of single variable i.e. a single units string, standard_name string, etc. We would need new standard names for complex quantities, of course, but that also should not present a problem.

bnlawrence commented 1 year ago

We discussed this today at a breakout session of the CF annual conference. Three options were considered:

a) compound data types,

b) real and imaginary parts in separate arrays

c) real and imaginary parts in a multi-dimensional array.

By way of an interim summary (before plenary discussion), we think that of the three options listed above, only the first (compound data types) and the third (array with trailing 2 dimensions for real and imaginary parts) should be considered. On balance there is probably a very slight preference for the array option as it might, for some languages, simplify handling the data in memory and it doesn't require implementors in "non-python languages" (netcdf4-python supports them directly) to have to do heavy lifting around the compound data types.

There is strong support for considering complex numbers to be scalars and allowing only one unit (not least because one can argue that complex numbers themselves do not have units, only the real part should). This of course could have implications for supporting applications which have a different conventions, which leads us to specific actions: we intend to a) further investigate what CFradial2 is doing, and what an evolution of that convention might do if we did have complex numbers, b) investigate why many radar communities are using HDF transient data types directly and avoiding NetCDF altogether, and c) look at other potential use-cases in the CF community, particularly around spectral transform models, time series analysis, and storing refractive indices. Who might use these things, and how urgent is it to them?

When we are clear on the implications of the choices (compound data types versus arrays, and scalars with at most one unit versus vectors with multiple units) we think the actual convention itself will be easy to propose and agree, but we will want to do it in conjunction with at least one actual use-case.

bnlawrence commented 1 year ago

@piyushrpt It might be that you are able to argue cogently against the assumption that arrays will be easier than compound datatypes for implementation (R was raised as a potential issue)? Also, if you have anything useful to say about why some radar communities have not followed the CFradial2 approach and bypassed hte lack of direct complex number support and have used transient types in HDF, that'd be really helpful. The group of us discussing this were aware of the practice, but did not have a detailed understanding.

bnlawrence commented 1 year ago

@ZedThree Can you please explain and/or point us to more detail about your netcdf implementation of reading what I presume are HDF transient data types? How would one write them if not using h5netcdf?

ZedThree commented 1 year ago

The long and short of it is that it should work completely automatically in the next version of netCDF. It's implemented in the C library, so should then be available in all other language wrappers immediately. It will allow reading files created by h5netcdf containing complex or bool variables, although they won't be converted to the language-native types (i.e. you'll get a structure with r and i components, but not a complex)

The technical details:

When creating compound types in general in HDF5, it's optional whether or not to commit the datatype to the file(/group). If it's not committed, then the definition of the datatype is stored inside the variable itself and isn't given a separate name -- this is a "transient type". Committing the type is an extra step that's completely optional when using HDF5, which is why I suspect many libraries and applications skip it. NetCDF, on the other hand, always commits the type and gives it a name.

When reading a file with HDF5 directly, the library doesn't actually care where a variable's datatype is stored, so querying the type of a variable always works. The type is either native/built-in, stored directly in the variable, or is a reference to a named type stored elsewhere in the file.

NetCDF reads all named types in a file immediately on opening it, stores them in some structure, and then when reading or querying a variable, looks up its type in this structure. So previously, when a variable had a transient type, this look-up failed.

My fix was basically to just fallback to reading the type from the variable.

Separately, I'm working on a tool to make both reading and writing to/from language-native complex numbers trivial, with APIs for C, C++, Fortran, and Python. I've not looked at R, but if this is a blocker, I can probably find an R expert to help work something out.

sadielbartholomew commented 1 year ago

Thanks for the summary of the discussion yesterday, @bnlawrence. I was in another Hackathon discussion session but would have really liked to have joined this one too, so it is useful to see your outline.

If someone wouldn't mind summarising this bit that wasn't covered, briefly, why was option (2) (which from the thread - particularly this comment - I believe is '[a] single UDUNITS-compatible string attribute on the compound data type variable') considered not a good option by the discussion group?

I also had one question in response to something you mentioned, namely:

There is strong support for considering complex numbers to be scalars and allowing only one unit (not least because one can argue that complex numbers themselves do not have units, only the real part should).

I am potentially being a devil's advocate here, but to me, something which has concrete given units (here, the real part) and something unitless (the complex part, in your hypothetical argument) do indeed have different units - like comparing 0 to any non-zero number, by analogy. For example, in our cfunits we consider (at present, though not that this can't be changed without good motivation) unitless objects as different:

>>> from cfunits import Units
>>> u1 = Units("m")  # taking metres as an example for the real part's units
>>> u1
<Units: m>
>>> u2 = Units()
>>> u2  # special unit-less object
<Units: >
>>> u1.equals(u2)
False

So if only one unit will be allowed for a complex number representation (and I feel like intuitively that seems like a very wise idea to me so would agree with the group yesterday, without having a specific use case in mind), does there need to be a way to specify whether the complex part has the same units as the real part or else is unitless, which I think should be the only two options possible, though for different applications we might want to allow either case rather than enforcing one, so providing configurability somehow (some sort of Boolean flag, maybe)?

bnlawrence commented 1 year ago

@sadielbartholomew Wrt your first question, I don't think option [2] from that thread was an option in itself, it was an additional comment on top of a recommendation to use compound data types. So it falls into the "treat complex numbers as scalars, and hence, by definition of a scalar, allow them to have only one unit" ...

... and that's really the bottom line wrt to your second question. I think the argument that only the real value has units is a bit of a red herring, and I probably shouldn't have included it.

bnlawrence commented 1 year ago

@ZedThree You said:

Separately, I'm working on a tool to make both reading and writing to/from language-native complex numbers trivial, with APIs for C, C++, Fortran, and Python. I've not looked at R, but if this is a blocker, I can probably find an R expert to help work something out.

This is presumably for writing to NetCDF or HDF files using compound data types under the hood?

davidhassell commented 1 year ago

@ZedThree wrote

Separately, I'm working on a tool to make both reading and writing to/from language-native complex numbers trivial, with APIs for C, C++, Fortran, and Python. I've not looked at R, but if this is a blocker, I can probably find an R expert to help work something out.

Would this tool work in conjunction, somehow, with whatever else we're using to create datasets (e.g. netCDF4-python in my case)?

Thanks, David

ZedThree commented 1 year ago

@bnlawrence Yes, exactly. It should also work with the other backing file types that support the netCDF4 format. I could also make it configurable to use a complex dimension when writing.

@davidhassell Yep, the Python API is built on top of netCDF4-python, so you can literally do import nc_complex as netCDF4 and not touch the rest of your code. For other languages, it will be similar, though sometimes function names may have to be changed.

It's currently proof-of-concept, but it does work and you try it out here. Packages coming soon!

bnlawrence commented 1 year ago

@davidhassell Yep, the Python API is built on top of netCDF4-python, so you can literally do import nc_complex as netCDF4 and not touch the rest of your code. For other languages, it will be similar, though sometimes function names may have to be changed.

Have you thought about a pull request directly into netCDF4-python?

In any case, for other readers coming to this thread, you should look at this link from @ZedThree above; it's a great summary of complex number issues. I think we'd be pretty silly to deviate from the analysis (and "blessed solution") he provides.

mike-dixon commented 1 year ago

Hi All

FYI - I want to let you know about our decisions in CfRadial.

We are storing spectra, which can be represented as:

power - i.e. a single scalar power and phase - i.e. 2 scalars a complex pair, also 2 scalars

For the complex case, we chose to store these as 2 separate variables.

We are now working on a CF convention for radar time series data - the so-called In-phase and Quadrature (I/Q) pairs.

Complex numbers apply here as well. We are adopting the same approach - 2 separate variables.

We will probably use the ancillary_variables attribute to indicate the connection between the 2 arrays.

Thanks Mike Dixon

On Fri, Oct 6, 2023 at 2:32 AM Bryan Lawrence @.***> wrote:

@davidhassell https://github.com/davidhassell Yep, the Python API is built on top of netCDF4-python, so you can literally do import nc_complex as netCDF4 and not touch the rest of your code. For other languages, it will be similar, though sometimes function names may have to be changed.

Have you thought about a pull request directly into netCDF4-python?

In any case, for other readers coming to this thread, you should look at this link https://github.com/PlasmaFAIR/nc-complex from @ZedThree https://github.com/ZedThree above; it's a great summary of complex number issues. I think we'd be pretty silly to deviate from the analysis (and "blessed solution") he provides.

— Reply to this email directly, view it on GitHub https://github.com/cf-convention/discuss/issues/369, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABTOD74UJ6UFZD5ZEPVN3BTX5662NAVCNFSM4IVKCMG2U5DIOJSWCZC7NNSXTN2JONZXKZKDN5WW2ZLOOQ5TCNZVGAYTSNJTGUYA . You are receiving this because you were mentioned.Message ID: @.***>

davidhassell commented 1 year ago

Hi Mike,

We will probably use the ancillary_variables attribute to indicate the connection between the 2 arrays.

My initial thought on this is it would be better to use a different attribute, rather than to overload an existing one with a use for which it wasn't intended ... but it'd be good first to make sure that I understand what you have in mind. Could you possibly post a CDL snippet giving an example?

Many thanks, David

mike-dixon commented 1 year ago

Hi David I would welcome the addition of attributes specifically for this purpose. Mike

On Thu, Oct 12, 2023 at 7:42 AM David Hassell @.***> wrote:

Hi Mike,

We will probably use the ancillary_variables attribute to indicate the connection between the 2 arrays.

My initial thought on this is it would be better to use a different attribute, rather than to overload an existing one with a use for which it wasn't intended ... but it'd be good first to make sure that I understand what you have in mind. Could you possibly post a CDL snippet giving an example?

Many thanks, David

— Reply to this email directly, view it on GitHub https://github.com/cf-convention/discuss/issues/369, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABTOD73YOIZGSQVUTUCUBYTX67XTFANCNFSM4IVKCMGQ . You are receiving this because you were mentioned.Message ID: @.***>

JonathanGregory commented 10 months ago

Dear @mike-dixon

On 6th October, you wrote

Complex numbers apply [for radar in-phase/quadrature pairs] as well. We are adopting the same approach - 2 separate variables. We will probably use the ancillary_variables attribute to indicate the connection between the 2 arrays.

I think that choice will work well in CF. @davidhassell replied on 12th

My initial thought on this is it would be better to use a different attribute, rather than to overload an existing one with a use for which it wasn't intended ... but it'd be good first to make sure that I understand what you have in mind. Could you possibly post a CDL snippet giving an example?

meaning a new attribute, other than ancillary_variables, might be defined to link the arrays, and (just above) you said you like that idea. Please could you give an example use-case in CDL to move this discussion forward?

Best wishes

Jonathan

ChrisBarker-NOAA commented 10 months ago

to be clear for me, when you say:

""" a complex pair, also 2 scalars """

Is that two complex numbers, or one complex number (which requires 2 values to represent. I think you mean the latter, but want to make sure.

Anyway, conceptually, a complex number IS a scalar -- so to the degree possible, it would be really great to treat it as such in CF. If that is not possible to do properly (a complex type). and we need to use two variables, I think they should be defined specifically as a complex number, rather than by overloading an existing concept.

Which ai think is what @davidhassell is suggesting.

davidhassell commented 10 months ago

Hi @ChrisBarker-NOAA,

Is that two complex numbers, or one complex number (which requires 2 values to represent. I think you mean the latter, but want to make sure.

I think the latter, too, i.e. a complex number is conceptually a scalar, but in netCDF needs two real numbers to represent it, one of which is implicitly multiplied by i.

If that is not possible to do properly (a complex type). and we need to use two variables, I think they should be defined specifically as a complex number, rather than by overloading an existing concept.

Yes. Although pending @mike-dixon's CDL CF-radial example, I'm not yet sure what the new structure would look like.

ZedThree commented 10 months ago

I'd just like to plug my nc-complex library again. The C API is now production ready, and support for it has been integrated into netcdf4-python directly and will be available in their next release. I'm currently working on the C++ and Fortran APIs, and they will be available soon.

The library supports using a dimension or a compound type to represent complex numbers (and multiple different conventions), and is completely interoperable with the standard netCDF library. It's also trivial to use: switch the nc_ prefix to pfnc_ on a small number of functions, for example: pfnc_get_var_double_complex to read complex data, or pfnc_def_var with PFNC_DOUBLE_COMPLEX to define a new complex variable.

I haven't yet added support for complex numbers as two separate variables, as there are some difficult questions like how to represent the name of the variable, should attributes be merged, and so on. A lot more functions would also need to be wrapped in order to handle accessing two variables as a single one.

@mike-dixon Is there some discussion available somewhere as to why CF-Radial has chosen to use two variables? I'd like to understand the benefits of using them over a separate dimension or a compound datatype.

ajelenak commented 9 months ago

FYI... HDF Group has just published an implementation RFC for float16 and complex number datatypes in libhdf5. You can provide comments or suggestions at https://github.com/HDFGroup/hdf5/discussions/3339.

ChrisBarker-NOAA commented 9 months ago

HDF Group has just published an implementation ... complex number datatypes in libhdf5

I think this is great news! Given that we haven't yet established a CF convention for complex yet, it seems the obvious thing to do is adopt the HDF complex -- assuming it gets into netcdf soon. :-)

As for CF-Radial -- I'm still confused about whether they are storing complex numbers, or 2D-values-that-have-a-magnitude-and-direction (now that I wrote that -- 2D vectors), which can be mathematically represented as complex. Which seems to me a different thing. For example, in my line of work, we often need to represent wind speed and direction -- but I've never seen anyone call that a complex number. And if CF, we don't use the speed-direction form, standardizing on the eastward-northward form instead.

ZedThree commented 4 months ago

A brief update, as a couple of things have happened recently:

netcdf4-python can now read and write complex numbers. There's a new auto_complex argument to netCDF4.Dataset which will automatically convert complex numbers to/from a compound type.

The RFC to add native complex numbers to HDF5 is progressing, and is likely to be in version 1.15. It will also have support for no-op conversions to/from compound datatype representations.

This hopefully gives us some more pressure to add native complex numbers in netCDF itself. My nc-complex library is an extension that adds support for complex numbers for C and C++. I need to find some time to add Fortran support too.

cf-convention / discuss

A convention for complex numbers? #369