TEOS-10 / python-gsw

Python implementation of the Thermodynamic Equation Of Seawater - 2010 (TEOS-10)
MIT License
45 stars 22 forks source link

Python implementation of TEOS-10: comparison, strategy #26

Closed DamienIrving closed 7 years ago

DamienIrving commented 7 years ago

I was talking to the authors of TEOS-10 (McDougall and Barker) and they were saying that there are errors in the Python implementation of TEOS-10 contained in this repo. They say that the only trustworthy versions of TEOS-10 code (for any language) are found at the TEOS-10 website (i.e. and not on GitHub at the TEOS-10 organisation page).

Since they have a C version of the code available at the TEOS-10 website, I'm wondering if anyone is planning to write a python wrapper for it?

In the meantime, should this repo be taken down? I know people (including myself) who install gsw as follows, completely unaware of any errors with the code:

conda config --add channels conda-forge
conda install gsw 
durack1 commented 7 years ago

@DamienIrving do you have some more specifics about the "errors"?

DamienIrving commented 7 years ago

@PaulMBarker - Are you able to help out with @durack1's question. Do all the functions have errors, or just some of them?

efiring commented 7 years ago

The C version is a scalar translation of the Fortran-90 version, so I don't think it forms a good basis for Python wrappers. I have no confidence that it will even be maintained. The python version is based on earlier versions of the matlab code, and is incomplete compared to the matlab code. I am not aware of actual errors in the python version, however, other than those I documented in #10--which reflect differences between versions of the matlab code. The python version needs considerable work, but not deletion from github.

Edit: as noted below, I have changed my mind about wrapping the C version; we may end up doing this in the Python version for a large subset of the functions, including all of the core polynomial evaluations.

PaulMBarker commented 7 years ago

We are looking for someone to maintain the C version, even if it requires us to pay them. I have had reports that the C version works with Python and it passes gsw_check_functions, I have no idea if that is true for python.

I received this a few months ago: "Among the differences between the python and the C gsw, for instance, I realized that the python code still uses a 48 polynomial to derive specific volume (versus the 75 one in the C code) resulting in density differences of ~5e-4."

When a package passes check_functions I am more than willing to add a zipped version of it to the TEOS-10 website for distribution.

The ONLY place where software should be downloaded from is the TEOS-10 website. We have a responsibility that any software we distribute is correct. (R is the exception to this and I know I need to add a link to the R version but I have been flat out.)

Paul.

castelao commented 7 years ago

I agree with @efiring that it's not a reason to delete the repository. Python-GSW is classified as beta, right? We should be able to handle this in the traditional way, opening issues and PRs.

What are the confirmed errors? Do we have an issue opened for each of those known errors with details?

@efiring , how can I help you? I can figure out some time to help with this.

efiring commented 7 years ago

@castelao The fundamental problem is that TEOS-10 is an enormous conglomeration of code and documentation, that has gone through several versions via massive changes. The Python implementation has not kept up. There is some discussion of this, including identification of failing checks, in #10. To date, neither @ocefpaf nor I have had the rather large chunk of time that would be required to bring everything up to version 3.05. In addition, there is probably still some work needed on tools for automatically translating the documentation from the Matlab to the Python--ideally. In the spirit of patching things up rather than trying for a complete and clean solution, it seems we need to identify exactly which functions need to be replaced entirely because they use newer algorithms, and then which ones need to be modified to use these new versions. The core information may already be in the discussion on #10. The starting point is an existing PR, #7, which is a first crack at translating the new core algorithm, but which needs quite a bit of work before it will be ready.

PaulMBarker commented 7 years ago

Being that it is a beta, until this work is done, how do we stop users downloading it and using it as though it is "blessed" code?

efiring commented 7 years ago

Building a python version via wrappers around the C or Fortran is always an option. Of the two, the C is easier to deal with, especially if one wants to be able to compile on Windows. A reasonable approach would be to come up with an engine for generating vectorized cython wrappers to the scalar C functions that are needed by users. This should not be terribly difficult, because there are only a few signatures that need to be supported. But it is still an almost total restart--a new project--and it has the disadvantage of being harder to build and install than the present pure-Python approach.

efiring commented 7 years ago

@PaulMBarker we don't. That is just not the way things work, and not the way to make progress. Really, how much damage is being done by people using the 48 polynomial that used to be "blessed"? If you really think this is a crisis, and you want to take action, you could ask @ocefpaf to stop packaging it for conda.

durack1 commented 7 years ago

@efiring @castelao just FYI I have been keeping track of the versioning of the GSW-Matlab releases in my personal repo https://github.com/durack1/GSW-Matlab/releases. This should provide a way of determining changes dating back to v1.0 and most importantly all the incremental changes within the v3.05 release (there have been 8 updates to date)

ocefpaf commented 7 years ago

Apologies for not addressing the concerns here earlier. (I am moving to another city and I am out for a few weeks.)

Like @efiring mention I do not have time for this project due to my day job. I would love to work on python-gsw again as this has a special place in my heart. (gsw was my first Python project back when I was learning Python from awesome expert developers like @efiring and Bjørn Ådlandsvik.)

I failed to bring more developers to this project and the development is halted for many years now. However, even incomplete and outdated, there are many projects out there that rely on this library, and I am reluctant to remove it. I would love to find funds to move this project forward though.

So my suggestion is to keep the unofficial status, maybe add a caveat note until I can find funds or extra time to bring tis library up to date with latest Matlab gsw.

PS: For those interested in the c wrapper I suggest you contact @lukecampbell and ask about his goals for that library. Maybe he can take up on the offer to updated it.

durack1 commented 7 years ago

@PaulMBarker if I was to vote, I believe that if any funds are available to update the TEOS-10 libraries, python (and not the C or fortran) is the language that is prioritized.. Removing the need for compilation (in the case of C or fortran) and also having a platform independent (and open source, rather than matlab licensed) version is a far superior solution

richardsc commented 7 years ago

I realize this is probably not the appropriate place for this discussion (i.e. we've gone off topic, which I tend to discourage in Github issues for repos that I myself maintain ...), but I think this is a great discussion.

If priority was being assigned to a particular TEOS-10 language implementation, I'd disagree with @durack1 that python should be prioritized, but rather believe that maintaining and fully implementing the TEOS work in C would be the most beneficial to the entire community. This is not because I dislike python (python is great!) or because I mainly use R (I also love R!) but because each of those languages provide the facility for including C code (to my understanding Matlab can do this too, but I do agree with @durack1 about the license issue).

Yes, compilation is potentially a barrier, but only for developers and testers, and shouldn't affect downstream users. E.g. for R, installing packages that include C code through the regular package distribution system (CRAN) enables access to pre-compiled binaries for each platform (this is the case for the GSW-R library). My understanding of python package managers is limited, but I understood that conda and the like do the same (please correct me if I'm wrong).

Another reason that I understood for supporting "lower level" implementations is that they are then available to be included directly in primitive equation models (not likely to be coded in python or another interpreted language). Of course, many commonly used models are in Fortran, which complicates that argument ...

durack1 commented 7 years ago

@richardsc interesting perspective, which I have trouble disagreeing with.. I do note however that there are 4045 noted downloads for the conda-forge gsw implementation which is a pretty decent number (and 10/11 repo forks versus 2 each for fortran, c and R, and 1 for JS).. So it certainly does appear to be the highest volume language variant. I'd be curious about the total download count for the other package flavours.. I wonder if @PaulMBarker has these stats from the TEOS-10 homepage to interrogate?

PaulMBarker commented 7 years ago

What Clark said is why we have prioritised C over Python. C is even more important than Matlab, as matlab can happily call C.

The path to C is Matlab to Fortran then to C. As Clark mentioned ocean models are in Fortran so it is very important to us that we have GSW in Fortran and we are very fortunate to have access to Glenn.

I am not all that worried about the python calling the 48 term equation for specific volume rather than the 76 term version, but I need to know that all of the functions included in GSW(Python) passes the check values for that version. Does it 100 % reproduce Absolute Salinity from Practical Salinity for the entire ocean, including the Baltic Sea ? If it doesn't then the conda package must be removed.

PaulMBarker commented 7 years ago

FYI, the rough current downloads are over 7000 Matlab, about 1000 Fortran, and 700 in C.

richardsc commented 7 years ago

From the CRAN stats (see the README.md at the bottom of the GSW-R page) there are some download statistics, but I think a great deal of those are for distribution out to the various CRAN mirrors. Of course, in some sense a download (or really, installation) of GSW-R should count as a download of GSW-C, since the R package is really just wrappers around the C functions.

However, as GSW-R is a required package for the oce package we know that every download of oce also installs GSW-R automatically (and the TEOS-10 framework is integrated very tightly into the objects and analysis methods of that package). While the oce download numbers are also wildly high, @dankelley and I are fairly confident that it has been downloaded > 1000 times over the last few years, as it is used extensively in several large government research labs, is part of the grad/undergrad oceanography curriculum at at least one university, and has been the focus of several advanced workshops on oceanographic analysis (usually 30-40 people per workshop).

efiring commented 7 years ago

On 2017/02/24 2:42 PM, PaulMBarker wrote:

Does it 100 % reproduce Absolute Salinity from Practical Salinity for the entire ocean, including the Baltic Sea ?

Paul, what is the test function you are referring to here?

efiring commented 7 years ago

We have some test jigs. In one of them, we run a bunch of functions with the same inputs in octave, using the matlab v3.04 version, and in python. With that test, we get the same results between octave and python for SA_from_SP and for SA_from_SP_Baltic. The Baltic test data are:

              SPb=np.array([6.5683, 6.6719, 6.8108, 7.2629, 7.4825, 10.2796]),
              latb=np.array([59., 59., 59., 59., 59., 59.]),
              lonb=np.array([20., 20., 20., 20., 20., 20.])

The non-Baltic test data are:

              p=np.array([10., 50., 125., 250., 600., 1000.]),
              SP=np.array([34.5487, 34.7275, 34.8605, 34.6810, 34.5680,
                           34.5600]),
              lat=np.array([4., 4., 4., 4., 4., 4.]),
              lon=np.array([188., 188., 188., 188., 188., 188.]),
PaulMBarker commented 7 years ago

Eric, that looks promising. What we really need to do is convert say a climatology with SA_from_SP with the matlab or octave code and compare it to the python converted version. I pretty much always use Gouretski and Koltermann when I do any tests.

If that matches then we need to check CT_from_t and specific volume.

efiring commented 7 years ago

Using a whole climatology for this test seems like overkill, doesn't it? Are you doing that for all of the other versions?

We are already running a specvol_t_exact test, which passes. I just now added a CT_from_t test, and that passes also. Of the octave comparison tests we are running, the only one that doesn't give an exact match is t_from_rho_exact, but the difference is at the level of the double precision limit, so I don't think it should be considered a failure:

t_from_rho_exact: Failed
octave:
[ 28.76086681  28.42682041  22.83462831  10.29431838   6.94630684
   4.48488926]
python:
[ 28.76086681  28.42682041  22.83462831  10.29431838   6.94630684
   4.48488926]
python - octave:
[ -4.01456646e-13   5.68434189e-14   2.45137244e-13   2.39808173e-13
   1.05337961e-12  -5.08926234e-13]
PaulMBarker commented 7 years ago

Yes it is probably is overkill but this is my standard test for spatial varying data.

For the functions using gibbs or based on the polynomial fit for specific volume I use a very fine 3d gridded data set covering the entire TEOS-10 range.

Every time I have not done this full testing it has bitten me on the bum at a later date.

Once the code to make the data set up is written, it only takes a 20 -30 minutes to run these tests. In developing the code, which forms a basis for so much of everyone's work, I always felt a responsibility to the oceanographic community to spend the time running them. 99% of users run the code as a black box, where numbers go in and a number comes out and they trust us the numbers are correct.

PaulMBarker commented 7 years ago

I forgot to add, in the gsw data set there are values for how accurate a calculation should be, these are listed as _ca where is the function name.

efiring commented 7 years ago

One of our test jigs attempts to translate each of the test functions in the matlab gsw_check_functions.m into equivalent python, and run it. At present it successfully translates and runs 116 of those tests with no errors; 5 have discrepancies as noted in #10; and the remainder don't run either because we don't have the corresponding python function or because the automatic test translator can't handle the test.

Where is the complete, runnable code plus data for doing your more extensive climatology tests in any of the languages?

lukecampbell commented 7 years ago

Hey Folks, I brought the TEOS GSW C Library to github and onto the packager homebrew for mac users. This was in line with use for the Ocean Observatories Initiative when I started. The version I integrated was 3.0.1, there's a newer version 3.0.5 which I can integrate as well at some point if there's interest.

The python wrapper around this C library is available on PyPI: https://github.com/lukecampbell/pygsw

The problem with the wrapper and the C library, if I recall correctly, there are several functions missing from this that are part of the MATLAB library. I haven't touched it in years so my memory of it is foggy.

Edit: The original link was to my fork of python-gsw, this link is to the right repo

efiring commented 7 years ago

The problem with that wrapping strategy is that it yields scalar functions in Python. I think the way to do this is to use the C functions to make numpy ufuncs, and then do a little additional wrapping of those. What I have in mind would be automated via code generators, so it would not require hand work on each of the 179 functions presently in the C version.

lukecampbell commented 7 years ago

I think that's what I have here: https://github.com/lukecampbell/pygsw/blob/master/pygsw/vectors.py

lukecampbell commented 7 years ago

But to be honest, I don't remember working on this at all, so I'm not sure if it wraps it correctly, if it works or not.

efiring commented 7 years ago

Using np.vectorize does provide support for array inputs, but it generates a Python for-loop with a Python function call for each array element, so it is slow. It is certainly a valid way to get something based on the C code working, but I think we can do much better. (I am assuming that each C function call is fast enough that all the Python-level for-loop and function overhead dominates the execution time, but I haven't tested this. If my assumption is incorrect, then your style of wrapping would be fine, but I would still want to do it via a code generator, which makes it easier to stay in sync with changes in the C and Matlab versions.)

dankelley commented 7 years ago

In case it's of any help, in the R implementation, we do e.g. (https://github.com/TEOS-10/GSW-R/blob/master/R/gsw.R#L236)

gsw_alpha_on_beta <- function(SA, CT, p)
{
    l <- argfix(list(SA=SA, CT=CT, p=p))
    n <- length(l[[1]])
    rval <- .C("wrap_gsw_alpha_on_beta",
               SA=as.double(l$SA), CT=as.double(l$CT), p=as.double(l$p),
               n=n, rval=double(n), NAOK=TRUE, package="gsw")$rval
    if (is.matrix(SA))
        dim(rval) <- dim(SA)
    rval
}

which is R code calling a C wrapper. And that wrapper is defined as (https://github.com/TEOS-10/GSW-R/blob/master/src/wrappers.c#L160)

W3(wrap_gsw_alpha_on_beta, gsw_alpha_on_beta, SA, CT, p, n, rval)

which calls a three-argument C macro defined as (https://github.com/TEOS-10/GSW-R/blob/master/src/wrappers.c#L116)

#define W3(wname, cname, arg1, arg2, arg3, n, rval) \
void (wname)(double *(arg1), double *(arg2), double *(arg3), int *(n), double *(rval))\
{\
    for (int i=0; i < *(n); i++) {\
        (rval)[i] = (cname)((arg1)[i], (arg2)[i], (arg3)[i]);\
        if ((rval)[i] == GSW_INVALID_VALUE) {\
            (rval)[i] = NA_REAL;\
        }\
    }\
}

The upshot of this is that writing wrappers can be quite an easy coding task if you use wrappers. After this, the C code is just what we get from the gsw, e.g. https://github.com/TEOS-10/GSW-R/blob/master/src/gsw_saar.c (although I think I had to modify this and other C files very slightly, e.g. R does not want direct printing so I think I had to change some fprintf() calls to the R variant, which captures output and sends it either to the terminal or a gui, as appropriate).

I did this mainly because loops in R are slow, but the other benefit is that the base code, in the C files, is almost entirely unmodified. I am scared of transliterated code, although in this teos case it's not very scary because sensible languages like R and python can have test suites.

castelao commented 7 years ago

@PaulMBarker , what's the difference between the 75-terms and 76-terms formulation? If those are the same thing, which is the correct one?

PaulMBarker commented 7 years ago

it was a typo, the correct number is 75.

PaulMBarker commented 7 years ago

I tried to upload the test data set for SA_from_SP but it kept failing. I have ftp'ed it to the teos-10 site http://www.teos-10.org/software/SA_from_SP.zip The file is a zipped csv file containing (p,long,lat,SP,SA).

I have not been closely monitoring the python translation but I just noted there is an entry titled "Inland lat/lon returns NaN. Should use limnological equations" I hope that python is not returning NaN's for profiles that could be in lakes.

efiring commented 7 years ago

Paul, that link doesn't seem to be accessible at all. Can you use Google Drive or Dropbox, so I could pick it up from there? If so, or if not, best to arrange this or an alternative via personal email.

efiring commented 7 years ago

@PaulMBarker I have made the comparison; see http://currents.soest.hawaii.edu/efiring/for_TEOS-10/ for all files, and http://currents.soest.hawaii.edu/efiring/for_TEOS-10/compare_python_wrapped_c.html for code, description, and results. The main point of interest is that there is a swath of points in the Caribbean where there are small differences between the Matlab and the C and Python; the C and Python agree with each other much better than they do with the Matlab. I suspect you will see the map of that swath, and will be able to say, "Oh yes, there was a subtle change between 3.04 and 3.05 affecting that region." Or something like that. I implemented the SAAR code from one of the earlier Matlab versions (3.03, maybe). Frank also started the C from earlier (Fortran) versions, and evidently did not get everything updated to perfectly match the 3.05.

(For anyone looking at this several months from now, there is no guarantee that the web URLs given above will still be there; they are temporary, for near-term use.)

efiring commented 7 years ago

I have changed the title of this because it was misleading; no significant errors were identified in the discussion.

efiring commented 7 years ago

Last point here before I close this: I have the start of a new implementation based on wrapping the C: https://github.com/efiring/python-gswc. So far, it automatically wraps 120 functions, complete with numpydoc docstrings. Some of these are not user-level functions, but nevertheless, I find it encouraging.