TEOS-10 / GSW-R

R implementation of the Thermodynamic Equation Of Seawater - 2010 (TEOS-10)
http://teos-10.github.io/GSW-R/

problem with qsort_r #41

Closed: dankelley closed this issue 7 years ago

dankelley commented 7 years ago

I tried submitting the present version to CRAN, but got the complaints below. The CRAN maintainers are quite stringent about such problems, and have asked me to address this "ASAP". (I was surprised that the complaint was not also about qsort_s, since that doesn't seem to be POSIX, either.)

qsort_r is not a standard C (nor POSIX) function: §1.6.4 of the manual told you to test for non-C functions. And your platform-specific code in the C file using this is not in the spirit of the CRAN policy:

'Package authors should make all reasonable efforts to provide cross-platform portable code.'

You cannot know what future versions of the OSes you pick out will provide!

Similarly, you should not be including a guessed prototype but using the system header: that seems to be stdlib.h, but glibc requires _GNU_SOURCE (or perhaps similar for older versions) to be defined.

dankelley commented 7 years ago

It seems to me that the simplest scheme (unless someone who maintains GSW-C wants to get involved, providing a POSIX-compliant solution...) may be to patch gsw_oceanographic_toolbox.c to comment out the existing code (including the #ifdefs and the function declarations) and to switch to using the ordering functions that R provides for use within associated C code.

Upside: This avoids what I called a "rolling fog" in a previous issue report (#24), by simply letting the CRAN maintainers worry about compatibility across the 12 build systems (and I think we must multiply that number by either 2 or 3, to account for the versions of R on which things are expected to work). Doing that should prevent GSW-R from getting a poor reputation with CRAN, caused by rapid resubmissions of problematic code.

Downside: It complicates the GSW-R documentation a bit, because it means we can no longer say we use the GSW-C code as-is, but rather that we use that code with a patch replacing the qsort_r/qsort_s calls with calls to an R sort function.

The R sort function that seems right is R_qsort_I, which is described at https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Utility-functions. It does not handle NA values, but then I don't think gsw_util_sort_real does, either.
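
To make this concrete, a call might look roughly like the sketch below. This is a hypothetical sketch, not code from GSW-R: the wrapper name sort_real_with_R is invented, and it assumes the R_qsort_I interface described in Writing R Extensions (an in-place sort of v[i:j] with 1-based bounds, permuting a caller-initialized integer index array alongside the data).

    /* Hypothetical sketch (not GSW-R code): replacing the qsort_r call
    ** in gsw_util_sort_real with R's R_qsort_I. */
    #include <stdlib.h>
    #include <string.h>
    #include <R_ext/Utils.h>        /* declares R_qsort_I */

    static void sort_real_with_R(const double *rdata, int nx, int *iarray)
    {
        /* R_qsort_I sorts in place, so work on a scratch copy of the data */
        double *scratch = (double *) malloc(nx * sizeof(double));
        memcpy(scratch, rdata, nx * sizeof(double));
        /* initialize the index array to the identity permutation (1-based);
        ** R_qsort_I permutes it alongside the values as it sorts */
        for (int k = 0; k < nx; k++)
            iarray[k] = k + 1;
        R_qsort_I(scratch, iarray, 1, nx);  /* sort scratch[1..nx], 1-based bounds */
        free(scratch);
        /* iarray now holds the (1-based) sort order; note the stability
        ** caveat raised below */
    }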

hylandg commented 7 years ago

Dan,

You need to take note of the comment in gsw_util_sort_real

    /*
    **  Note that the library functions using this utility
    **  depend on the fact that for replicate values in rdata
    **  the indexes are returned in descending sequence.
    */

which means that the sort routine you choose needs to be stable (i.e., it should preserve the order of items of equal value). The documentation for R_qsort_I states

... Note that the ordering is not stable, so tied values may be permuted.

dankelley commented 7 years ago

Excellent point, Glenn. Thanks!

efiring commented 7 years ago

In my understanding, the quicksort algorithm is generally not stable, in contrast to heapsort, for example; so either glibc qsort_r is not actually quicksort and is stable (unlike glibc qsort), or the C code is not consistent with the comment. https://www.gnu.org/software/libc/manual/html_node/Array-Sort-Function.html

hylandg commented 7 years ago

qsort_r, which may or may not use quicksort internally, is not stable (neither is heapsort). But by using an appropriately defined lexicographic key-comparison function it can be made so - this is the role of the function compare (see below) in gsw_util_sort_real.c. The Fortran version (which uses quicksort + heapsort + insertion sort) uses the same technique.

static int compare(const void *p1, const void *p2, void *rarray)
{
    double  *rdata = rarray;
    if (rdata[*(int *)p1] < rdata[*(int *)p2])
        return (-1);
    if (rdata[*(int *)p1] > rdata[*(int *)p2])
        return (1);
    /*
    **  Note that the library functions using this utility
    **  depend on the fact that for replicate values in rdata
    **  the indexes are returned in descending sequence.
    */
    if (*(int *)p1 < *(int *)p2)
        return (1);
    return (0);
}

efiring commented 7 years ago

My recollection of heapsort's stability was wrong, but I think there is still an error in the code.

From the manpage on OS X: "The algorithms implemented by qsort(), qsort_r(), and heapsort() are not stable; that is, if two members compare as equal, their order in the sorted array is undefined. The mergesort() algorithm is stable."

From the manpage on linux: "The comparison function must return an integer less than, equal to, or greater than zero if the first argument is considered to be respectively less than, equal to, or greater than the second. If two members compare as equal, their order in the sorted array is undefined."

That implies to me that the only way to make the sort stable is to ensure that the compare function cannot return 0. In other words, that last return statement should be return (-1);.

hylandg commented 7 years ago

Yes you are right - the last return should be return (-1);.

Note that this technique using the compare function only works because we are sorting an array of indices and not the actual data array rarray. Since rarray remains untouched, we can make the index comparison in function compare.

dankelley commented 7 years ago

I wonder what @hylandg and @efiring (and others) think of the following. It could be a solution to both (a) the complex system of #ifdefs, which could prove brittle if GSW-C is ported to other machines, and (b) the particular problem I have in the R case (as noted in the initiating comment of this issue).

My idea is to use the bog-standard qsort, which I think we can be sure is on all machines, and acts identically on them. (It goes back many decades and has been used a lot by a lot of people.) The only change is to pass a structure to it (and to its comparison function). That structure would hold the value, paired with the initial index. Then, we just need to order by the value if possible, and if there's a tie, order by the initial index. I think that this makes qsort stable, and it seems to be the strategy that gets recommended on discussion boards like stackexchange.

I made a little test case, and am putting it and its output below. The test data are burned into the C code, and were created by R (so I could check the results against R). I've not done much testing, but with this I get what I'd expect when I order this particular test case by hand. Any thoughts, Glenn or Eric (or others)? It would be quite sweet if GSW-C could get rid of the #ifdef directives and the use of three different ways of sorting (qsort_s and the two competing GNU and BSD definitions of qsort_r).
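
For reference, here are the two incompatible qsort_r declarations, as given in the glibc and BSD manual pages (this is the incompatibility that the current #ifdefs paper over):

    /* glibc (requires _GNU_SOURCE): the context pointer comes last,
    ** and the comparator receives it as its third argument */
    void qsort_r(void *base, size_t nmemb, size_t size,
                 int (*compar)(const void *, const void *, void *),
                 void *arg);

    /* BSD/macOS: the context pointer precedes the comparator, which
    ** receives it as its first argument */
    void qsort_r(void *base, size_t nmemb, size_t size, void *thunk,
                 int (*compar)(void *, const void *, const void *));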

Here's the code:

#include <stdlib.h>
#include <stdio.h>

//#include "a.dat"
int nx=10; // from a.R
double x[10] = {4,3,2,2,4,4,1,2,2,2}; // data, from a.R
int e[10] = {7,3,4,8,9,10,2,1,5,6}; // expected order, from a.R
// above from a.dat (the only point of this is to easily try different test cases)

typedef struct {
    double x;
    int i;
} XI;

int compareXI(const void *a, const void *b)
{
    XI *A = (XI*)a;
    XI *B = (XI*)b;
    if (A->x < B->x) 
        return(-1);
    else if (A->x > B->x) 
        return(1);
    else if (A->i < B->i) 
        return(-1);
    else if (A->i > B->i) 
        return(1);
    else {
        fprintf(stderr, "programming error in compareXI function\n");
        exit(1);
    }
}

int main(int argc, char **argv)
{
    // Fill up structure with data and initial index (the latter in C notation)
    XI* xi = (XI*)malloc(nx*sizeof(XI));
    for (int i = 0; i < nx; i++) {
        xi[i].x = x[i];
        xi[i].i = i;
    }
    // Sort this structure, comparing values first and then comparing
    // indices, if values are identical.
    qsort(xi, nx, sizeof(XI), compareXI);
    printf("%20s %20s %20s %20s %10s\n", "x", "e", "xi.x", "1+xi.i", "OK?");
    // Show the results. We expect the fourth column to equal the second.
    for (int i = 0; i < nx; i++) {
        int OK = e[i] == xi[i].i + 1;
        printf("%20.1lf %20d %20.1lf %20d %10s\n", x[i], e[i], xi[i].x, 1+xi[i].i, OK?"yes":"no");
    }
    free(xi);
    return 0;
}

and here's the output (which I think demonstrates correctness, albeit on just one little test)

                   x                    e                 xi.x               1+xi.i        OK?
                 4.0                    7                  1.0                    7        yes
                 3.0                    3                  2.0                    3        yes
                 2.0                    4                  2.0                    4        yes
                 2.0                    8                  2.0                    8        yes
                 4.0                    9                  2.0                    9        yes
                 4.0                   10                  2.0                   10        yes
                 1.0                    2                  3.0                    2        yes
                 2.0                    1                  4.0                    1        yes
                 2.0                    5                  4.0                    5        yes
                 2.0                    6                  4.0                    6        yes

NOTE I have not looked into stitching this into the GSW-C code, but I'd be happy to do that if Glenn and Eric think it's worth doing but don't want to bother trying it themselves. I realize that I'm the one pressing for this change, because of the "fix ASAP" instruction I got from the CRAN organizer. But changing base GSW-C would seem to me to be a smart idea, if my code is along the right lines, because removing #if directives is always a good idea, in terms of porting to new machines (or old ones).

dankelley commented 7 years ago

To explore the idea further, I forked GSW-C and switched the code to use this proposed method. The results are at https://github.com/dankelley/GSW-C

What I have compiles, and gsw_check still gives a "pass", but I've no idea whether that means anything since I've not looked in detail at the checks done by gsw_check.

I could do a pull request if that seems sensible to others. I've not done that yet because others might think I'm being a bit pushy here. After all, I may be the only one in the TEOS-10 group who is disquieted by the #if and the function prototypes. Still, we know from https://github.com/TEOS-10/GSW-R/issues/41#issuecomment-320123793 that the existing GSW-C code is wrong, so it seems to be a good time to get a few sets of eyes on my proposed code change, in case that will be of use.

efiring commented 7 years ago

Dan, I think the general idea is good, but your first shot at the implementation has problems:

1. Unless and until there is a major overhaul of the C code, the changes need to be made in toolbox/gsw_util_sort_real.c, not in gsw_oceanographic_toolbox.c. The latter is made by running make -f GNUmakefile.

2. In library code like this, never call exit. A library function should never kill the interpreter (R or Python) that is calling it--or have the machinery to do so, even if you know it will never be invoked. Here, I think I would just use the same sort of simplification that is in the original compare function (but corrected, of course).

3. It looks to me like you are resolving the tie in the opposite way to the original. In the original, 1 is returned if x1 > x2, or if x1 == x2 and i1 < i2. In yours, that last condition is x1 == x2 and i1 > i2.
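
For concreteness, points 2 and 3 together suggest a comparator along the following lines - a sketch only, reusing the XI struct from Dan's test case, not the code in Dan's fork (and per point 1, the real change belongs in toolbox/gsw_util_sort_real.c):

    /* Sketch: compareXI with the exit() removed and the tie broken in
    ** the same direction as the original compare, i.e. on equal values
    ** the element with the smaller original index sorts later.  It
    ** never returns 0, so the ordering is deterministic. */
    static int compareXI(const void *a, const void *b)
    {
        const XI *A = (const XI *)a;
        const XI *B = (const XI *)b;
        if (A->x < B->x)
            return (-1);
        if (A->x > B->x)
            return (1);
        return (A->i < B->i) ? 1 : -1;
    }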

dankelley commented 7 years ago

Thanks, Eric -- those three things are altered in the most recent commit in my fork at https://github.com/dankelley/GSW-C

Question: Can you explain why the original code handles the sort-on-index in that way? It seems incorrect to me, if the goal is to get a list sorted by increasing value, using lower indices to break ties. (I've not looked at any of the embedding code, though. Maybe this gets used for something that is in reverse order from the start, or something.)

efiring commented 7 years ago

The comments in the code that calls this say that it doesn't matter; it just has to be consistent. I haven't worked out the algorithm in detail, so it's not clear to me whether the direction of consistent tie-breaking matters. The sorting is applied to the concatenation of the sorted x-from and the x-to values, so in cases where an x-from matches an x-to, the tie-breaking determines which comes first.

Logically, which consistent choice is made shouldn't matter; if it did, then one would get different results if one inverted the order of all of the input arrays. So if you want to keep your original order, which looks nicer (greater-than always means return 1), then I'm confident that would be OK.

dankelley commented 7 years ago

1. Sort-on-index. I reckon it may be best for my suggested change to be similar to the original, to avoid confusion. Therefore I propose to keep what I have in my fork, viz. an echo of the old code.

2. Thread safety. I don't know whether a switch from qsort_[rs] to qsort removes thread safety. This is relevant because I see in the comments in the code for gsw_util_sort_real that it is supposed to be thread-safe. I think qsort must be thread-safe, because it's not named on the list of non-safe functions at http://man7.org/linux/man-pages/man7/pthreads.7.html, and that document suggests that everything POSIX that's not named there is thread-safe.

Perhaps @hylandg has some thoughts on these things, since I see from git-blame that he wrote the original code in Aug 2015 and Aug 2016 (seems like a good month!).

efiring commented 7 years ago

https://access.redhat.com/solutions/172383

I think the answer is that if qsort is not thread-safe, then that is a bug in the particular libc implementation, so it's fine to use it in GSW-C. I wouldn't worry about it.

hylandg commented 7 years ago

I've spent some time going over the code and running through some detailed examples, and I've determined that the direction of tie-breaking doesn't matter; in fact, whether the sort is stable or not doesn't matter. The final result from gsw_linear_interp_sa_ct (and gsw_util_interp1q_int - the only routines that call gsw_util_sort_real) is the same, although the calculation path is slightly different (because one interpolates from indices i-1 to i with u=1, and the other interpolates from i to i+1 with u=0).

So I believe using qsort will work just fine (with the stability adjustments included or not). The only reason I can think of for having a stable version is for consistency with the original matlab code (and the Fortran version). Apologies for making this a bigger deal than it needed to be.

PaulMBarker commented 7 years ago

I really want to move away from linear interpolation; it is just for "rough & ready cowboy/cowgirl" oceanographers. I have started doing this in the Matlab version, where the worst thing that is resorted to is a spline with tension. With a high tension value the results are similar to linear, but the derivative is continuous.

hylandg commented 7 years ago

Paul, that sounds like a good move, but I am concerned it may leave other languages behind. For example, if I do the conversion from the latest Matlab v3.06 release to Fortran how do I deal with Matlab specific functions such as pinv and pwch? Seems like a lot of work involved.

PaulMBarker commented 7 years ago

I thought that pinv is included in the LAPACK library - or is that not the case?

efiring commented 7 years ago

Paul, don't forget: 1) The data have instrumental errors, and include all sorts of phenomena that are noise from the standpoint of geostrophic calculations. 2) If the data are finely sampled, as is now typically the case, interpolation doesn't do much; it doesn't matter what method one uses. 3) The fancier the method, the more explanation is required when one uses it. 4) You never know what is really there between the sampled points.

I take strong issue with your blanket condemnation of linear interpolation, and your disdain for those who use it. What are the situations where the conclusions reached from a calculation on a data set will be substantially improved by a fancier algorithm? And what are the situations where the difference it makes is negligible compared to the other uncertainties? How often will TEOS-10 be used in the former situation, and how often in the latter?

PaulMBarker commented 7 years ago

Linear interpolation is one of my pet hates. It caught me out once when I was still a student; it took me weeks to sort out, and then only by talking to the guys who used to get their data from Niskin bottles in the '50s, '60s and '70s.

With modern data almost any interpolation is OK, but a method that gives a continuous derivative in the resulting data makes far more sense than one that does not.

We found that with modern data we can eliminate the very small-scale instabilities by adjusting salinity by less than the accuracy of the instrumentation, so it really comes down to how much you believe the raw data and those processing it. Looking over the processors' shoulders, I have noticed that the data given out are strongly influenced by each of the operators. One must also think about what one wants to do with the data: if you are looking at mixing then you will not want to remove those instabilities, but if you are looking at global circulation or long-term signals then keeping short-lived events that might last just 20 minutes is questionable.

We are now doing our interpolation on the SA-CT diagram. Using linear interpolation with SA-p and CT-p produced unrealistic SA-CT curves, as did the Reiniger & Ross (1968) method. In the Southern Ocean there are so few deep casts that we need to reconstruct the water column as best we can. I agree we do not know what was occurring between our measured bottles, but we can make an educated guess using what we know from modern data.

dankelley commented 7 years ago

The last few comments on this thread have opened up the chance for a wide-ranging discussion, touching on aspects of data collection and data processing, as well as oceanography. (But please see my last paragraph for a suggestion for the future of the discussion...)

Regarding interpolation vs smoothing, it seems to me that part of the challenge in identifying best practices is the simple fact that measurement schemes have changed a great deal through the years. It is also fair to say that different applications demand different methodologies in both observation and analysis. It seems unlikely that things will settle down anytime soon, because whole new classes of observational technologies are arriving like waves on a shore. I think flexibility is the key word.

Based on discussions I've had with a variety of oceanographers about TEOS-10 over the years, I wonder if it's time to consider splitting TEOS-10 into two parts.

  1. Part A This would consist of core facilities that are related directly to seawater properties. These functions would, it seems reasonable to assume, converge to an essentially stable state, with periodic updates to the database being the only significant change over time.

  2. Part B This would be made up of functions dealing with secondary quantities that oceanographers calculate, based partly on quantities from Part A. This would include N^2, R_rho, dynamic height, mixed-layer depth, and so forth. On these calculations, reasonable people are liable to reasonably disagree. Indeed, a given analyst might apply two different methods to a given dataset, when undertaking analyses with different goals. (For example, in my double-diffusion work, I calculate base vertical gradients of salinity, etc. on a large scale, using smoothing splines or low-pass filters, depending on the data source. But then to get interface properties, I use raw CTD data, not decimated to 1 dbar and not even converted from conductivity to salinity, because I may need to make adjustments for sensor mismatches. So, that's one person, one dataset, two methods.)

A set of us has discussed the idea of having a web conference call, perhaps annually, to discuss the status of the various software manifestations of TEOS-10. The above-stated scheme is one thing that I would likely bring up in such a discussion. I would argue that the TEOS-10/GSW code be reframed to deal only with Part A, and that anything related to Part B should go into a whole new package/library. An advantage of that would be that Part A would vary slowly, obtain wide agreement, and be supported well by peer-reviewed literature, so that it could get periodic approval from the Intergovernmental Oceanographic Commission. (As far as I know, the IOC approval was only for the original polynomials ... maybe I'm wrong on that, though.)

I realize that this comment does not really belong in this individual issue thread. Indeed, it doesn't belong in the TEOS-10/GSW-R repository, because it's a general topic for discussion that goes beyond any of the secondary variants of the base TEOS-10/GSW algorithms. If Paul were to create a new repository on TEOS-10 for general discussion, that's where I should put this comment ... but only if this smaller list thinks it has enough merit to get a wider airing.

dankelley commented 7 years ago

I updated the source gsw_oceanographic_toolbox.c to the latest from GSW-C, and documented this in GSW-R with date- and hash-specific comments throughout the documentation. Then I ran rhub::check() and rhub::check_with_valgrind() to confirm that things seemed okay, and then rhub::check_for_cran() to check on a suite of machine types and R versions. All seems okay, so I'm closing this issue. Thanks, all, for the help. Dan.