OCamlPro / gnucobol

A clone of the sourceforge GnuCOBOL compiler from COBOL to C.
https://get-superbol.com
GNU Lesser General Public License v3.0

Add support for collating sequences on indexed files #132

Closed ddeclerck closed 5 months ago

ddeclerck commented 6 months ago

This PR adds support for COLLATING SEQUENCE on indexed files, as mentioned in https://sourceforge.net/p/gnucobol/feature-requests/459/.

Note this has only been done for the BDB backend - that's what our client needs.

ddeclerck commented 6 months ago

Hmm, that broke the NIST testsuite. Investigating further, it seems we should in fact NOT apply the program collating sequence to indexed files that do not have an explicit collating sequence (as opposed to SORT/MERGE files).

See 12.3.5.3 in ISO/IEC 1989:2014:


11) The alphanumeric program collating sequence and national program collating sequence are used to determine the truth value of any alphanumeric comparisons and national comparisons, respectively, that are:
a) Explicitly specified in relation conditions.
b) Explicitly specified in condition-name conditions.
c) Implicitly specified by the presence of a CONTROL clause in a report description entry.
When alphabet-name-1 or alphabet-name-2, or both, is associated with a locale, locale category LC_COLLATE is used to carry out these comparisons.

12) The alphanumeric program collating sequence and national program collating sequence explicitly or implicitly established by the OBJECT-COMPUTER paragraph are effective with the initial state of the runtime modules to which they apply. If alphabet-name-1 or alphabet-name-2 references a locale, the associated collating sequence is defined by category LC_COLLATE in the specific locale associated with that alphabet-name or, if none, in the locale current at the time the collating sequence is used at runtime.

13) The alphanumeric program collating sequence and national program collating sequence are applied to alphanumeric and national sort or merge keys, respectively, unless the sort or merge collating sequence has been modified by execution of a SET statement or a COLLATING SEQUENCE phrase is specified in the respective SORT or MERGE statement.
GitMensch commented 6 months ago

Found the answer in 12.4.4.3:

If no COLLATING SEQUENCE clause is specified: a) the collating sequence for alphanumeric record keys, both primary and alternate, is the native alphanumeric collating sequence; b) the collating sequence for national record keys, both primary and alternate, is the native national collating sequence.

So this is by default "native", not "program collating". Please re-adjust, sorry if I was misleading here.

BUT: before adjusting the code, please add a minimal version of the failing NIST test (ideally written from scratch) to our internal testsuite; this should fail before your adjustment and pass afterwards.

(and note that this is also one of the fixed file attributes, which in theory should be checked - we don't do that up to 4.x, but don't check it there either...).
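
For illustration, the selection rule from 12.4.4.3 can be sketched like this (hypothetical helper and type names, not the actual cobc code):

    #include <stddef.h>

    typedef struct { unsigned char map[256]; } colseq_t;   /* hypothetical alphabet table */

    /* Collating sequence for an indexed file's record keys per 12.4.4.3:
       the file's own COLLATING SEQUENCE if one was given, otherwise NULL,
       meaning "native" - deliberately NOT the program collating sequence
       (which applies to comparisons and to SORT/MERGE keys instead). */
    static const colseq_t *
    indexed_key_colseq (const colseq_t *file_colseq, const colseq_t *program_colseq)
    {
        (void)program_colseq;   /* not used for indexed files */
        return file_colseq;     /* NULL => native byte order */
    }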

codecov-commenter commented 6 months ago

Codecov Report

Attention: 11 lines in your changes are missing coverage. Please review.

Comparison is base (db7db96) 65.78% compared to head (b594e0e) 65.86%.

Files             Patch %    Lines
cobc/tree.c       78.94%     2 Missing and 2 partials
cobc/cobc.c       71.42%     1 Missing and 1 partial
cobc/codegen.c    90.47%     0 Missing and 2 partials
libcob/common.c   60.00%     2 Missing
libcob/fileio.c   95.45%     0 Missing and 1 partial


Additional details and impacted files:

```diff
@@                Coverage Diff                  @@
##           gcos4gnucobol-3.x     #132      +/-   ##
=====================================================
+ Coverage              65.78%   65.86%    +0.08%
=====================================================
  Files                     32       32
  Lines                  59416    59481       +65
  Branches               15694    15708       +14
=====================================================
+ Hits                   39087    39178       +91
+ Misses                 14311    14283       -28
- Partials                6018     6020        +2
```


ddeclerck commented 6 months ago

I made the necessary changes. I also added a new -fdefault-file-colseq flag that works similarly to -fdefault-colseq but just for files.

As for the tests, I reworked them and included a dummy program collating sequence (taken from the failing NIST test), which was sufficient to cause the testsuite to fail prior to fixing the bug.

GitMensch commented 6 months ago

That sounds all good!

Please check the test coverage of your changes, the codecov bot says:

Attention: 188 lines in your changes are missing coverage. Please review.

188 sounds a bit much.

ddeclerck commented 6 months ago

That sounds all good!

Please check the test coverage of your changes, the codecov bot says:

Attention: 188 lines in your changes are missing coverage. Please review.

188 sounds a bit much.

Fixed! Just had to resynchronize with upstream. The remaining uncovered lines are either benign or were not covered anyway.

ddeclerck commented 6 months ago

Should be good. Though I left the key collating sequence for later (although I enabled the field to store the file collating sequence).

GitMensch commented 6 months ago

Note: I do have a bigger refactoring of the BDB code which I want to push at the start of next week; should I wait for this PR to be finished and merged, or would you perhaps prefer that I push the refactoring first? (I just wanted to fix a bug, recognized that for doing so I should backport the status 02 complex; then got from one thing to the next... and then did a refactoring, using the specific f->keys[idx] instead of an index of a subfield, and the same for the contained keydesc - the refactoring got so big that I should push it separately in any case [the full code of course has a bug somewhere that hides from me].)

ddeclerck commented 6 months ago

Note: I do have a bigger refactoring of the BDB code which I want to push at the start of next week; should I wait for this PR to be finished and merged, or would you perhaps prefer that I push the refactoring first?

I plan to make the final changes to this PR on Monday (need to rest this weekend); if that prevents you from merging, I don't mind rebasing my PR on your changes.

ddeclerck commented 6 months ago

Made the last changes, though there might be another one to do. I'm concerned about the memcmp here (fileio.c:5679):

    /* Check record key */
    bdb_setkey (f, 0);
    if (!p->last_key) {
        p->last_key = cob_malloc ((size_t)p->maxkeylen);
    } else
    if (f->access_mode == COB_ACCESS_SEQUENTIAL
     && memcmp (p->last_key, p->key.data, (size_t)p->key.size) > 0) {
        return COB_STATUS_21_KEY_INVALID;
    }
    memcpy (p->last_key, p->key.data, (size_t)p->key.size);

All other memcmp's on keys are fine because they just test (in)equality with 0, but this one tests for > 0. I guess I should replace this with a collation-aware compare, right?
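
For illustration, a collation-aware replacement for that ordering memcmp could look roughly like this (a sketch only: colseq_cmp and the 256-byte translate table are assumptions, not the actual fileio.c code):

    #include <stddef.h>
    #include <string.h>

    /* Sketch: compare two keys of equal size through a 256-byte translate
       table (as built for ALPHABET clauses); NULL table = native order. */
    static int
    colseq_cmp (const unsigned char *a, const unsigned char *b,
                size_t size, const unsigned char *colseq)
    {
        size_t i;
        if (!colseq) {
            return memcmp (a, b, size);
        }
        for (i = 0; i < size; ++i) {
            if (colseq[a[i]] != colseq[b[i]]) {
                return colseq[a[i]] < colseq[b[i]] ? -1 : 1;
            }
        }
        return 0;
    }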

GitMensch commented 6 months ago

I have a strange feeling now... Can you please verify that:

  • the testsuite has at least one entry with variable length keys (which test case is that?) for each SEARCH and INDEXED
  • if missing please add it
  • check if we do store the variable length in the key fields

I agree that a memcmp should only be used for testing equality or difference, otherwise either a numeric compare (if the key field is numeric - the byte order could be non-native!) or a collation-aware check needs to be used

ddeclerck commented 6 months ago

I have a strange feeling now... Can you please verify that:

  • the testsuite has at least one entry with variable length keys (which test case is that?) for each SEARCH and INDEXED
  • if missing please add it
  • check if we do store the variable length in the key fields

Doesn't seem to be checked - but shouldn't I address this in a different PR?

I agree that a memcmp should only be used for testing equality or difference, otherwise either a numeric compare (if the key field is numeric - the byte order could be non-native!) or a collation-aware check needs to be used

Oh, wait... I believe non-display numeric keys in indexed files are already incorrectly ordered... Just tested with a primary key of usage BINARY-LONG and inserting the keys 4000, 4100 and 4200: on sequential read I get 4100, 4200 and 4000...
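
That ordering follows directly from comparing the keys' native little-endian byte layout byte by byte; a quick stand-alone check, independent of GnuCOBOL, reproduces it on a little-endian host:

    /* Demo: memcmp over little-endian binary keys does not give numeric
       order - 4000/4100/4200 come back as 4100, 4200, 4000. */
    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    static int key_cmp (const void *a, const void *b)
    {
        return memcmp (a, b, sizeof (uint32_t));
    }

    int main (void)
    {
        uint32_t keys[3] = { 4000, 4100, 4200 };   /* stored in native byte order */
        qsort (keys, 3, sizeof (uint32_t), key_cmp);
        printf ("%u %u %u\n", (unsigned)keys[0], (unsigned)keys[1], (unsigned)keys[2]);
        return 0;   /* prints "4100 4200 4000" on little-endian machines */
    }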

GitMensch commented 6 months ago

That was my feeling. Please address the numeric key sorting, if possible. I think this should work (and hopefully be tested) in SEARCH already.

As both can use the same comparison logic (just using a different set of collations), I think it's useful to test that.

Testing variable length keys belong to this PR if previously possibly completely handled by BDB and now in our own search function.

Note: those have to be ordered as in COBOL, which may be different from BDB.

ddeclerck commented 6 months ago

That was my feeling. Please address the numeric key sorting, if possible.

I see two ways to fix that:

In any case, I believe this requires some thinking before actually doing that, and I'm not sure this belongs in this PR (as it did not introduce that problem).

I think this should work (and hopefully be tested) in SEARCH already.

Tested, seems to work fine for SEARCH.

As both can use the same comparison logic (just using a different set of collations), I think it's useful to test that.

Could you elaborate?

Testing variable length keys belong to this PR if previously possibly completely handled by BDB and now in our own search function.

I'm not sure I understand; could you rephrase that?

GitMensch commented 6 months ago
  • either modifying the bdb_bt_compare function - but that would fix only for BDB...

that would be enough for now and is likely "the most portable" solution. But that creates a HUGE problem: if we do that, then the files need to be rebuilt :-/ Note that using the key type and the appropriate COBOL sort function (if the field structure is within `->app_data` [if not done: store a complete cb_field structure in there], then you only need to duplicate that cob_field to set the two separate fields' data) should directly work for all types. This is effectively what is done for `SEARCH` (and the internal comparison during `SORT`).
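
A rough sketch of that cob_field-based compare for BDB, assuming a template cob_field describing the key is kept in the DB handle's app_private and compared with libcob's cob_cmp (shown with the classic three-argument bt_compare callback; some newer BDB releases add a fourth locp parameter):

    #include <db.h>
    #include <libcob.h>

    /* Sketch only: order two stored keys the COBOL way by duplicating a
       template cob_field (kept in db->app_private) and pointing the copies
       at the two key buffers - the same kind of compare SEARCH/SORT use. */
    static int
    bdb_bt_compare_cobfield (DB *db, const DBT *a, const DBT *b)
    {
        cob_field f1 = *(cob_field *)db->app_private;
        cob_field f2 = f1;

        f1.data = (unsigned char *)a->data;
        f2.data = (unsigned char *)b->data;
        return cob_cmp (&f1, &f2);
    }

    /* registration, before DB->open:
         db->app_private = key_field_template;
         db->set_bt_compare (db, bdb_bt_compare_cobfield);        */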

Otherwise we could have small-enough numbers converted to big-endian binary and bigger ones converted to USAGE DISPLAY SIGN SEPARATE LEADING in bdb_savekey; that would also work - but again leads to all files needing a rebuild. As this repacking/unpacking would be done on any key access, this is likely slower than the approach above.
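
A sketch of that encode-on-save alternative for a small unsigned binary key (hypothetical helper name; signed and larger values would need the DISPLAY SIGN SEPARATE form or a sign bias):

    #include <stdint.h>

    /* Sketch: store an unsigned 32-bit binary key big-endian so that a
       plain byte-wise compare (memcmp) yields numeric order. */
    static void
    put_key_big_endian (uint32_t key, unsigned char out[4])
    {
        out[0] = (unsigned char)(key >> 24);
        out[1] = (unsigned char)(key >> 16);
        out[2] = (unsigned char)(key >>  8);
        out[3] = (unsigned char) key;
    }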

As LMDB does not provide an external compare function, the second solution seems to be the only one that could be applied there (but note that LMDB exists only in trunk and even then is marked as experimental and possibly will be dropped).

Note: with ISAM there are key types that handle this (including IEEE 754), but we currently only use CHARTYPE, so the same problem seems to exist there. For this part I'll contact Ron.

In any case, I believe this requires some thinking before actually doing that, and I'm not sure this belongs in this PR (as it did not introduce that problem).

Hm. Please add a test with expected failure and a comment about the underlying problem. We could otherwise leave that general problem out of this PR, but if a key is purely numeric, there should be no "additional" conversion applied which makes the current scenario worse. That brings us to the good thing: if you implement key collation on the cobc side (you already have the codegen and use of that in libcob) then you can place a check there for a numeric type and just skip generating the collation for that.

As both can use the same comparison logic (just using a different set of collations), I think it's useful to test that.

Could you elaborate?

SEARCH/SORT already compares "as cob_field" for any numeric type; if the same is applied to the INDEXED comparisons (you need a cob_field in `app_data` for that), then everything "just works". But as we noted: this may be kept to a separate PR, especially as we have to consider what to do with the existing files (likely we'd have to insert a marker into the file itself, likely by an embedded table, where we record "that's type 2, use different search algorithm"). This is definitely useful, but definitely "quite big" (and yes, should be kept separate).

Testing variable length keys belong to this PR if previously possibly completely handled by BDB and now in our own search function.

I'm not sure I understand; could you rephrase that?

BDB compares keys. Those are set up from `cob_field *component` - if this matched the COBOL field, it could be variable-length; but as far as I can see now those are always fixed-length; so the issue is again "likely important, but likely big and belongs in another PR".

So... do you add key collations in cobc now and skip setting the collation if the field is numeric (potentially also skip if it is NATIONAL)? That should provide enough for the original issue.

Yeah, they should always be of identical length in our context (and I made a quick check by adding an assert and running both NIST and the testsuite with no error). Anyway, I preferred to handle the case (the comparison function is rather "low-level" and does not really care about our use case).

The new indexed_compare is our specialized function for our case - and we cannot even support two different sizes, as we don't have them when the specialized function is used. Therefore: please adjust indexed_compare() to take only a single size, which then also removes the need to use cob_min_int().

For the runtime warning - I think that should be an error, but we don't expect it to be seen by a user so remove the translation part and add a coverage exclusion, like

            /* LCOV_EXCL_START */
            if (k++ == MAX_MODULE_ITERS) {  /* prevent endless loop in case of broken list */
                /* not translated as highly unexpected */
                cob_runtime_warning ("max module iterations exceeded, possible broken chain");
                break;
            }
            /* LCOV_EXCL_STOP */

ddeclerck commented 6 months ago
  • either modifying the bdb_bt_compare function - but that would fix only for BDB...

that would be enough for now and is likely "the most portable" solution. But that creates a HUGE problem: if we do that, then the files need to be rebuilt :-/ Note that using the key type and the appropriate COBOL sort function (if the field structure is within `->app_data` [if not done: store a complete cb_field structure in there], then you only need to duplicate that cob_field to set the two separate fields' data) should directly work for all types. This is effectively what is done for `SEARCH` (and the internal comparison during `SORT`).

Otherwise we could have small-enough numbers converted to big-endian binary and bigger ones converted to USAGE DISPLAY SIGN SEPARATE LEADING in bdb_savekey; that would also work - but again leads to all files needing a rebuild.

So, both solutions would imply rebuilding the files. What I like about the first one is that it uses the backend's mechanisms intended for this purpose (i.e. a custom key compare function) - but it has to be implemented specifically for each backend, and not all backends expose such features. And what I like about the second one is that it is (I think?) backend-agnostic.

As this repacking/unpacking would be done on any key access, this is likely slower than the approach above.

I don't think we would ever have to translate the "encoded" keys backward, right?

As LMDB does not provide an external compare function, the second solution seems to be the only one that could be applied there

Isn't mdb_set_compare what we're looking for?
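
For reference, a minimal sketch of how that LMDB hook is used (the comparator shown is just a byte compare; a collation/numeric-aware one would go in its place, and it has to be re-registered on every open since LMDB does not persist it in the file):

    #include <string.h>
    #include <lmdb.h>

    /* Sketch: custom key comparator registered with mdb_set_compare(). */
    static int
    cobol_key_cmp (const MDB_val *a, const MDB_val *b)
    {
        size_t n = a->mv_size < b->mv_size ? a->mv_size : b->mv_size;
        int d = memcmp (a->mv_data, b->mv_data, n);
        if (d) {
            return d;
        }
        return (a->mv_size > b->mv_size) - (a->mv_size < b->mv_size);
    }

    /* after mdb_dbi_open() and before any data access:
         mdb_set_compare (txn, dbi, cobol_key_cmp);               */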

(but note that LMDB exists only in trunk and even then is marked as experimental and possibly will be dropped).

Even if slightly off-topic: I really like the idea of an LMDB backend. What would be the reasons for dropping it? Not performing as expected?

Note: with ISAM there are key types that handle this (including IEEE 754), but we currently only use CHARTYPE, so the same problem seems to exist there. For this part I'll contact Ron.

In any case, I believe this requires some thinking before actually doing that, and I'm not sure this belongs in this PR (as it did not introduce that problem).

Hm. Please add a test with expected failure and a comment about the underlying problem.

I guess I'll do that.

We could otherwise leave that general problem out of this PR, but if a key is purely numeric, there should be no "additional" conversion applied which makes the current scenario worse. That brings us to the good thing: if you implement key collation on the cobc side (you already have the codegen and use of that in libcob) then you can place a check there for a numeric type and just skip generating the collation for that.

Well, I think this is handled in codegen.c, in output_indexed_file_key_colseq:

    int type = cb_tree_type (key, cb_code_field (key));
    cb_tree col = NULL;

    /* We only apply a collating sequence if the key is alphanumeric / display */
    if ((type & COB_TYPE_ALNUM) || (type == COB_TYPE_NUMERIC_DISPLAY)) {
        /* TODO: actually handle record keys (using file key for now) */
        col = f->collating_sequence;
    }

We agree that NUMERIC DISPLAY should be stored lexicographically, right?
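
A quick illustration of why that holds for unsigned, equal-length DISPLAY keys: the digits '0'..'9' are contiguous and ascending in both ASCII and EBCDIC, so a byte-wise compare already matches numeric order (signed DISPLAY with an embedded or separate sign would not be this simple):

    #include <stdio.h>
    #include <string.h>

    int main (void)
    {
        /* unsigned PIC 9(4) DISPLAY keys as stored byte-for-byte */
        const char *k1 = "4000", *k2 = "4100", *k3 = "4200";
        printf ("%d %d\n", memcmp (k1, k2, 4) < 0, memcmp (k2, k3, 4) < 0);   /* prints: 1 1 */
        return 0;
    }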

As both can use the same comparison logic (just using a different set of collations), I think it's useful to test that.

Could you elaborate?

SEARCH/SORT already compares "as cob_field" for any numeric type; if the same is applied to the INDEXED comparisons (you need a cob_field in `app_data` for that), then everything "just works". But as we noted: this may be kept to a separate PR, especially as we have to consider what to do with the existing files (likely we'd have to insert a marker into the file itself, likely by an embedded table, where we record "that's type 2, use different search algorithm"). This is definitely useful, but definitely "quite big" (and yes, should be kept separate).

Got it. As an alternative to an embedded "marker", how about an extra file for such metadata?

Testing variable length keys belong to this PR if previously possibly completely handled by BDB and now in our own search function.

I'm not sure I understand; could you rephrase that?

BDB compares keys. Those are set up from `cob_field *component` - if this matched the COBOL field, it could be variable-length; but as far as I can see now those are always fixed-length; so the issue is again "likely important, but likely big and belongs in another PR".

So... do you add key collations in cobc now and skip setting the collation if the field is numeric (potentially also skip if it is NATIONAL)? That should provide enough for the original issue.

I think that's what my code does now; except that it uses the file collation for every key - but I guess with all the things to do/fix we discussed here, I'd better implement complete support for (alphanumeric) key collations right now so we don't have to come back to it later.

Yeah, they should always be of identical length in our context (and I made a quick check by adding an assert and running both NIST and the testsuite with no error). Anyway, I preferred to handle the case (the comparison function is rather "low-level" and does not really care about our use case).

The new indexed_compare is our specialized function for our case - and we cannot even support two different sizes, as we don't have them when the specialized function is used. Therefore: please adjust indexed_compare() to take only a single size, which then also removes the need to use cob_min_int().

Alright, I'll change that.

For the runtime warning - I think that should be an error, but we don't expect it to be seen by a user so remove the translation part and add a coverage exclusion, like

          /* LCOV_EXCL_START */
          if (k++ == MAX_MODULE_ITERS) {  /* prevent endless loop in case of broken list */
              /* not translated as highly unexpected */
              cob_runtime_warning ("max module iterations exceeded, possible broken chain");
              break;
          }
          /* LCOV_EXCL_STOP */

Will do.

ddeclerck commented 6 months ago

Rebased, added key collations and a test with expected failure regarding the numeric key sorting problem. @GitMensch would it be sufficient for now?

GitMensch commented 5 months ago

While reviewing: I've just checked in the refactoring.

ddeclerck commented 5 months ago

All changes accounted for - will merge to SVN if it's okay (I see the tag is already there). (+ rebased)

GitMensch commented 5 months ago

I trust your changes will be fine - as the test is already set: go on with the commit.

Thanks!

ddeclerck commented 5 months ago

Merged on SVN.

GitMensch commented 5 months ago

@ddeclerck Thank you for working on this. In my review I've missed that the new functions were defined globally - that broke the !WITH_DB builds, just fixed that in svn.

ddeclerck commented 5 months ago

@ddeclerck In my review I've missed that the new functions were defined globally - that broke the !WITH_DB builds, just fixed that in svn.

Ah, I overlooked that - sorry. Hopefully this was an easy fix. Thanks.