apjanke / octave-tablicious

Table (relational, tabular data) implementation for GNU Octave
https://apjanke.github.io/octave-tablicious/
GNU General Public License v3.0
28 stars 11 forks source link

Reconcile with Statistics OF package #108

Closed apjanke closed 5 months ago

apjanke commented 6 months ago

Let's make Tablicious play better with the Statistics Octave Forge (OF) package.

Statistics's maintainer has expressed interest in adding support for table arrays and other stuff in Tablicious. Let's support that.

Clean up function shadowing and conflicts with the Statistics OF package, particularly WRT the missing/fillmissing stuff. To make Tablicious compatible with Statistics so Statistics can use Tablicious to support table inputs.

High priority, since Statistics would like to do this for a coming-soon release.

Work

Work for this is on the bulbasaur-24 branch in this repo.

TODO

References

pr0m1th3as commented 6 months ago

I tried in install Tablicious to help spotting shadowing functions but... On Octave 7.2 (Ubuntu 20.04LTS) pkg install -forge tablicious produces an error

>> pkg install -forge tablicious
warning: function /home/promitheas/.local/share/octave/api-v57/packages/tablicious-0.3.6/endsWith.m shadows acore library function
warning: called from
    doc_cache_create>gen_doc_cache_in_dir at line 146 column 5
    doc_cache_create at line 62 column 12
    install>generate_lookfor_cache at line 840 column 5
    install at line 241 column 7
    pkg at line 612 column 9

warning: function /home/promitheas/.local/share/octave/api-v57/packages/tablicious-0.3.6/startsWith.m shadowsa core library function
warning: called from
    doc_cache_create>gen_doc_cache_in_dir at line 146 column 5
    doc_cache_create at line 62 column 12
    install>generate_lookfor_cache at line 840 column 5
    install at line 241 column 7
    pkg at line 612 column 9

error: '__event_manager_register_doc__' undefined near line 41, column 7
error: called from
    load_tablicious at line 41 column 7
    /home/promitheas/.local/share/octave/api-v57/packages/tablicious-0.3.6/PKG_ADD at line 3 column 1
    doc_cache_create>gen_doc_cache_in_dir at line 146 column 5
    doc_cache_create at line 62 column 12
    install>generate_lookfor_cache at line 840 column 5
    install at line 241 column 7
    pkg at line 612 column 9

Octave 8.2 produces that same error as well

apjanke commented 6 months ago

I removed the fillmissing, rmmissing, and standardizeMissing functions (here), and their specialization methods from table (here). I don't see them left in any other classes.

Added an istable function (here).

Having trouble regenerating the doco after that. Not sure if I'm just out of practice, or things have changed in the macOS build environment since last time.

pr0m1th3as commented 6 months ago

I have a few questions.

  1. How can I install the latest dev from the bulbasaur-24 branch into Octave?
  2. Would you like to to reassign the /docs folder for building online documentation with the pkg-octave-doc package? I can make a fork and work on this. Perhaps, we could relocate the files currently in this folder to another like /documentation. BTW, the current online documentation appears unreachable/broken.
  3. I think we would benefit by adding BISTs and DEMOs in the function files. On the plus side, the DEMOS can become part of the online documentation if built with pkg-octave-doc.
  4. Do you agree to change the @deftypefn instances from e.g.
    ## @deftypefn {Function} {@var{out} =} istable (@var{x})

    to

    ## @deftypefn {Tablicious} {@var{out} =} istable (@var{x})

    so that when users call help it is apparent that the function is part of the Tablicious package? I know that this is minor thing that I have previously adapted in the statistics package, but it helps a lot when functions shadow core functions or functions from other packages.

pr0m1th3as commented 6 months ago

Added an istable function (here).

istable is already a method in the table classdef. Of course, if the input variable is not of class table, then this method cannot be called. So I think it kind of misses the scope of existing. But istable is not of much use inside the Tablicious package either. I noticed the Devs note in below the help text, but IMHO it is not a very good idea to duplicate this function acrros multiple packages because we will end up with a labyrinth of istable functions shadowing one another, depending on the sequence of loaded packages. The most robust way is to check with isa (x, "table") whenever we want to add table support to functions in other packages (this is my approach in statistics anyway).

I noticed that there are several functions in /inst folder which are also available as methods in the table classdef file. Why is this so? It feels kind of a duplication and I can't figure out the benefits for this. Am I missing something? (in terms of functionality)

apjanke commented 6 months ago

I have a few questions.

  1. How can I install the latest dev from the bulbasaur-24 branch into Octave?

I don't know if Octave's pkg can install from git repo branches, and I'm not building "head of branch" tarballs to install from. But you mostly don't need to do a full package installation: the only compiled oct/mex file in Tablicious is for time zone conversions inside datetime, so if you're not using that, you can run it directly from the repo. Clone the Tablicious git repo to your local computer, check out the bulbasaur-24 branch there, and use addpath to add its inst/ directory to your Octave path. That's how I mostly run it myself.

  1. Would you like to to reassign the /docs folder for building online documentation with the pkg-octave-doc package?

Maybe! The pkg-octave-doc package is new to me, and looks useful. I'll check it out. I'm currently generating the documentation with some Perl scripts that I copy-pasted from another Octave package a few years back, and using them is a little rough. (E.g., make doc is currently broken in my build environment and I'm not sure why; that's why some of the online and generated doco is out of date here.)

I'd rather do this myself than take a PR for it. I like to be pretty hands-on with the tooling for my packages. I'll have a look at it soon.

Perhaps, we could relocate the files currently in this folder to another like /documentation

It lives in /docs now because that supports publishing to GitHub Pages from the master branch, instead of having to maintain a separate branch or repo for the GH Pages site, which is more work. This is just a convenience; I've come around to the view that what's published on GH Pages or the other online site should be the doco for the latest stable release, or versioned doco for each release. Doing it this way from master only publishes the doco for the latest dev release, and doing versioned doco within the repo adds clutter to both the repo files and the git commit history. If Tablicious gets popular, I'll probably break the documentation site out to a separate repo, at which point the /docs folder could be moved within this repo.

BTW, the current online documentation appears unreachable/broken.

Oh, thanks for catching that! Was a couple bad links in the main repo README and the doco index, plus GH Pages configuration. I fixed it (I think) and it's mostly working for me now. May take an hour or two for DNS changes to propagate so it works for everyone.

I think we would benefit by adding BISTs and DEMOs in the function files.

Oh, yes. Should definitely have tests and demos. Just haven't found the time to do those yet.

Do you agree to change the @deftypefn instances from [{Function} to {Tablicious} ...] so that when users call help it is apparent that the function is part of the Tablicious package?

Oh, hmm, I'd never seen that before. I'll have a look at how you did it in Statistics. Making it clear that these functions are part of an OF package instead of core Octave seems like a good idea.

Here's a separate issue for these documentation-related things: https://github.com/apjanke/octave-tablicious/issues/110

apjanke commented 6 months ago

So I think it kind of misses the scope of [an istable function] existing.

So, the M-code doco doesn't come right out and say this, but I think there are two reasons for istable() to exist as a function and not just a method.

  1. It allows classes or types to declare that they implement the table or tabular interface in a duck-typing sense, even if they don't subclass @table, and can be coded against as such.
  2. Matlab coders just like saying isblah(x) instead of isa(x, 'blah'), like with isstring() or isdouble(). Maybe because they're scared of OOP.

The first thing – declaring duck-typing-style interface compatibility with table – lets either the core language or users to define new classes which work like tables, and can be supported by functions that accept table array inputs, but don't inherit from @table itself as subclasses of it. For example, one could define a table-like class that is a remote proxy for a SQL table on a database server, or a Tall Array style datasource for some HDF5 or Parquet files on disk, or just an alternate implementation of an in-memory table array because you have special storage needs, or a different idea for implementing joins, or just think you can do it better than I did.

In these cases, if functions (like in Statistics, or other packages) that had optional table support in terms of the arguments they accept were to check for table-ness with istable like this, instead of isa(x, 'table') in the same place:

function blah(x)
  % Works on numerics or tables
  if istable(x)
    % ... do table-y stuff ...
  else
    % ... do number-y stuff ...
  end
end

...then those functions could work with alternate table-like implementations (concrete classes) without having to know about them in advance.

Now, I doubt anyone using Octave is going to actually want to do that any time soon. That's not a huge motivation for Tablicious per se, just my view of why istable() exists in Matlab in the first place.

The main reason I want istable() in Tablicious now is for Matlab compatibility, so users with existing M-code scripts and functions that are already calling istable() can bring them over to Octave + Tablicious and have them work without having to make source code modifications to convert all the istable(x) calls to isa(x, 'table').

it is not a very good idea to duplicate this function acrros multiple packages because we will end up with a labyrinth of istable functions shadowing one another, depending on the sequence of loaded packages.

Yep, it's a mess. Just the best I could come up with here. As long as all the istable.m implementations are almost exactly the same, it'll work fine and doesn't matter who shadows who. And istable seems simple enough that that may be feasible, but who knows.

If core Octave provided an istable() function that just always returned false, and then classdefs could override it with methods, this would work fine without packages having to supply an istable.m function definition. But currently, calling istable(x) raises an "istable undefined" error instead of returning false (if x is not a class that defines an istable method).

The "clean" modular way to do this would be to separate istable and other stuff that defines just the interface for tables and the means of detecting them out to a minimal "Service Provider Interfave" package, named like TableApi. Then Tablicious, Statistics, and any other package that wanted to support tables but not have a hard dependency on them could take a non-optional dependency on TableApi, get access to a single consistent definition of istable(), and then have an optional dependency on or compatibility with Tablicious, and client code could install Tablicious (or another implementation provider package) if they wanted to actually use tables in their code. But that approach is kinda complex and software-engineering-y, and I bet Octave folks would hate it.

The most robust way is to check with isa (x, "table") whenever we want to add table support to functions in other packages

Yep, that should work fine. Statistics can ignore the existence of istable entirely if it wants, not provide its own istable.m, and just do isa(x, 'table') everywhere.

I noticed that there are several functions in /inst folder which are also available as methods in the table classdef file. Why is this so? It feels kind of a duplication and I can't figure out the benefits for this. Am I missing something?

The methods provide class-specific specializations of those functions, and functions provide more general (and possibly degenerate) implementations that work on inputs other than those classes which define methods of the same name. It's the general pattern of having functions which are specialized by method overrides. And it's also possible that in some cases I forgot about one or the other implementation, and there's duplication which can be removed.

Any particular occurrences of that which are bothering you, or you're wondering about?

pr0m1th3as commented 6 months ago

I made a personal repository of Tablicous and will try to build online documentation with pkg-octave-doc so we can see how it looks and if you like it then we can merge the changes back to yours. I am only working on the table classdef file to add functionality for tbl = table ('Size', sz, 'VariableTypes', varTypes) which is currently not supported. Add BISTs and DEMOs, and then build the online documentation for this to see how it will come out.

apjanke commented 6 months ago

Say, @pr0m1th3as, I think I'm going to rename this repo's master branch to main, bringing it in line with current prevalent Git and GitHub usage. Would that cause you any problems if I did so today? Would only affect you if you have existing clones or forks of the repo.

apjanke commented 6 months ago

...will try to build online documentation with pkg-octave-doc... working on the table classdef file to add functionality for tbl = table ('Size', sz, 'VariableTypes', varTypes)

Cool; both those would be useful!

pr0m1th3as commented 6 months ago

Say, @pr0m1th3as, I think I'm going to rename this repo's master branch to main, bringing it in line with current prevalent Git and GitHub usage. Would that cause you any problems if I did so today? Would only affect you if you have existing clones or forks of the repo.

No, its ok. I will make a new clone as needed and merge any changes locally before making a PR to your repo. As said above, I am only making changes in the table.m classdef file, so anything else you do should not make it difficult to merge back any changes in table.m.

pr0m1th3as commented 6 months ago

Any particular occurrences of that which are bothering you, or you're wondering about?

I only spotted the array2table, cell2table, struct2table, and pp functions. Although the latter is not technically shadowing the prettyprint method

apjanke commented 6 months ago

the array2table, cell2table, struct2table, and pp functions

Those are all just functions. There are no method overrides for them in Tablicious's classes, IIRC.

And I should probably remove pp. That's a command I like to use for interactive convenience, but it's not an established function elsewhere, and Tablicious probably shouldn't be grabbing a short two-letter global name like that.

pr0m1th3as commented 6 months ago

Those are all just functions. There are no method overrides for them in Tablicious's classes, IIRC.

You are right, it's `pp' that is a method.

BTW, does table allow for nested tables?

pr0m1th3as commented 6 months ago

In the string class, we need to change the constructor to identify missing cases properly. I will make a separate PR for this. I also spotted a bug in string.m at line 307. It will be included in the PR

apjanke commented 6 months ago

No, its [([renaming master to main])] ok. I will make a new clone as needed [...]

Renamed master to main just now. Please make a fresh clone and fork (and/or update the branch names and tracking configuration in your current clone) at your convenience.

BTW, does table allow for nested tables?

Not currently. It ought to, but I haven't been able to get that working correctly with various indexing operations ((), {}, and .) around nested tables, particularly chained indexing. It's related to subsasgn/subsref somehow, but I don't know if the problem is in my own code in the implementation of the override methods for those, or some issue with core Octave's support for them. Not sure exactyly where I left this off last time I was working on it.

Here's a ticket for nested table support: https://github.com/apjanke/octave-tablicious/issues/22

need to change the constructor to identify missing cases properly [...]

Oops, yeah. Numeric NaNs and NaT datetimes should convert to missing string values. (And really, that isnan/isnat check should be done first, and the datestr or num2str call not even be done on NaN/NaT input elements.) Not sure how I missed that.

Note that some of the code in the string constructor may be redundant (but consistent) with that of string() methods defined on classes like datetime or categorical. That's intentional: I designed the classes in this package to be somewhat "severable", such that you could pick up the implementation of one class and move it to another OF package or to core Octave and it would work with an alternate implementation of its complementary classes there, or with one to be supplied in the future. It's an attempt to make Tablicious friendly to rearranging things w/r/t other packages, like that "table support in core Octave" being discussed over on Discord. Table arrays, strings, and datetime aren't really a natural combination for a single package, aside from that they're all things I use together with tabular data analysis and all absent from core Octave.

a bug in string.m at line 307.

D'oh! Yup. (Adequate tests like you recommend could have caught that.) Think a good approach there is to just have the string.cell method call string.cellstr, in line with how string.cell's helptext talks it returning the same as string.cellstr and just being there for interface compatibility.

apjanke commented 6 months ago

Okay, the generated doco is building again, on the bulbasaur-24 branch. Have a look there and in https://github.com/apjanke/octave-tablicious/issues/110.

I rebased and squashed the bulbasaur-24 branch. You'll need to rebase your changes on that fresh version of it if you're coding on top of that branch.

I'll merge it to main pretty soon, since publishing to Tablicious's GH Pages currently can only be done from main, and then update the GH Pages site.

pr0m1th3as commented 6 months ago

I am struggling to make a PR for string.m class due to continuous changes somehow messing up the bulbasaur-24 branch on my side. Can I just make a PR directly to main for the sting.m file? Update, I just made a PR on the bulbasaur-24 branch through the GitHub web interface

apjanke commented 6 months ago

Thanks! Looks like this PR is based on the right state of the bulbasaur-24 branch and can merge cleanly. I'll have a look through it.

One issue - this PR mixes a few different and IMHO kind of independent things, which I'd like to make independent choices on whether to accept, and also to have them be distinct commits in the Git commit history. (For readability when skimming the commit history or when examining a diff for one commit.) If you'd like to get author credit for them, could you split them up in to separate PRs? Items I see:

I'll leave the bulbasaur-24 branch stable with no further rebases tonight to make it easy for you to do additional PRs here if you want. (I'm in US Eastern Time.)

In the mean time, I'll drop a couple comments on this existing PR.

pr0m1th3as commented 6 months ago

If you'd like to get author credit for them,

That's ok, merge them as you see fit. I don't have a git history for this, because my local repo was broken due to the repeated re-basing between branches.

apjanke commented 6 months ago

Cool. I'll just grab these code changes piecemeal and merge some of them in, then.

Thanks!

apjanke commented 6 months ago

I pulled in the content from your PR in a few commits on the bulbasaur-24 branch (here, and also here). Took everything but the Changing {Function} to {Tablicious}; gotta think about that one a bit.

Thanks!

apjanke commented 6 months ago

Okay, I think I've got all the stuff that conflicts with Statistics at a basic level removed. It's merged to main. See anything left that will cause you major problems here? Think I'm getting close to an 0.4.0 release that will pick these changes up.

apjanke commented 6 months ago

Add BISTs and DEMOs [...]

Say, can you or somebody point me at some examples or doco on how to add DEMOs to modern Octave code? This is a good idea but I don't really know how to do it.

pr0m1th3as commented 6 months ago

Add BISTs and DEMOs [...]

Say, can you or somebody point me at some examples or doco on how to add DEMOs to modern Octave code? This is a good idea but I don't really know how to do it.

See the ClassificationKNN classdef in the statistics package, which contains BISTs for both the constructor and the method, as well as 2 demos. What I do in statistics, is that I first add the demo code, then add BISTs related to output testing and at the end I add BISTs for input validation. e.g.

...
endfunction

%!demo
%! ## Plot a sinewave
%! x = [1:0.1:10];
%! plot (x, sin (x));

%!demo
%! ## Another demo
%! ...

## Test output
%!test
%! x = 1;
%! y = 2;
%! assert (sum (x, y), 3);

%!assert(sum(1, 2), 3)

## Test input validation
%!error<somefunction: some error> somefunction (bad input)

Note that assert inside a test block musthave a space before the %!. Without a space (like in %!assert) is considered a new separate BIST. If you want to reuse input data for more tests, you can use

%!shared x, y
%! x = 1;
%! y = 2;
%!test
%! assert (sum (x, y), 3);
%!test
%! assert (sum (x, x), 2);

%!assert (sum (y, y), 4);
apjanke commented 5 months ago

Okay, I think I've got the Statistics compatibility stuff about done, fixed up some other issues, and am about ready to cut a Tablicious 0.4.0 release. I merged most of my pending code to main, since I had several changes piling up that interacted with each other.

Is there anything left here that presents a major Statistics interoperability issue that we should get done before Tablicious 0.4.0?

I'm going to put BISTS and demos down the road a bit to 0.5.0; that looks like a larger effort. Handing it under this other ticket:

pr0m1th3as commented 5 months ago

Can you wait a couple of days? I have been working full time the pas 5 days on the table class and I am getting close to submit a PR on it, which includes a lot of bug fixes, and much enhanced compatibility. Ideally, I would need an extra 3-4 days to add all the relevant BISTs as well and cleanup the code base and coding style of earlier code that I haven't tested yet.

apjanke commented 5 months ago

Closing as done, since Statistics decided to go with their own table implementation: https://github.com/apjanke/octave-tablicious/issues/115#issuecomment-1925874715.