apjanke / octave-tablicious

Table (relational, tabular data) implementation for GNU Octave
https://apjanke.github.io/octave-tablicious/
GNU General Public License v3.0
28 stars 11 forks source link

Cleaning `table` classdef methods #115

Closed pr0m1th3as closed 5 months ago

pr0m1th3as commented 6 months ago

Since the table classdef has numerous methods, many of which are kind of unnecessary, I suggest that a good start is the MATLAB documentation itself as a guide of what we could/should keep in the Tablicious package as a function, what should be a table class method, and what we could/should remove. According to this I summarize bellow Functions table, array2table, cell2table, struct2table, table2array, table2cell, table2struct, table2timetable, vartype, convertvars. I suggest that array2table, cell2table, struct2table remain as functions, and table2array, table2cell, table2struct, table2timetable, vartype, convertvars should be methods in table classdef. table2timetable should be there as a method and do nothing (perhaps return an error) until timetables are also supported.

Read and Write Files readtable, writetable, detectImportOptions, spreadsheetImportOptions, getvaropts, setvaropts, setvartype, preview, parquetread, parquetwrite, parquetinfo. I suggest that we let all these to be handled by the io package. Of course, we could work on these, but push patches to the io package instead of implementing these inside the Tablicious package.

Summary Information and Stacked Plot summary, height, width, istable, istabular, head, tail, stackedplot. I suggest that istable and istabular remain as functions (istabular is missing but we can implement it) and summary, height, width, head, tail, stackedplot should be methods in table classdef.

Sort, Filter, and Rearrange sortrows, unique, issortedrows, topkrows, rowfilter, addvars, renamevars, movevars, removevars, splitvars, mergevars, convertvars, rows2vars, stack, unstack, innerouter, addprop, rmprop. I suggest that sortrows, unique, issortedrows,topkrows' should be methods until table class gets merged into Octave core (by that time their functionality could be complemented by the respective core functions and removed). rowfilter should be left for the io package as the rest Read and Write Files functions. The remaining addvars, renamevars, movevars, removevars, splitvars, mergevars, convertvars, rows2vars, stack, unstack, innerouter, addprop, rmprop should be methods in the table classdef.

Join and Set Operations join, innerjoin, outerjoin, union, instersect, ismember, setdiff, setxor. These all should be implemented as methods in the table classdef (some of them are already).

Missing Values anymissing, ismissing, standardizeMissing, fillmissing. These can be handled in the statistics package for the time being, and by the time table class is merged into core Octave, they can also be merged back to core.

Apply Functions to Table Contents pivot, groupcounts, groupfilter, groupsummary, grouptransform, findgroups, splitapply, rowfun, varfun. These should be included as methods in the table classdef.

My suggestion is that all other methods presently available in the table class should be removed and those that are necessary for internal computations for the class to work properly should be made explicitly private.


Andrew's notes

Related tickets:

apjanke commented 5 months ago

Some thoughts so far:

As context, my main priorities for Tablicious are compatibility with Matlab's interface, and that it work with Octave and other OF packages as they are now (including the last version or two of Octave, for folks who haven't upgraded yet), and not relying on changes in future versions. I'm happy to make changes for compatibility with future versions of Octave etc. when they happen.

Functions

table2array, table2cell, table2struct, table2timetable, vartype, convertvars should be methods in table classdef

vartype needs to be a function or classdef because it's called without reference to a table object. It's a thing that is passed into an argument to an indexing operation on a table, so the vartype(...) call doesn't have a table input argument, and it returns an object that needs to stick around independent of a table. (It's supposed to be a classdef instead of a function, really.)

table2array, table2cell, table2struct, and convertvars are already methods in table. table2timetable doesn't exist in Tablicious yet.

Read and Write Files

readtable, writetable, detectImportOptions [...] I suggest that we let all these to be handled by the io package. [...] we could work on these, but push patches to the io package instead of [Tablicious]

The io package may be the best place for these in the long run. But I'd rather bang out my own implementations of them in Tablicious now, and offer to transfer them to io when they're complete enough to be useful. Collaborating with developers in other projects takes more time and work, especially when it's volunteers working on an as-interested basis, and I don't know what io's priorities and schedules are.

Summary Information

istable and istabular remain as functions [...] summary, height, width, head, tail, stackedplot should be methods in table classdef.

height, width, head, and tail need to be functions, since they work on non-table inputs too (like numerics), and don't exist in Octave yet. Similarly, the non-table-input version of stackedplot doesn't exist in Octave or any OF packages, and if I write the table-supporting part of it, adding the numeric-input path doesn't seem like much work, so may as well do it.

summary is a method in table.

Added istabular() and istimetable() functions here.

Missing Values

anymissing, ismissing, standardizeMissing, fillmissing.

I removed these functions, leaving only the ismissing override methods.

My suggestion is that all other methods presently available in the table class should be removed and those that are necessary for internal computations for the class to work properly should be made explicitly private.

Most of the other methods are there for a reason, because they provide functionality that I needed or wanted. It's unfortunate that there are so many methods, especially all the "structural" stuff like shiftdims, but they make table work how I want, and most of them need to be public because they are called by code outside the table class. (Even stuff like shiftdims(), which isn't supported for table arrays, but is there to produce a nice error message and/or prevent the default core Octave implementations of these functions from messing up the table object structures.)

Like, without an override, shiftdim(tbl) does this:

>> shiftdim(tbl, 1)
error: octave_base_value::permute (): wrong type argument 'object'
>>

I think that's less helpful for a user new to Tablicious or Octave than something like this:

>> shiftdims(tbl, 3)
error: Function shiftdims is not supported for tables
error: called from
    shiftdims at line 3215 column 7
>>

Though I think shiftdims is a typo and is supposed to be shiftdim. I'll fix that.

Are there particular methods you would like removed, or particular currently-public methods you would like made private? proxykeysForMatrixes is still public; that may or may not become private, depending on whether I remove support for calling it polymorphically.

pr0m1th3as commented 5 months ago

vartype needs to be a function or classdef because it's called without reference to a table object.

That's correct, I missed that.

height, width, head, and tail need to be functions, since they work on non-table inputs too (like numerics), and don't exist in Octave yet. Similarly, the non-table-input version of stackedplot doesn't exist in Octave or any OF packages, and if I write the table-supporting part of it, adding the numeric-input path doesn't seem like much work, so may as well do it.

If you add these as functions, then they will shadow Octave's core once they are implemented and this can happen before table class is pushed into core.

Most of the other methods are there for a reason, because they provide functionality that I needed or wanted.

There is a difference in whether we want to update a classdef with intent to go into core in the near future or to fit our needs. If you are developing a classdef to meet certain needs, this is great but it will take much more effort to streamline it and test it for being MATLAB compatible before going into core. I am mostly interested in the latter, because once I start adding table support in statistics, I want to be as confident as possible that I have a fully MATLAB compatible classdef to work with so that any future regression are minimized in the future.

I have worked extensively the past few days to fix compatiibility issues of your implementation with MATLAB, including subsasgn and subref methods, disp methods, and various other methods that didn't work as expected such as summary, ros2vars etc. If you feel that you need to keep your customization into the Tablicious package, obviously I cannot object to that, but I would rather keep a "clean" MATLAB compatible classdef inside the statistics package so I can work on the statistics functions which was my initial goal to begin with when I started this discussion about bring table into core Octave.

apjanke commented 5 months ago

Hmm. I'm wondering what making the methods private would do in terms of usability at this point. The funky private methods like proxykeysForOneTable don't have Texinfo help sections, so they don't show up in the generated documentation or a which proxykeysForOneTable call. But they're still visible in methods(). Hidden methods show up there too. And they get completed via tab-completion.

>> tbl = table
tbl =
Empty 0-by-0 table
>> methods(tbl)
Methods for class table:
MY_HIDDEN_METHOD        getProperties           ismatrix                mustBeAllColVectorVars  repelem                 setRowNames             stack                   tail
addvars                 getvar                  ismember                mustBeAllSameVars       repmat                  setVariableNames        subsasgn                topkrows
antijoin                getvars                 ismissing               ndims                   reshape                 setdiff                 subsetrows              transpose
cartesian               groupby                 isrow                   numel                   resize                  setvar                  subsetrows_impl         transpose_table
columns                 grpstats                isscalar                outerfillvals           resolveJoinKeysAndVars  setxor                  subsetvars              union
congruentize            hasrownames             issortedrows            outerjoin               resolveRowVarRefs       shiftdim                subsref                 unique
convertvars             head                    istable                 prettyprint             resolveVarRef           size                    summary                 varfun
ctranspose              height                  isvector                proxykeysForMatrixes    restrict                size_equal              summary_for_variable    varnames
disp                    horzcat                 join                    proxykeysForOneTable    rowfun                  sizeof                  summary_impl            vec
display                 innerjoin               length                  proxykeysForTwoTables   rows                    sortrows                table                   vertcat
end                     intersect               makeVarNamesUnique      realjoin                rows2vars               splitapply              table2array             width
evalWithVars            iscol                   mergevars               removevars              semijoin                splitvars               table2cell
findgroups              isempty                 movevars                renamevars              setDimensionNames       squeeze                 table2struct

>>

Don't know there's much I can do about that by tweaking the access modifiers on the methods. Maybe there's some way of customizing tab completion for a classdef in Octave to get them to not show up there? Overriding methods() might do it, but that sounds a little extreme.

Looks like fieldnames() needs an override, though.

Maybe the most useful thing in terms of documentation would be to remove the Texinfo helptext items for those structural etc. method overrides, so they don't show up in the generated doco at all. (They're already absent from INDEX.)

apjanke commented 5 months ago

If you add these as functions, then they will shadow Octave's core once they are implemented and this can happen before table class is pushed into core.

If they get implemented in core Octave, then I can make a new release of Tablicious which removes them. Or better yet, moves them into a shims/pre-X.Y.Z directory like with mustBeNumeric(), so they're absent when running under the new version of Octave, but still available for compatibility when running under older versions of Octave.

[...] I want to be as confident as possible that I have a fully MATLAB compatible classdef to work with so that any future regression are minimized in the future.

I agree, Matlab compatibility is a pretty high priority here. But in my view, the real measure of Matlab compatibility is in the public or externally visible behavior of table's interaction with those functions. That's what the Matlab interface documentation describes. It doesn't document what the internal implementation level interaction and calls between those functions are, how self-calls work, or stuff like that. Without those functions being available at all, it's kind of hard to assess and test what that interaction would be in the future once those functions arrive.

As a solo project that's still in the 0.x release series, I think it's easier and quicker to make changes in the Tablicious package than in code that's in core Octave or well on its way there. And it seems like the modifications for then pulling table in to core Octave would be mostly deleting the methods which are no longer needed once the core functions support table type inputs for them, which doesn't sound like a very difficult task.

[...] I would rather keep a "clean" MATLAB compatible classdef inside the statistics package [...]

That's reasonable. Tablicious's primary goal is to be usable by end users with the current versions of Octave. To me, that means providing function definitions for table-related functions where needed so that users can write table-using code the same way they would with Matlab, including migrating Matlab code over directly. For example, I want users to be able to write like if istabular (x) instead of if isa (x, 'table') || isa (x, 'timetable'), and that requires a function.

Supplying code for possible importing to core Octave is something I value too, but it's a secondary goal for me.

Maybe I just don't understand Octave's OOP support well enough, but it seems like a separate clean implementation of table would still need most of those methods to function properly?

Or maybe it's that I'm not familiar enough with Octave's builtin functions or their support for extension. Like, if I override size() in table, could it drop the isvector() and isscalar() method overrides? Are Octave's isvector() and isscalar() functions guaranteed to be implemented in terms of calls to size(), and to respect size() overrides on classdefs? I haven't seen stuff like that documented, but maybe I'm looking in the wrong place. Since I'm not sure what the interaction between those functions is (as part of their interface, not just current implementation), I overrode all the related methods to be sure of consistent behavior.

apjanke commented 5 months ago

Some of these are good candidates for moving around, though. There's a few internal-use things that could be made local or private functions, instead of private methods (which are still exposed in methods()), especially if I drop support for polymorphic extension of them (which seems fine at this point). And some methods could be functions, which, while I want to keep around, could go in the +tblish package, which gets them out of the table class's interface, and they won't collide with functions from future Octave or other packages, since I doubt they'll use the +tblish package for their own stuff.

Have a look at the WIP/table-hide-methods branch: https://github.com/apjanke/octave-tablicious/tree/WIP/table-hide-methods

And I removed the texinfo doco sections for some of the private or "structural" methods in table, which will keep them out of the generated documentation.

That should clean up the table interface a bit, in terms of both behavior in code and its appearance in the documentation.

apjanke commented 5 months ago

Oh, the other thing is that the Table I/O stuff I'm writing could be done in a separate Tablicious-IO package, so users/clients could select that independently from Tablicious. E.g. a user could install, or another package could take a (optional) dependency on, Tablicious, but still keep the Table I/O functions and classes out of their namespace if they wanted. That seems like a good idea.

apjanke commented 5 months ago

One other thing to consider – Matlab's table has 131 methods on it, including some math and statistics looking things. (Tablicious's is getting there, at 97 methods, but like you say, there's substantial non-overlap.) They don't all seem to show up in the online documentation, at least not where it's easy to see. If we're concerned about Matlab compatibility, we might want to look at the actual code and class behavior level or lower-level API reference documentation, not just the user-manual-level doco.

image
pr0m1th3as commented 5 months ago

As said previously, my intent is to have a MATLAB alike table class def as a reference to update all the functions in the statistics package. As you can understand, this needs to be a vanilla table, but fully functional in terms of identical behavior. This is what I am currently working on.

I get that you are keen into introducing various bits and pieces in the tablicious package and you prefer handling the tablicious package as a solo project. That's fine, but it just doesn't work very well for me. In statistics package, whoever from the very few few developers, who are contributing from time to time, starts working on a bug fix, a new feature or a function, I just make the assignment and wait for a PR within a reasonable time window, since we are all volunteers and everyone contributes at their own time. Of course, this does not mean that other package maintainers should follow my rule of thumb. But the same applies to me as a contributor.

Since I've started adding BISTs in the classdef, I have made numerous bug fixes and feature updates in order to keep the functionality consistent with MATLAB (at least to the extent of my testing suite). Merging back these changes will disrupt some of the functionality you already have in the Tablicious package. So I think it is best for now to keep them separately and just use my version for the statistics package, since it already relies on its functionality for the summary method, which requires median with "omitnan" option. I think this will allow us both to work independently and share the code at our own discretion on whether it fits our purpose.

apjanke commented 5 months ago

I moved the table grpstats to a tblish.grpstats function, which will get it out of the way of Statistics's grpstats function. It can still be called in Tablicious, but the package-qualification will avoid collisions. Removed the table.resize method entirely.

Moved the resolve* table methods to local/internal functions.

That should clear up some more collisions, and get the table interface size down a bit, specifically removing long "techy-sounding" method names that might confuse users.

apjanke commented 5 months ago

Fair enough. Coordinating development between projects does add overhead.

Closing this out.