JuliaAI / ScientificTypes.jl

An API for dispatching on the "scientific" type of data instead of the machine type
MIT License
96 stars 8 forks source link

Exclude certain kinds of dictionary tables from being interpreted as tables #189

Closed ablaom closed 1 year ago

ablaom commented 1 year ago

This PR will:

ablaom commented 1 year ago

According to local testing, this PR will resolve https://github.com/JuliaAI/MLJText.jl/issues/25 .

ablaom commented 1 year ago

@bkamins Be great if you could have a look at this, thanks.

codecov-commenter commented 1 year ago

Codecov Report

Merging #189 (fb34924) into dev (6ab77cf) will increase coverage by 0.04%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##              dev     #189      +/-   ##
==========================================
+ Coverage   93.50%   93.54%   +0.04%     
==========================================
  Files           6        6              
  Lines         431      434       +3     
==========================================
+ Hits          403      406       +3     
  Misses         28       28              
Impacted Files Coverage Δ
src/ScientificTypes.jl 100.00% <100.00%> (ø)
src/convention/scitype.jl 93.18% <100.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

bkamins commented 1 year ago

Apart from a comment related to https://github.com/JuliaData/Tables.jl/issues/309 I have a general design question that I think would be worth discussing.

The issue is that Tables.istable is only a one-way promise:

Check if an object has specifically defined that it is a table. Note that not all valid tables will return true, since it's possible to satisfy the Tables.jl interface at "run-time", e.g. a Generator of NamedTuples iterates NamedTuples, which satisfies the AbstractRow interface, but there's no static way of knowing that the generator is a table.

This means that is we see true we know that something is a table, but when we see false it does not mean it is not a table. This is indeed unfortunate.

The problem is that some MLJ.jl users can use a table for which Tables.istable returns false. Then they have a problem because your current mechanisms will not let them such a table be treated as a table, but rather it will get :other treatment.

You know the MLJ.jl ecosystem much better than me 😄, so I do not feel qualified to give informed proposals what to do with this, but clearly the sentence:

any table type T supported by Tables.jl

is not true currently. It probably should be made more precise (if nothing is changed) to saying that table is only considered something for which Tables.istable returns true except these two excluded cases. The other solution could be that a separate method would be for something that user explicitly wants to be considered as table, and separate for things not to be considered tables (but I am not sure if this is feasible or convinient, it also might be breaking; so to do not treat this suggestion as something strong).

bkamins commented 1 year ago

Given the discussion in https://github.com/JuliaData/Tables.jl/issues/309 and Tables.jl 1.10.0 release the PR should be updated to AbstractString.

ablaom commented 1 year ago

@bkamins Thanks for the review and comments. I have now updated the excluded key type from String to AbstractString.

It probably should be made more precise (if nothing is changed) to saying that table is only considered something for which Tables.istable returns true except these two excluded cases.

I have clarified the ScientificTypes.jl documentation to that effect, and raised an issue to do the same in MLJ docs.

As far as allowing tables in MLJ for which Tables.istable is false, I can't think of anything better than adding a wrapper for objects to declare them as bona fide tables, with Tables.jl methods passing through, which I guess is what you are suggesting. I suppose the main use case in ML for these tables would come from tabular data not fitting into memory. MLJ hasn't really been designed to train using out-of-memory datasets, so this probably fit's into the broader design discussion at, e.g., https://github.com/JuliaML/MLUtils.jl/issues/61 .