JuliaData / DataAPI.jl

A data-focused namespace for packages to share functions
Other
33 stars 13 forks source link

add nrow and ncol #40

Closed bkamins closed 3 years ago

bkamins commented 3 years ago

Following the discussion in https://github.com/JuliaAI/ScientificTypes.jl/issues/137#issuecomment-885258513.

CC @quinnj @nalimilan @ablaom

codecov[bot] commented 3 years ago

Codecov Report

Merging #40 (0e62aad) into main (708b8ba) will increase coverage by 1.28%. The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main      #40      +/-   ##
==========================================
+ Coverage   91.30%   92.59%   +1.28%     
==========================================
  Files           1        1              
  Lines          23       27       +4     
==========================================
+ Hits           21       25       +4     
  Misses          2        2              
Impacted Files Coverage Δ
src/DataAPI.jl 92.59% <100.00%> (+1.28%) :arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 708b8ba...0e62aad. Read the comment docs.

nalimilan commented 3 years ago

Makes sense. Maybe we should also say that Tables.jl is the owner of the function, and add a fallback definition there (that would notably work for vectors of names tuples)? I'm also not sure whether we should provide a docstring here, or in Tables.jl instead, given that we don't define any method here (cf. discussion regarding join functions).

bkamins commented 3 years ago

@quinnj - is making Tables.jl an owner feasible? I was thinking about it also, but I was not sure.

An alternative would be to make the default implementation in DataAPI.jl that would return missing (signaling that it is unknown)? Then clearly Tables.jl could add extra methods in cases when nrow or ncol is known.

nalimilan commented 3 years ago

I'd rather throw a MethodError than return missing for unsupported types. The latter could be useful for table objects for which the number of rows cannot be known in advance (e.g. iterators of named tuples).

bkamins commented 3 years ago

OK - makes sense. With the definitions I gave now MethodError is thrown appropriately. So I assume that if we merge and release it Tables.jl should add nrow and ncol definitions to the tables it ships. Right?

bkamins commented 3 years ago

Ah - and I made the docsting, as as opposed to join* case I think we have a fixed API here (as opposed to join* where different packages can probably have different args/kwargs)

ablaom commented 3 years ago

Just FYI: For a Tables.jl implementation one might look at what we have currently in MLJBase.jl (courtesy of @OkonSamuel): https://github.com/JuliaAI/MLJBase.jl/blob/f21d295d3f0714ec8f5dcc93d1aa34ccbf16fdff/src/interface/data_utils.jl#L77 .

quinnj commented 3 years ago

I'm fine with adding these definitions here; I've mentioned elsewhere in various Tables.jl issues, however, that nrow and ncol aren't well-defined for all "tables" (in the Tables.jl source definition). You can have a row-iterator table with Base.SizeUnknown() or a filtering generator of NamedTuples and you won't be able to call nrow on those.

Now, I will say that's just the row-based table case. For column-based tables, we do expect the # of rows to be defined, and Tables.jl has Tables.rowcount internally defined that works on any result of Tables.columns(x).

And so, I'd say I don't think I'm on board with having Tables.jl be the owner, because I'm afraid that will give the impression that nrow is now supported on any table. It makes it tricky though because to rename Tables.rowcount to nrow, it kind of needs to be the fallback definition. It's a little messy.

bkamins commented 3 years ago

So - in summary. What I propose now is just what is needed if I understand the comments correctly. Let us wait for @nalimilan to review.

bkamins commented 3 years ago

@nalimilan - could you please review. After merging this I would make a 1.9 release. Thank you!

bkamins commented 3 years ago

Thank you! Making a release

ablaom commented 3 years ago

cc @OkonSamuel

matthieugomez commented 3 years ago

Would not nrows and ncols be better names? It would be more consistent with Base (e.g. ndims, nfields, etc). (see also https://github.com/JuliaData/DataFrames.jl/issues/2247)

bkamins commented 3 years ago

I think it is too late to change this for DataFrames.jl, and given we will stick to nrow and ncol there it is better to use a common naming scheme across packages.