EcoJulia / SpatialEcology.jl

Julia framework for spatial ecology - data types and utilities
Other
58 stars 7 forks source link

[WIP] Tables.jl Interface #37

Closed kescobo closed 3 years ago

kescobo commented 5 years ago

This is a WIP effort started at the JuliaCon hackathon to implement a Tables.jl interface for ComMatrices (and other types?). I got some help from Jacob Quinn to get started, now I just need to go through and fix what's breaking.

Addresses #23 (I think)

Ultimately, my plan is to remove the dependency on DataFrames and get everything working on Tables (if that makes sense to you @mkborregaard )

mkborregaard commented 5 years ago

Thanks @kescobo , I'm currently in Ecuador but will try to find time to look at this. There are a number of things - some I think belong in EcoBase, others should be more tightly integrated with the existing framework. Is the iteration interface part and parcel of the Tables one, or can they be conceived differently?

kescobo commented 5 years ago

My understanding is that you can implement a by-row interface (which is the iteration interface), or a by-column interface, or both. I started out trying to do both, though to be honest most of what's written here was @quinnj typing at my computer 😆 .

I think we still need to think about how to include thingnames in the schema for the row interface, so if it makes more sense to you, I can start out with just the column interface, get that working, and then move to the by-row interface. TBH I haven't looked at this since the hackathon, and I didn't 100% follow what Jacob was doing, but I can spend a bit more time on it next week, and you can chime in as you're available. There's certainly no rush on my end, so don't worry about taking time while you're in the field.

mkborregaard commented 5 years ago

Cool. Yeah I have generally been planning a singlecommunity and singlespecies type (or place and thing) type for many things, so it's essentially that it would be obvious to have that working right from the start (be useful to have input from @richardreeve here).

richardreeve commented 5 years ago

Sorry for the delay in getting back to you - I've been away. I haven't had much time to look at this, but it does seem like something very handy to have in EcoBase if it were possible and not too concrete for that package (which it may be!).

On the singlecommunity comment, I have OneCommunity <: AbstractPlaces subtype in Diversity.jl:src/Partition.jl - I couldn't directly inherit from or use a singlecommunity/OnePlace type in EcoBase.jl though, as all of my places types actually inherit from AbstractPartition <: AbstractPlaces...

kescobo commented 5 years ago

No worries - this is a slow-moving issue 😆

@mkborregaard Do I understand correctly that you'd like to implement that before this goes forward? Or is it that you want this in place to develop those?

TBH, I still don't completely understand how this works, but I was thinking of trying to implement it for SparseArrays first (which should be an easier case) and making a PR to Tables.jl. That should help me to understand, and would probably make implementing the types here more straightforward.

mkborregaard commented 5 years ago

Yes, I will implement that and then let's keep this open, it'll probably be a small change to this PR once that's done.

kescobo commented 3 years ago

Sorry for letting this languish so long. I got stuck again with DataFrames upgrade due to compat for this package, and so I decided to try to figure it out. Since I'm not familiar with all of the types here, and since it looks like you use DataFrames in other places internal to your types, I decided to try it out in Microbiome with a new AbstractAssemblage. You can see it on this branch, and the important bits for Tables interface are here:

AbundanceTable is basically the same as a ComMatrix

## -- Tables Interface -- ##

Tables.istable(::AbundanceTable) = true
Tables.columnaccess(::AbundanceTable) = true
Tables.rowaccess(::AbundanceTable) = true

Tables.getcolumn(at::AbundanceTable, i::Union{Int, Symbol}) = at[:, i]
Tables.getcolumn(atr::AbundanceTableRow, i::Union{Int, Symbol}) = atr[i]
Tables.columnnames(at::AbundanceTable) = [:features, Symbol.(name.(samples(at)))...]
Tables.columnnames(atr::AbundanceTableRow) = keys(atr.cols)

function Tables.schema(at::AbundanceTable)
    elt = eltype(abundances(at))
    coltypes = [eltype(features(at)), (elt for _ in 1:nsamples(at))...]
    return Tables.Schema(Tables.columnnames(at), coltypes)
end

Tables.columns(at::AbundanceTable) = (; (col => at[:, col] for col in Tables.columnnames(at))...)
Tables.rows(at::AbundanceTable) = [at[i, :] for i in 1:nfeatures(at)]

Having gone through this process with a type I built myself, I now think I understand better how I would go about doing it for a ComMatrix, but I wonder if we should think about making this part of the interface in EcoBase. I think it would require:

and not a lot else.

kescobo commented 3 years ago

So, I changed my mind again, and decided to go with AxisArrays (via the AxisIndices.jl package). I think I recall you talking about doing this with ComMatrices at some point, and I have to say, it was pretty straightforward. Adding Tables integration was a tiny bit of extra work, since AxisIndices does not implement the Tables methods, but I can now have row and column headers as a custom type, with table-wide or sample (site) specific metadata, and index into the table with strings or numbers for rows and/or columns.

I don't have all of the view machinery you do here, but I rarely end up using that in practice, I think, and it would probably be straightforward to build that in too.

I think I'm going to close this PR for now, since I'm not sure where you want to go with ComMatrices, but I'd be happy to take on building the Tables interface for you once you decide, now that I understand better how it works.

mkborregaard commented 3 years ago

So have you moved away from ComMatrix again in MicroBiome? Do you think what you've built could be a useful drop-in replacement here?

kescobo commented 3 years ago

So have you moved away from ComMatrix again in MicroBiome

Yes. Not for any principled reason, I just wanted to experiment with some stuff, and using my own package to do that made more sense to me.

Do you think what you've built could be a useful drop-in replacement here?

Not quite drop-in - it lacks view machinery, and I'm certain there are other things that aren't as polished. Another big thing is that I don't have any other data structures other than the ComMatrix-like one, so the machinery you have for spatial information and stuff is absent. That said, I think it might be pretty easy to add that as a separate axis

But it's close. If you're interested, I could factor stuff out into EcoBase.jl or some other shared package. Nice thing about AxisArrays is that the whole issue of what to call the axes (even the number of dimensions) doesn't really matter.

mkborregaard commented 3 years ago

I think an axisarray solution would make sense for commatrix. Would be cool to hear @rafaqz opinion here too

rafaqz commented 3 years ago

I dont know what ComMatix does, but you could use DimensionalData.jl and get the Tables.jl interface for free. And spatial data from GeoData.jl would be supported too.

You can also extend AbstractDimArray if you needed to add things to it, or just put extra data in the metadata field.

I wouldnt use AxisArrays.jl, there are a lot of reasons people have written 5 competing replacements for it!

mkborregaard commented 3 years ago

So apart from being the author do you have a compelling reason to prefer DimensionalData out of those five replacements ? :-)

kescobo commented 3 years ago

The primary things I want out of a data system are

  1. Ability to index using numbers or strings/symbols
  2. Ability to index SparseArrays
  3. Ability to use custom types as index

And not much else. Current implementation works for these things (it's using AxisIndices and NamedDims specifically, which aren't raw axis arrays if that makes a difference). That said, I don't feel super strongly about which representation we use - if we come to a consensus on what the ecosystem is going to use, I'm happy to use that.

Should we take discussion to zulip or to a new open issue?

rafaqz commented 3 years ago

All of those points 1:3 work, although I dont think that is unique.

Compelling reasons:

Otherwise AxisIndices/ NamedDims is fine.

DD also has the most stars now. Probably because of GeoData.jl integration as people seem to star and use both.

rafaqz commented 3 years ago

Although I think you might want a different kind of table to DD - its more designed for tables of stack layers.

But using a categorical dimension as table columns could actually be generalised, which I hadnt thought about before now.

mkborregaard commented 3 years ago

That sounds very convincing. A ComMatrix is just a Matrix where we know what the axes are with lots of dispatches defined