EcoJulia / EcoBase.jl

Synchronizing Table-like stuff #22

Open kescobo opened 3 years ago

kescobo commented 3 years ago

Purpose

In an effort to improve cross-ecosystem compatibility, it would be nice to make table-like data structures more interoperable. My view of the ecosystem is quite narrow - I'm really only aware of ComMatrix from SpatialEcology.jl and my own CommunityProfile from Microbiome.jl which took quite a bit of inspiration from the former. I also haven't used ComMatrix in the last year or so as I was trying to iterate quickly in Microbiome.jl.

cc @mkborregaard

Current advantages of CommunityProfile

julia> using Microbiome

julia> s1 = MicrobiomeSample("sample1")
MicrobiomeSample("sample1", {})

julia> s2 = MicrobiomeSample("sample2");

julia> set!(s1, :type, "stool")
MicrobiomeSample("sample1", {:type = "stool"})

julia> set!(s1, :age, 37)
MicrobiomeSample("sample1", {:type = "stool", :age = 37})

julia> sp1 = Taxon("Bifidobacterium_longum", :species)
Taxon("Bifidobacterium_longum", :species)

julia> sp2 = taxon("s__Echerichia_coli")
Taxon("Echerichia_coli", :species)

julia> cm = CommunityProfile([0 1; 3 4], [sp1, sp2], [s1, s2])
CommunityProfile{Int64, Taxon, MicrobiomeSample} with 2 features in 2 samples

Feature names:
Bifidobacterium_longum, Echerichia_coli

Sample names:
sample1, sample2

julia> cm[r"Bifido", :]
CommunityProfile{Int64, Taxon, MicrobiomeSample} with 1 features in 2 samples

Feature names:
Bifidobacterium_longum

Sample names:
sample1, sample2

julia> metadata(cm)
2-element Vector{NamedTuple{(:sample, :type, :age), T} where T<:Tuple}:
 (sample = "sample1", type = "stool", age = 37)
 (sample = "sample2", type = missing, age = missing)

Current advantages of ComMatrix (that I'm aware of)

Current incompatibilities

richardreeve commented 3 years ago

I've never looked at Microbiome.jl before, but I think there's a bit of incompatibility going on with the underlying EcoBase interface. Looking briefly at Microbiome, it seems like it can (maybe?) use types that offer that interface in places (in particular here), but it doesn't offer it itself (I think?). If it did, a lot of these current incompatibilities might go away, and you could do the plotting, etc. directly with Microbiome types. Diversity.jl, on the other hand, implements the EcoBase interface here, for instance, as does SpatialEcology.jl in a variety of places.

More generally, I think that the idea of a common way of actually storing the abundance data - or do you just mean a common tables interface, I wasn't sure? - may not work in practice. Diversity.jl stores abundances as an AbstractMatrix subtype directly in a Metacommunity object, whereas EcoSISTEM.jl stores it in two ways. For simple multithreaded code, it stores it in a GridLandscape object, whereas for multiprocess (MPI) code, it stores it in an MPIGridLandscape object, because the abundance matrix itself is distributed across multiple nodes. Because they all (I hope!) satisfy the EcoBase interface, everything should just work across the ecosystem, and you can use the SpatialEcology plotting and so on directly, irrespective of the underlying storage type. However, the last (MPI) one in particular has no flexibility in how storage is implemented, because the layout has to keep the inter-process communication efficient.

If you are just proposing a common interface, and not a common storage mechanism, then that's different, but I'm not sure what interface you're proposing - do you just mean implementing the Tables.jl interface? If so, what does implementing that involve? If it's simple and makes sense, it might just be something that can be implemented directly in terms of the EcoBase primitives, so no-one has to do anything to get it to work?

kescobo commented 3 years ago

looking briefly at Microbiome, it seems like it can (maybe?) use types that offer that interface in places (in particular here), but it doesn't offer it itself (I think?)

This seems entirely plausible - I didn't do much testing. Come to think of it, do we have a pre-made set of tests that check for compatibility? That might be a nice way to solidify the interface and make it easier to check.

I think that the idea of a common way of actually storing the abundance data - or do you just mean a common tables interface, I wasn't sure? - may not work in practice

I don't mean that they all need to have the same representation or use the same type specifically. I mostly mean that it would be nice to re-use functionality where possible, and to try as much as we can to make them inter-convertible.

I think that the idea of a common way of actually storing the abundance data - or do you just mean a common tables interface, I wasn't sure? - may not work in practice

Maybe I'm only re-proposing EcoBase :laughing:. I am not nearly as up on the rest of the EcoJulia landscape as I should be, so it's entirely possible that it's only me that needs to do any work. The impetus for this issue is that I used to use ComMatrix, but wanted some things that it didn't have, so I split off and made my own type because (a) I wasn't super familiar with SpatialEcology internals, and (b) I wanted to be able to experiment and break stuff without needing to burden @mkborregaard every time I made changes. Now, I'd like to come back to being more compatible. As I say, it may be that all of the work is on my end.

do you just mean implementing the Tables.jl interface? If so, what does implementing that involve?

After banging my head against it for a bit, it turns out to be pretty simple. You can be a Tables source, or sink, or both. I've only implemented the source bit, since that was easier and all I wanted for my use-case. To be a source, all you really need is to be able to generate an iterator of named tuples, where the keys are column names (you can implement your own row types too, but a vector of named tuples is the proto-table).
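To make the "proto-table" point concrete, here is a minimal sketch. It assumes Tables.jl is installed, and the sample and taxon names are made up for illustration. It only shows that a plain vector of named tuples is already accepted as a table, without defining any custom type.

```julia
using Tables  # assumes Tables.jl is installed

# Toy rows, one per sample; the column names are hypothetical.
rows = [
    (sample = "sample1", taxon_a = 0, taxon_b = 3),
    (sample = "sample2", taxon_a = 1, taxon_b = 4),
]

# A vector of named tuples is treated as a row table out of the box.
Tables.istable(rows)

# Convert to the column-oriented form: a NamedTuple of column vectors.
cols = Tables.columntable(rows)
cols.sample
```

Anything that consumes Tables.jl sources should be able to take `rows` directly, which is why producing such an iterator is enough to be a source.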

If it's simple and makes sense, it might just be something that can be implemented directly in terms of the EcoBase primitives, so no-one has to do anything to get it to work?

I think we could definitely implement a fall-back interface on the primitives, which could then be modified as needed by other packages.
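A rough sketch of what such a fall-back could look like, assuming Tables.jl is installed. To avoid guessing at real signatures, `ToyAssemblage`, `thingnames`, `placenames`, and `occurrences` are local stand-ins named after the EcoBase primitives, not the EcoBase definitions themselves, and `ToyProfile` is a hypothetical concrete type.

```julia
using Tables  # assumes Tables.jl is installed

# Local stand-ins for an EcoBase-style abstract type and primitives (hypothetical).
abstract type ToyAssemblage end
function thingnames end    # names of the things (e.g. taxa)
function placenames end    # names of the places (e.g. samples)
function occurrences end   # things × places matrix

# Fall-back Tables source, written once against the primitives.
Tables.istable(::Type{<:ToyAssemblage}) = true
Tables.rowaccess(::Type{<:ToyAssemblage}) = true
function Tables.rows(a::ToyAssemblage)
    cols = (:place, Symbol.(thingnames(a))...)
    occ = occurrences(a)
    # One named tuple per place: its name followed by the abundance of each thing.
    return [NamedTuple{cols}((p, occ[:, j]...)) for (j, p) in enumerate(placenames(a))]
end

# A concrete type only has to supply the primitives.
struct ToyProfile <: ToyAssemblage
    occ::Matrix{Int}
    things::Vector{String}
    places::Vector{String}
end
thingnames(p::ToyProfile) = p.things
placenames(p::ToyProfile) = p.places
occurrences(p::ToyProfile) = p.occ

profile = ToyProfile([0 1; 3 4], ["taxon_a", "taxon_b"], ["sample1", "sample2"])
tbl = Tables.columntable(profile)
```

A package with a more efficient layout could still override `Tables.rows` (or provide column access) for its own type; the fall-back only covers the default case.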

richardreeve commented 3 years ago

Cool. That all sounds good to me. I think what I understood originally did sound a bit like a re-proposal of EcoBase, but in fact I think that adding in some tests that answer "Do I implement EcoBase?" would be really helpful - we could even think about it in terms of sources and sinks like the Tables interface you describe. And adding in the core Tables interface through the current EcoBase primitives would be really nice too. Then we can have a think about whether there are enough commonalities in the implementations to think about common storage mechanisms - my feeling is that if we can interoperate anyway, it may not be a high priority though.
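As a starting point for the "Do I implement EcoBase?" tests, something like the following might work. It is only a sketch: `thingnames`, `placenames`, and `occurrences` are local stand-ins named after the EcoBase primitives, and `test_assemblage_interface` and `ToyProfile` are hypothetical names. It uses just the `Test` standard library.

```julia
using Test

# Local stand-ins named after EcoBase primitives (hypothetical signatures).
function thingnames end
function placenames end
function occurrences end

# Reusable check that a value supports the assumed interface.
function test_assemblage_interface(x)
    @testset "assemblage interface for $(typeof(x))" begin
        # Each primitive must have a method for this type.
        for f in (thingnames, placenames, occurrences)
            @test hasmethod(f, Tuple{typeof(x)})
        end
        # The occurrence matrix must be things × places.
        @test size(occurrences(x)) == (length(thingnames(x)), length(placenames(x)))
    end
end

# A toy type that implements the three primitives.
struct ToyProfile
    occ::Matrix{Int}
    things::Vector{String}
    places::Vector{String}
end
thingnames(p::ToyProfile) = p.things
placenames(p::ToyProfile) = p.places
occurrences(p::ToyProfile) = p.occ

result = test_assemblage_interface(ToyProfile([0 1; 3 4], ["taxon_a", "taxon_b"], ["sample1", "sample2"]))
```

Each package could then call the shared function from its own test suite with an instance of its own type.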

There's another suggestion on Zulip that we think about providing the same interface as a trait-like thing rather than imposing inheritance on it, which could tie in nicely with providing the tests. I think the idea would be that if you did the inheritance, you wouldn't need to worry about the traits, but you could provide them instead...
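One possible shape for the trait idea, sketched with the usual "Holy traits" pattern. Every name here (`AssemblageStyle`, `IsAssemblage`, `NotAssemblage`, `ToyAbstractAssemblage`, `ExternalCounts`, and the local `occurrences` and `nthings`) is hypothetical and not taken from EcoBase or the Zulip discussion.

```julia
# Hypothetical trait; none of these names come from EcoBase.
abstract type AssemblageStyle end
struct IsAssemblage <: AssemblageStyle end
struct NotAssemblage <: AssemblageStyle end

# Default: a type is not an assemblage unless it says so.
AssemblageStyle(::Type) = NotAssemblage()

# Inheritance route: subtypes of the abstract type get the trait for free.
abstract type ToyAbstractAssemblage end
AssemblageStyle(::Type{<:ToyAbstractAssemblage}) = IsAssemblage()

# Opt-in route: an unrelated type declares the trait and a primitive directly.
struct ExternalCounts
    counts::Matrix{Int}   # things × places
end
AssemblageStyle(::Type{ExternalCounts}) = IsAssemblage()
occurrences(x::ExternalCounts) = x.counts

# Generic code dispatches on the trait, not on the supertype.
nthings(x) = nthings(AssemblageStyle(typeof(x)), x)
nthings(::IsAssemblage, x) = size(occurrences(x), 1)
nthings(::NotAssemblage, x) =
    throw(ArgumentError("$(typeof(x)) does not implement the assemblage interface"))

n = nthings(ExternalCounts([0 1; 3 4; 5 6]))
```

With this layout, a type that already has its own supertype can still opt in, while anything that inherits from the abstract type never has to mention the trait.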

mkborregaard commented 2 years ago

Sorry guys, I've been busy, will take a look