EcoJulia / Diversity.jl

Julia package for diversity measurement
http://docs.ecojulia.org/Diversity.jl/
BSD 2-Clause "Simplified" License
34 stars 9 forks source link

A common base for SpatialEcology.jl, Diversity.jl and Microbiome.jl? #23

Closed kescobo closed 5 years ago

kescobo commented 6 years ago

I'm not sure if this is the best place to start, but I've been following the conversation on BioJulia/Phylogenies.jl#1 and I thought it might be worth starting a conversation (inspired by @mkborregaard 's comment somewhere... maybe discourse?) about moving some of the functionality I've been developing in Microbiome.jl to a more generic package. My hope for that package is ultimately to be a sort of wrapper around functionality pulled in from a lot of other places.

Anyway, I implemented principal coordinate analysis using a distance matrix as input, as well as some plot recipes to make them pretty. Would this be the right package to add them in to? Happy to go with a different implementation than I have, I just wrote what worked for me at the time.

mkborregaard commented 6 years ago

Nice. I don't know if it necessarily needs to be consolidated into one package, though I definitely think it makes sense to put things that are relevant to more than just microbiology with other general ecology functions. But what would be cool is to agree on a common set of types, so that the same data object can be used with all the different packages that do ecology.

For instance Diversity's MetaCommunity essentially tries to do the same thing as SpatialEcology's Assemblage, and I'd really like to consolidate them - I will visit Richard as soon as we find a time where both are available :-) SpatialEcology tries to only be a package that defines basic types, constructors, and view-/indexing operations on them - the idea is that other ecology packages should be able to depend on it cheaply and define operations easily.

I think your AbundanceTable is essentially the same as SpatialEcology's ComMatrix{Int} type. If you'd be willing to consider using that type instead I'd be happy to make a PR with the change for your review. It would give a nice synergy I think.

kescobo commented 6 years ago

@mkborregaard I agree 100% with interoperability, and am happy to consider a PR. One wrinkle is that I need abundance tables to be able to take Floats as well - we often express these things as relative abundance (eg between 0 and 1).

kescobo commented 6 years ago

Nice. I don't know if it necessarily needs to be consolidated into one package, though I definitely think it makes sense to put things that are relevant to more than just microbiology with other general ecology functions.

I didn't mean to say that everything I'm developing for Microbiome.jl should be ported here, it's just that PCoA seems to make sense with Diversity.jl. Though another option is to put it in MultivariateStats.jl, since they already have PCA and LDA....

mkborregaard commented 6 years ago

Interesting, I never considered that. Should be easy to add. Would it make sense when implementing that to enforce 0 <= x <= 1?

mkborregaard commented 6 years ago

👍 for MultivariateStats and then dependence on that from ecology packages but that's of course up to Richard in this case.

kescobo commented 6 years ago

Would it make sense when implementing that to enforce 0 <= x <= 1

Not sure - I often import tables that are expressed as percent relative abundance (eg between 0.0 and 100.0), but I think that's kind of silly and it's trivial to transform that to make it a fraction instead on import (which is what I do anyway). I've already got a function that takes such a table and mutates it from counts or whatever to relative abundance as well.

👍 for MultivariateStats and then dependence on that from ecology packages

The more I think about it - the more I think this makes the most sense, presuming the folks at MultivariateStats are game. I'll leave this open pending discussion there and feedback from @richardreeve

kescobo commented 6 years ago

https://github.com/JuliaStats/MultivariateStats.jl/issues/58

richardreeve commented 6 years ago

Sorry for the delay - I'm on a ropy connection in Tanzania at the moment! I agree with you both that JuliaStats/MultivariateStats might be a better home for this as it's a much more general thing than just a diversity issue. I've been able to incorporate more basic things I needed into their packages in the past without any great difficulty by creating PRs in the style of their code... it can be harder to start a discussion without so code to refer to though.

richardreeve commented 6 years ago

I think the more fundamental issue is the one that you've been discussing second though - how to get consistent structures for data that allow us to easily use datasets in different packages - @mkborregaard and I have been trying to meet to discuss this for a few months, without any obvious success! In particular, I would like to attach abundances and relative abundances (i.e. ints and floats) and presence/absence (bits/booleans) to a whole series of different objects, from the simplest - just a list of names representing, say, species to phylogenies (where one dimension of the [relative] abundance matrix corresponds to the tips), sequence data (where one dimension corresponds to the sequences), taxonomies, and a variety of other things. The other dimension(s) also need to be able to represent samples, sites or some other category, or some one or more dimensional spatial layout. Some of this is already covered by this package, some by SpatialEcology, the sequence stuff presumably by Microbiome. I am looking hard at AxisArrays as a potential solution to many of these problems, and have been playing around with it a lot recently, but things like ComMatrix and AbundanceTable do something similar, though I'm not necessarily sure what their advantages are at this point (I know ComMatrix uses sparse arrays, but AxisArray can too, for instance). In any event, when I get back next week, I would very much like to try a lot harder to sort this out...

kescobo commented 6 years ago

💯 on interoperability in general - I agree that a lot that we do with the microbiome is ultimately ust applications of ecology. Anything I can do to help or provide feedback, let me know!

mkborregaard commented 6 years ago

@richardreeve yes, AxisArrays may be what we need in some form. ComMatrix was first a typealias for NamedMatrix{<:Union{Int, Bool}}, then a thin wrapper around a NamedMatrix, to give more control. But in the end the method forwarding had some issues (e.g. NamedArrays don't deal well with views into sparse Matrices, which SpatialEcology uses all the time), and I figured it was easier to keep track of the margin names myself. AxisArrays is probably better, so def worth looking into basing ComMatrix on that. There is no inherent advantage to using a ComMatrix rather than a basic AxisArray, except for dispatch - programming functions with sensible intuitive names that do intuitive things to the ComMatrix because we know what it is. Having a type allows e.g. enforcing species on columns and sites on rows (or vice versa), reducing the amount of mental fuel expended to keep track on such things and hopefully reducing bugs.

mkborregaard commented 6 years ago

Richard, we're debating the relative merits of

    species
 ------------
s|
i|
t|
e|
s|

vs

     sites
 ------------
s|
p|
e|
c|
i|
e|
s|

The first is consistent with R and the tidy data convention, the second may be faster if most operations work on a by-community scale (rather than by-species scale), given the column-major nature of julia. What are your thoughts on this?

richardreeve commented 6 years ago

I definitely go with the latter, and do in R as well incidentally, because I often have data in dataframes, which as as column-major as they get. However, using AxisArrays would potentially allow us to not enforce this:

julia> ab = AxisArray([1 2; 3 4; 5 6], Axis{:species}([:sp1, :sp2, :sp3]), Axis{:sites}([:site1, :site2]))
2-dimensional AxisArray{Int64,2,...} with axes:
    :species, Symbol[:sp1, :sp2, :sp3]
    :sites, Symbol[:site1, :site2]
And data, a 3×2 Array{Int64,2}:
 1  2
 3  4
 5  6

julia> ab[Axis{:species}(:sp2)]
1-dimensional AxisArray{Int64,1,...} with axes:
    :sites, Symbol[:site1, :site2]
And data, a 2-element Array{Int64,1}:
 3
 4

julia> ab[Axis{:sites}(:site2)]
1-dimensional AxisArray{Int64,1,...} with axes:
    :species, Symbol[:sp1, :sp2, :sp3]
And data, a 3-element Array{Int64,1}:
 2
 4
 6

Although we'd have to agree on some kind of standard for the axis naming, or store that information somewhere. We could of course strongly recommend that people used one ordering, and indeed not even mention that the other works, but it would give us flexiibility.

mkborregaard commented 6 years ago

Great, I'll go and rewrite every single matrix operation in SpatialEcology to work on a transposed matrix then :scream:

mkborregaard commented 6 years ago

No, I'm willing to do that. But, though I do see where both of you are coming from, there are many operation we do on community matrices all the time which summarize over species, i.e. the other way. You could be right that it is generally less common, though.

richardreeve commented 6 years ago

I think this kind of thing is why we need to meet up of course - useful though this is!

mkborregaard commented 6 years ago

Hmm it looks like the column names are part of the type signature of axisarrays:

typeof(ab)
#AxisArrays.AxisArray{Int64,2,Array{Int64,2},Tuple{AxisArrays.Axis{:species,Array{Symbol,1}},AxisArrays.Axis{:sites,Array{Symbol,1}}}}

So it would be doable to specialise ComMatrix as a parametric typealias of an AxisArray, and then dispatch our functions on ComMatrix. We could do the same for SubComMatrix, and then define AbstractComMatrix (which all the dispatching occurs on) as a Union rather than as an abstract type. Something like

const ComMatrix = AxisArrays.AxisArray{T,2,Array{T,2},Tuple{AxisArrays.Axis{:Species,Array{<:Union{String, Symbol},1}},AxisArrays.Axis{:Sites,Array{<:Union{String, Symbol},1}}}} where T<:Union{Float64, Int, Bool}

const SubComMatrix =AxisArrays.AxisArray{T,2,SubArray{T,2,Array{Float64,2},I,true},Tuple{AxisArrays.Axis{:Species,Array{<:Union{String, Symbol},1}},AxisArrays.Axis{:Sites,Array{<:Union{String, Symbol},1}}}} where T<:Union{Float64, Int, Bool} where I

const AbstractComMatrix = Union{SubComMatrix4, ComMatrix}

That would still allow us to dispatch freely on ComMatrix (and do syntactic sugar as species(x::AbstractComMatrix, sp) = x[Axis{:Species}(sp)], while allowing ComMatrix full AxisArray capabilities.

richardreeve commented 6 years ago

Yes, absolutely - that was the kind of thing I was thinking of. I was actually wondering whether you could do something like:

ab = AxisArray([1 2; 3 4; 5 6], Axis{Species}([:sp1, :sp2, :sp3]), Axis{Subcommunities}([:site1, :site2]))

where Species and Subcommunities are the types of thing in the communities <: AbstractTypes, and the community structure themselves <:AbstractPartition, and then do a parametric typealias on that. Unfortunately Axis{X} requires X to be a symbol, and doesn't seem to allow implicit conversion (with convert()) which was the only fix I could think of.

mkborregaard commented 6 years ago

I've got to admit I'm not completely sharp on your types to the degree that I could see what that would entail - I guess youre thinking of running subdiv on a MetaCommunity that is/has a ComMatrix as defined above, and use a slightly different AxisArray typealias to store the result?

richardreeve commented 6 years ago

All I was thinking was that in your case you fix the ComMatrix to have an axis called :Species, and I was hoping to allow it to be any appropriate category. We’ve had to call this a “Type” (somewhat confusingly in Julia), because there was objection to something like a sequence being referred to as a species.

mkborregaard commented 6 years ago

Ah, I get it now! @kescobo calls the same thing a "feature". This is the interesting thing regarding generality - e.g. you'd both like to work with strictly abundances, but the features can be anything. Whereas I tend to want to make sure things are really species, because I use the objects to connect to a phylogeny and an ecological network etc.

We can leave the Symbol unspecified for one of the axes to get a more open type, and possibly check the symbol against a Set of allowed symbols in the constructors, but if it's only for syntax rather than because we need to discern dispatch we could also just call it "Feature" and then use getters with the names we want (species(x, y) = feature(x,y) and so forth)

kescobo commented 6 years ago

I rather like this plan (to the extent I understand it). In the end we're talking about operations on tables where the names of rows and columns are useful. The rest is "just" context and readability (though of course context and readability are really important).

mkborregaard commented 6 years ago

Both, I've been thinking a lot about how general to make the type (i.e. ":Feature", vs ":Species", vs "Union{:Feature, :Species}"). Many things are a lot more intuitive conceptually when the term is specific (e.g. "richness", "rangesize" strictly pertain to a species), but there are also advantages of generality (e.g. OTU instead of species in microbiology). I am not sure generality is really the best here, though; often operations don't extend as easily conceptually as they operationally (think about the current terrible conflation of niche space with trait space in ecology, which is driven by computational similarities rather than theoretical ones). So, could you (both) give me some examples where the axes of a CommunityMatrix are not species and sites/samples , but it would be obvious to apply all the same operations? It seems like you've both done such analyses.

richardreeve commented 6 years ago

Well, the latter is easy. Using the AxisArray type, we're thinking of allowing both categorical sites, and quantitative sites based on quadrats, so the "site" bit would be Axis{:latitude}(0m:100m:100km) and (possibly additionally) Axis{:longitude}(0m:100m:100km) as a third dimension. Or alternatively Axis{:time}(0s:5*365*24*60*60s:20*365*24*60*60s) for 5-yearly samples to calculate temporal changes.

richardreeve commented 6 years ago

As far as the former is concerned, we use similarity to connect individuals or groups, so our groups can be taxonomic (e.g. sub-species, species, genera, families) or individuals (e.g. tips on a tree, sequences in an alignment), or groups with the same phenotype, etc. We have plenty of examples where we need this.

mkborregaard commented 6 years ago

OK, that use of AxisArray is very different from how I approach it in SpatialEcology - where I attach specific spatial objects to keep track of coordinates and always have a 2-d matrix of occurrences. But - essentially - if you plan to define all your diversity functions to dispatch on an AxisMatrix then they should be able to work on SpatialEcology.ComMatrix objects out of the box anyway.

kescobo commented 6 years ago

Hmm - the comment I wrote apparently didn't get posted, sorry for the delay 😞

In my lab, we have several outputs from our standard metagenomics analysis, most of which produce a table where columns are individual samples (often one sample per individual subject, but sometimes multiple longitudinal samples per subject, technical replicates or environmental samples), and rows are any of a number of features:

For all of these, we have a number of normalization, transformation, distance calculation and plotting steps, some of which are specific to data type, but many of which cross multiple datatypes

richardreeve commented 6 years ago

@mkborregaard We don't have any definite plans for the quantitative axes at this point. We're still speculating - at the moment, my main concern is to get everything working with 1.0 without disrupting our ongoing use of the code. We've now got Unitful and Phylo fixed, and some preliminary work with AxisArrays and JuliaDB, but we aren't going to fully commit to anything until those are all working with 0.7-pre / 1.0. I can't see that we wouldn't use the variety of different categorical types that @kescobo and I have discussed above, because it's so easy to do it that way, and so essential to our work. As far as the quantitative stuff is concerned though - I am still waiting for that meeting with you about how you see it all working :).

mkborregaard commented 6 years ago

You'll not be at the IBS in Evora in March by any chance?

richardreeve commented 6 years ago

Sadly I didn't know about it, so no...

kescobo commented 6 years ago

Hi guys - I was wondering if a consensus has been achieved at least w/r/t the backend underlying data matrices. Ultimately, I think @mkborregaard's notion of having functions to access data that make sense for underlying data types, even if they're just aliases of other functions, is probably a good one.

But right now, I sort of assume everything starts out in the form of a DataFrame because that's my workflow. I'd like to move towards something a bit more generic (like IterableTables?), at least in terms of getting data in and out of my types. I'm not sure if that's covered by the dependency on Spatial Ecology that M is putting together, or if it's something I can work towards independently.

Thoughts?

richardreeve commented 6 years ago

We're slowly getting there! @mkborregaard and I finally had a meeting last week to discuss how to merge things, and we've created a common interface and type hierarchy - currently in alpha at mkborregaard/EcoBase.jl - that we will hopefully both transition to shortly for Diversity and SpatialEcology. After that, the idea is to see how many actual concrete types we can put into there, and how many will need to stay separate - we're hoping that we can move as much as possible into EcoBase, and that other packages can then also use it as it develops.

kescobo commented 6 years ago

Lol, I love that all of the supertypes are just things - that nicely gets around the issue of what to call it :-P. Presumably all of those will just be overloaded with application-specific function and type aliases?

Seems great - please let me know if I can help! And @mkborregaard, the organization of that package at this point looks straightforward enough that, once you think it's in a usable state, I'd be happy to attempt to make the transition on Microbiome.jl and just have you review my changes rather than you doing the PR (if you think that would be easier... of course I don't mind if you just want to tackle it).

richardreeve commented 5 years ago

This is now fully implemented in Diversity v0.4.5, and you can plot Diversity Metacommunity objects, and calculate diversity of SpatialEcology Assemblage objects!

mkborregaard commented 5 years ago

:tada: 💖