alienData - species similarity measurement

david-beauchesne commented 7 years ago

iEat requires the evaluation of similarity between species. While similarity measurement methods can and will be implemented directly in the function, it may be a good idea to allow users to include a similarity matrix of their own (e.g. a variance-covariance matrix for species co-occurrence). Could / should this be added to the alienData class directly in the data formatting stage?

KevCaz commented 7 years ago

I'm not quite sure. I would go for "it depends on how many of the methods we implement will use it".

If iEat is the only method that requires a similarity matrix, I would rather let you add an argument sim to iEat that would handle a default calculation of the similarity matrix as well as the possibility of using a matrix directly provided by the user.
If most of the methods we implement will use such a matrix, we had better add it to the alienData objects.

david-beauchesne commented 7 years ago

I asked mainly because letting users provide their own similarity matrix or matrices means that we need to add tests on the data to make sure it fits with the alienData class we use for all methods. It kind of feels like we are telling users that they absolutely have to use our data class for our methods, except if they wish to use their own similarity matrix, in which case they have to make sure it's in the right format. It feels like it duplicates the data formatting process.

KevCaz commented 7 years ago

You are right, need more time to think about it :)

david-beauchesne commented 7 years ago

Well I guess we can start by assuming that similarity measurements are completely built in the function, then we can figure out how to make it more flexible. It wouldn't change how the alienData function is used right now, it would simply add to it, so I guess it wouldn't matter if we enhance it later on.

Or another option is to create a separate function that creates an array of similarity matrices, either from the alienData class, from a user-provided matrix or matrices, or a combination of both, and then exports an alienDataSimilarity class-type object for use in iEat. Similarity measurement is potentially the longest process in iEat, and separating it from the main algorithm could make sense. Users could run the functions or create their own similarity array once, then reuse it with iEat instead of having to recalculate similarity every time they wish to predict interactions. This would actually make more sense in my mind.

david-beauchesne commented 7 years ago

So to address this I've begun building a function (alienSimilarity()) that evaluates similarity between taxa and exports an alienSimilarity class object. How to let users provide their own similarity matrices can be figured out later, but it shouldn't be too hard.

SteveViss commented 7 years ago

Not quite sure the similarity is the bottleneck in your approach. I would say that preparing/getting the data (for instance, getting the taxonomic information for each species to compute the taxonomic distance) is the bottleneck. But if we use existing information from the alienData object, such as the phylo object and/or the traits data.frame, to compute similarities among species, that should be quick enough, just a simple PCoA (right?). By the way, it is still good practice to split the main function iEat into several functions.

david-beauchesne commented 7 years ago

Hmm, you are right from a data preparation standpoint. From an analysis standpoint, however, similarities are very limiting. If all you have is a network with 50 species for which you are predicting interactions, then the 50 x 50 matrix you need to create is quite small and manageable. However, when you start including a detailed catalog, e.g. with 10K species, then the 10K x 50 matrix required takes quite some time to compute, even with functions that are optimized. If measurements take 20 minutes, then it's not a function you want to run multiple times for no reason.

Here is my suggestion then. It would be very easy to build iEat to allow users to either provide a preprocessed alienSimilarity object containing similarity matrices, or to simply evaluate it within the iEat function itself. That way people with large datasets can separate the process to make it more manageable, while people with smaller datasets can apply iEat directly without having to proceed with secondary steps.

Codewise, that would simply mean that I continue building the function to create the alienSimilarity object, and call it in iEat if the user does not provide alienSimilarity as an argument. What do you think?
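
A minimal sketch of that calling pattern, written in Python purely for illustration (the package itself is R). The names `alien_similarity` and `i_eat` are hypothetical stand-ins for alienSimilarity() and iEat(), the similarity measure is a simple shared-trait count, and the prediction step is a placeholder, not the actual algorithm:

```python
from typing import Dict, FrozenSet, Optional

Traits = Dict[str, FrozenSet[str]]
SimMatrix = Dict[str, Dict[str, float]]


def alien_similarity(traits: Traits) -> SimMatrix:
    """Hypothetical stand-in for alienSimilarity(): counts shared traits per pair."""
    return {
        s: {t: float(len(traits[s] & traits[t])) for t in traits}
        for s in traits
    }


def i_eat(traits: Traits, similarity: Optional[SimMatrix] = None) -> Dict[str, str]:
    """Hypothetical stand-in for iEat().

    Placeholder logic only: returns, for each species, the most similar
    other species. The real prediction step is not reproduced here.
    """
    if similarity is None:
        # No precomputed object supplied: compute it inside the call.
        similarity = alien_similarity(traits)
    return {
        s: max((t for t in traits if t != s), key=lambda t: similarity[s][t])
        for s in traits
    }


# Illustrative trait sets, not real data.
traits = {
    "cod": frozenset({"fish", "benthic"}),
    "haddock": frozenset({"fish", "benthic", "north"}),
    "krill": frozenset({"plankton"}),
}

precomputed = alien_similarity(traits)       # expensive step, done once
first_run = i_eat(traits, precomputed)       # reuses the stored matrices
second_run = i_eat(traits, precomputed)      # no recomputation needed
direct_run = i_eat(traits)                   # small data: computed on the fly
```

Both entry points give the same result; the only difference is whether the costly step is repeated on every call.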

SteveViss commented 7 years ago

I would say:

Build your function alienSimilarity() and store the result as a component of the alienData object.
Build consistency checks between your similarity matrix and the species list (see dfNodes in the alienData object).
When the tests pass, set iEat as available in the availMeths attribute of the alienData object. @KevCaz, what do you think?
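
The three steps above could be sketched roughly as follows, in Python for illustration only (the package itself is R). Everything here is an assumption: the alienData object is reduced to a plain dict, `dfNodes` is reduced to a list of species names, and `check_similarity` and `register_similarity` are hypothetical helper names:

```python
from typing import Dict, List

SimMatrix = Dict[str, Dict[str, float]]


def check_similarity(sim: SimMatrix, species: List[str]) -> List[str]:
    """Return a list of problems; an empty list means the matrix is consistent."""
    expected = set(species)
    problems = []
    if set(sim) != expected:
        problems.append("row names do not match the species list")
    for name, row in sim.items():
        if set(row) != expected:
            problems.append(
                f"column names of row '{name}' do not match the species list"
            )
    return problems


def register_similarity(alien_data: dict, sim: SimMatrix) -> dict:
    """Attach a checked similarity matrix and mark iEat as available."""
    problems = check_similarity(sim, alien_data["dfNodes"])
    if problems:
        raise ValueError("; ".join(problems))
    updated = dict(alien_data)  # leave the input object untouched
    updated["similarity"] = sim
    updated["availMeths"] = sorted(set(alien_data.get("availMeths", [])) | {"iEat"})
    return updated


# Toy object standing in for an alienData instance.
alien_data = {"dfNodes": ["a", "b"], "availMeths": []}
good = {"a": {"a": 1.0, "b": 0.5}, "b": {"a": 0.5, "b": 1.0}}
ready = register_similarity(alien_data, good)
```

Collecting the problems in a list rather than failing on the first one lets the user see every naming mismatch at once.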

KevCaz commented 7 years ago

Hummm, this could be a good way to proceed.

We could use an argument in alienData() to specify whether or not a similarity matrix must be built, then we would use a couple of checks that would trigger different actions depending on the class of the object to be passed as argument. For instance, if it is a matrix, then we would only check rownames and colnames and use it as the similarity matrix if tests are OK.

david-beauchesne commented 7 years ago

I think it makes a lot of sense. I've already been building the function with consistency checks with alienData objects, so that'll be easy to fit in. Then I can check with @KevCaz how to incorporate it as an attribute of the alienData class.

KevCaz commented 7 years ago

We have a plan. Let's do this and close this issue once implemented.

SteveViss commented 7 years ago

Sounds good!

tpoisot commented 7 years ago

@david-beauchesne you called the method iEat? I'm sorry but are we trying to win a stupid-naming-conventions award for this package or what? If I give all of you a 🥇 will you promise to behave?

Considerations about the reasonable naming of functions aside, shouldn't this also be integrated with other packages somehow? As in, I agree that it should be a class attribute, but there are tons of existing distance measure packages, and maybe we want the users to be able to use some of them? I'm suggesting this because, depending on how traits, etc., are coded, there may be a need for more complex distance measures than we actually want to implement.

david-beauchesne commented 7 years ago

No official name yet. Dom's idea, nothing better came up, so it stuck. Better suggestions?

As for the similarity measures, yes, we want users to eventually be able to apply whatever similarity measure they might wish to use. For simplicity's sake at this point, we simply elected to make this a bit more restrictive to provide a workable function in the short term. We won't be coding any similarity measures ourselves, except for Tanimoto. The other measures we are suggesting are from vegan for now, I think. That's also why we want to give users the option to provide their own similarity matrix that we can include in the data. Makes sense?
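
For reference, a minimal sketch of the Tanimoto similarity on sets of strings, in Python for illustration only (the package itself is R). For two sets it is the size of the intersection divided by the size of the union. The function name, the taxonomy vectors, and the choice to return 0 when both sets are empty are assumptions made for this example:

```python
from typing import Iterable


def tanimoto(a: Iterable[str], b: Iterable[str]) -> float:
    """Tanimoto similarity between two collections of strings, treated as sets."""
    set_a, set_b = set(a), set(b)
    shared = len(set_a & set_b)
    total = len(set_a) + len(set_b) - shared  # size of the union
    # Assumed convention: two empty sets are given a similarity of 0.
    return shared / total if total else 0.0


# Illustrative taxonomy strings for two gadid fishes.
cod = ["Animalia", "Chordata", "Actinopterygii", "Gadiformes", "Gadidae", "Gadus"]
haddock = ["Animalia", "Chordata", "Actinopterygii", "Gadiformes", "Gadidae",
           "Melanogrammus"]

similarity = tanimoto(cod, haddock)  # 5 shared ranks out of 7 distinct strings
```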

guiblanchet commented 7 years ago

Without taking away from what has already been discussed so far, a way to go that may overlap a little with what @SteveViss proposed is to add, not an attribute, but another level to the alienData "list" named something like coSim that would include the similarity matrix the user wants to use.

guiblanchet commented 7 years ago

@david-beauchesne I just realized something about the alienSimilarity function you are currently working on. If you intend to give the user the option to use the various measures available in vegdist() from vegan, it would be better to work with distances instead of similarities. In this context, it would also be a good idea for the output to be of class dist, which, from what I understand, is more efficient with regard to storage. So instead of calling the level coSim, it could be called dist.
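
To illustrate the storage point, here is a rough sketch in Python (for illustration only; the package itself is R) of the idea behind a dist-like object: a symmetric matrix with a zero diagonal only needs its n(n-1)/2 off-diagonal values, kept in a flat vector. The function names are hypothetical and the indexing is 0-based, unlike R:

```python
from typing import List, Sequence


def to_condensed(full: Sequence[Sequence[float]]) -> List[float]:
    """Keep one triangle of a symmetric matrix as a flat list."""
    n = len(full)
    return [full[i][j] for i in range(n) for j in range(i + 1, n)]


def condensed_index(n: int, i: int, j: int) -> int:
    """Position of the pair (i, j), i != j, in the condensed vector."""
    if i == j:
        raise ValueError("the diagonal is not stored")
    if i > j:
        i, j = j, i
    return n * i - i * (i + 1) // 2 + (j - i - 1)


def lookup(condensed: Sequence[float], n: int, i: int, j: int) -> float:
    """Read a distance back; the diagonal is implicitly zero."""
    return 0.0 if i == j else condensed[condensed_index(n, i, j)]


full = [
    [0.0, 0.2, 0.5, 0.9],
    [0.2, 0.0, 0.4, 0.7],
    [0.5, 0.4, 0.0, 0.3],
    [0.9, 0.7, 0.3, 0.0],
]
condensed = to_condensed(full)  # 6 values instead of 16
```

For 10,000 species this means 49,995,000 stored values rather than 100,000,000.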

david-beauchesne commented 7 years ago

@guiblanchet it depends on the data. I'm not fully comfortable with vegdist yet, but I know that I am including tanimoto similarity to take into account strings like taxonomy and resources. Could this still be feasible with dist objects?

For the moment, I want to finish the first version of the function, which I can modify easily afterwards. I'm going for at least a working version before adding enhancements and modifications to it.

guiblanchet commented 7 years ago

What do you mean by "it depends on the data"? About the Tanimoto similarity and storing it as a dist object, I would have to check to make sure, but there are ways to convert similarities to distances; we just need to make sure the proper information is preserved.

That being said, I do think it is a good idea to finish a first (alpha) version of the function and then tweak it if necessary.
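
One common conversion, sketched in Python for illustration only (the package itself is R): for a similarity s bounded between 0 and 1, such as Tanimoto, take d = 1 - s. The function names are hypothetical, and this is one option among several, not necessarily the one the package would adopt. Because the transformation is invertible, no information is lost:

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def similarity_to_distance(sim: Matrix) -> List[List[float]]:
    """Convert similarities in [0, 1] to distances with d = 1 - s."""
    for row in sim:
        for s in row:
            if not 0.0 <= s <= 1.0:
                raise ValueError("similarities must lie in [0, 1]")
    return [[1.0 - s for s in row] for row in sim]


def distance_to_similarity(dist: Matrix) -> List[List[float]]:
    """Inverse transformation, s = 1 - d."""
    return [[1.0 - d for d in row] for row in dist]


sim = [[1.0, 0.25], [0.25, 1.0]]
dist = similarity_to_distance(sim)
recovered = distance_to_similarity(dist)  # same values as sim
```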

SteveViss commented 7 years ago

@david-beauchesne, do you need help to review your code under the branch iEat?

TheoreticalEcosystemEcology/alien, issue #32