Rmomal / EMtree

Infers species direct association networks
https://rmomal.github.io/EMtree/
7 stars 2 forks source link

Post-treatments and warning messages #4

Open GENG126 opened 2 years ago

GENG126 commented 2 years ago

"data" is the species abundance matrix

"site" is the environmant covariance matrix

HU is a factor in site

fit PLN model

PLNMODEL<- PLN(data~site$HU) Initialization... Adjusting a PLN model with full covariance model Post-treatments... DONE! Warning message: In dpois(y, mu, log = TRUE) : non-integer x = 1.214953

Follow is my questions:

  1. I am not sure whether the warning message is because of too limit data or too many zero in the data;
  2. The model can still run and I can get the result of the inference network, so I am curious whether this result is robust;
  3. I also want to know that if the result is not robust what can I do to infer species interaction network?
Raphaellemomal commented 2 years ago

I do not have an intuition for this warning but your questions are very relevant. Could you please provide:

Also, the robustness of your network is linked to the stability of the edges. Could you please provide the stability curve for the stability selection step ? (plot given by the EMtree::StaATS() function, you will have an example in the vignette here).

GENG126 commented 2 years ago

I do not have an intuition for this warning but your questions are very relevant. Could you please provide:

  • the dimension of your data?
  • the summary of the vector of percentage of zeros for each species ?

Also, the robustness of your network is linked to the stability of the edges. Could you please provide the stability curve for the stability selection step ? (plot given by the EMtree::StaATS() function, you will have an example in the vignette here).

  1. data dimension "data" is the species abundance matrix,with 8 obs. of 32 variables, so we have 32 species data in 8 years; QUESTION The rows here are species abundance in different year (actually is dry season in 8 years, I want to see whether there is difference in species interaction in different season, and we have 8-year data), because we have different number of plots each year, so we add up to total abundance of one species each year. I don't know if this is feasible, or I need to change "years" to "plots"? data

    > dim(data)
    [1]  8 32

    "site" is the environment covariance matrix, with 8 obs. of 8 variables, variables including disturbs such as human disturb, temperature etc. site

    > dim(site)
    [1] 8 8
  2. zeros percentage for each species ··· 0 0 37.5 75 50 0 87.5 87.5 87.5 87.5 0 0 87.5 75 0 25 0 0 0 0 0 25 0 0 62.5 37.5 87.5 87.5 0 12.5 100 100

    But if I exclude species which 0 occur more than 25%, the same warning as I mentioned before.

    PLNMODEL<- PLN(data1~site$HU) Initialization... Adjusting a PLN model with full covariance model Post-treatments... DONE! Warning message: In dpois(y, mu, log = TRUE) : non-integer x = 1.214953

  3. stability curve I can't run the EMtree::StaATS() function, and in the example the function name is "StATS()"

    > library(EMtree)
    > EMtree::StaATS
    error: 'StaATS' is not an exported object from 'namespace:EMtree'
    >   stab_selection=StATS(ResampEmtreeFit$Pmat, nlambda=50, stab.thresh=0.9,plot=TRUE)
    Error in StATS(ResampEmtreeFit$Pmat, nlambda = 50, stab.thresh = 0.9,  : 
    No "StATS" function
Raphaellemomal commented 2 years ago

It is possible the warning is a result of your very few number of observations (rows).

  1. I would recommend you consider all plots as rows in your data (so change "years" to "plots" as you say), and add a "year" covariate as a factor in your "site" matrix. You will have to include the year covariate in the PLNmodel.
  2. You should not consider species with no counts (100% zeros). Even if this does not solve the issue, no counts at all is like having no data, they are unnecessary dimensions.
  3. Please update your version of EMtree, the StATS function is very helpful to set the appropriate threshold on your edges regarding the stability of your network.
GENG126 commented 2 years ago

It is possible the warning is a result of your very few number of observations (rows).

  1. I would recommend you consider all plots as rows in your data (so change "years" to "plots" as you say), and add a "year" covariate as a factor in your "site" matrix. You will have to include the year covariate in the PLNmodel.
  2. You should not consider species with no counts (100% zeros). Even if this does not solve the issue, no counts at all is like having no data, they are unnecessary dimensions.
  3. Please update your version of EMtree, the StATS function is very helpful to set the appropriate threshold on your edges regarding the stability of your network.

Thank you Momal, I will try, thank you very much!

GENG126 commented 2 years ago

Hi, Momal. I could not understand the threshold setting in "stability". My questions are:

freqs<-freq_selec(ResampEmtreeFit$Pmat,Pt=0.2)
a_first_idea_of_network<-1*(freqs>0.8)
  1. "Pt" here is the threshold came from "stability"? I am quite confused. For example, I need to run the function StATS(). set the stability threshold=0.9.
    stab_selection=StATS(ResampEmtreeFit$Pmat, nlambda=50, stab.thresh=0.9,plot=TRUE)
    stab_selection$lambda_opt
    ##0.031

    Rplot01 So the Pt here I need to set as 0.031?

  2. And I don't understand why "the optimal frequencies still need to be thresholded in order to obtain a network". If the network is stable according to the stability select, can I just use the visualizing function and do not select frequency?
    weighted_net=ToSym(stab_selection$freqs_opt)
    g<-draw_network(weighted_net,nodes_label = species.names, title="Weighted", pal_edges="dodgerblue3",layout="stress",shade = TRUE, btw_rank=3)$G

sorry, too much questions, I am the beginner of data analysis. Looking forward to you reply. Best wishes!

Raphaellemomal commented 2 years ago

Hi sorry for my late answer. I will try to explain better here with this diagram. To obtain a network with the resampling version of EMtree, you need to set two thresholds: one on the edges conditional probabilities (output from the main method) and one on the frequencies.

image

But the problem is that we don't known how to set Pt a priori because they are probabilities scaled for a tree structure (the sum of all probabilities is the number of edges in a tree graph), they are not really edges probabilities of presence. This is where StATS() helps: it provides the optimal Pt, which is the value for which we have a good stability value across resamples, which means a good agreement between the resamples on the selected edges.

Additionally to providing a solution to the choice of Pt, you see here that we change of quantity of interest, going from conditional probabilities to well-understandable selection frequencies. This is important for interpretation : nos the weight on edges is actually a measure of the robustness of edges.

So after ResampleEMtree(), use StATS() to find the optimal Pt, and then you can either threshold the edge selection frequencies if you wish a binary network, or juste continue with the edges frequencies to work with a weighted network.

Is that clearer?

GENG126 commented 2 years ago

Hi sorry for my late answer. I will try to explain better here with this diagram. To obtain a network with the resampling version of EMtree, you need to set two thresholds: one on the edges conditional probabilities (output from the main method) and one on the frequencies.

image

But the problem is that we don't known how to set Pt a priori because they are probabilities scaled for a tree structure (the sum of all probabilities is the number of edges in a tree graph), they are not really edges probabilities of presence. This is where StATS() helps: it provides the optimal Pt, which is the value for which we have a good stability value across resamples, which means a good agreement between the resamples on the selected edges.

Additionally to providing a solution to the choice of Pt, you see here that we change of quantity of interest, going from conditional probabilities to well-understandable selection frequencies. This is important for interpretation : nos the weight on edges is actually a measure of the robustness of edges.

So after ResampleEMtree(), use StATS() to find the optimal Pt, and then you can either threshold the edge selection frequencies if you wish a binary network, or juste continue with the edges frequencies to work with a weighted network.

Is that clearer?

Yes!!! Never understand a model so clear before!! Thank you so so so so so much Momal!!

GENG126 commented 2 years ago

Hi, Momal: I am here again! I still have questions in "PLN" step. Sorry, it is quite hard for me to handle this problem. As you suggested first time, I need to consider "year" as a covariate and use "plots" as the rows of the matrix. So I change my mind, I construct network by each year-season (eg. 2013 dry season).

data<- as.matrix(read.csv("2013dry.csv"))
dim(data)
##23  10
site<- as_tibble(read.csv("2013dry.cor.csv"))
dim(site)
##23 2

data 5c844d1c9dc4f01f2e860b9bbb4414e

#Fit PLN model
PLNMODEL<- PLN(data~site$HL)
warning:
1: In dpois(y, mu, log = TRUE) : non-integer x = 8.333333
2: In dpois(y, mu, log = TRUE) : non-integer x = 83.333333
3: In dpois(y, mu, log = TRUE) : non-integer x = 4.545455
4: In dpois(y, mu, log = TRUE) : non-integer x = 3.351955
5: In dpois(y, mu, log = TRUE) : non-integer x = 2.500000
6: In dpois(y, mu, log = TRUE) : non-integer x = 2.777778
7: In dpois(y, mu, log = TRUE) : non-integer x = 83.333333
8: In dpois(y, mu, log = TRUE) : non-integer x = 11.111111
9: In dpois(y, mu, log = TRUE) : non-integer x = 5.263158
10: In dpois(y, mu, log = TRUE) : non-integer x = 66.666667
11: In dpois(y, mu, log = TRUE) : non-integer x = 11.904762
12: In dpois(y, mu, log = TRUE) : non-integer x = 1.818182
13: In dpois(y, mu, log = TRUE) : non-integer x = 13.888889
14: In dpois(y, mu, log = TRUE) : non-integer x = 4.166667
stab_selection=StATS(ResampEmtreeFit$Pmat, nlambda=50, stab.thresh=0.9,plot=TRUE)

Rplot01 This is the result of stable selection. And I can get the network: Rplot1

Can I just ignore these warnings? I try each year-season data, on the one hand, I exclude species data with too many "0"; on the other hand, some year-season has many plots like 37. But all have the warnings.

Looking forward to your reply! Best wishes. Geng Ying.

Raphaellemomal commented 2 years ago

Hi, sorry again for the late answer. I talked with Julien Chiquet about this (owner of PLNmodels), we consider it is fine for you to pursue with your model as it is. I just have a small question: how is it that you have data dimension 23 x 10, and 25 species in your network ?

Best, Raphaëlle

GENG126 commented 2 years ago

Thank you so much for the reply! It help me to continue my work! The data dimension 23 x 10 is the species I selected more than 3 times in the matrix, and the network I showed is all the species including some species which may only occur once (I try whether there is difference with or without these species, and the PLN result is no). I think the network is more complete to show with 25 species to you, so I use this network. Sorry the inconsistency before and after confuses you.

GENG126 commented 2 years ago

Hi Momal: Sorry I still have questions about EMtree.

ResampEmtreeFit<-ResampleEMtree(counts=data, S=999, covar_matrix = site$HL, maxIter=25,cond.tol=1e-8, cores=1)

I set different number of maxlter and get different network interactions. So I am curious how to set a maxIter. I have 8-year data, I want to show network each year. And the species I have most is 21 in 2020. So I think maxIter is 21 in all networks, so that all (8) networks have the uniform standards. Do I understand correctly? And I have no idea how to set cond.tol, is this influence my result?

Looking forward to your reply!

Best wishes! Geng Ying.

Rmomal commented 2 years ago

Hi Ying,

You see that it is mostly about the algorithm's convergence, that you can check by running the simple EMtree() function with plot=TRUE (as a first check for maxiter, before ResampleEMtree() for example.

Best, Raphaëlle

Le mar. 31 mai 2022 à 04:34, GENG Ying @.***> a écrit :

Hi Momal: Sorry I still have questions about EMtree.

ResampEmtreeFit<-ResampleEMtree(counts=data, S=999, covar_matrix = site$HL, maxIter=25,cond.tol=1e-8, cores=1)

I set different number of maxlter and get different network interactions. So I am curious how to set a maxIter. I have 8-year data, I want to show network each year. And the species I have most is 21 in

  1. So I think maxIter is 21 in all networks, so that all (8) networks have the uniform standards. Do I understand correctly? And I have no idea how to set cond.tol, is this influence my result?

Looking forward to your reply!

Best wishes! Geng Ying.

— Reply to this email directly, view it on GitHub https://github.com/Rmomal/EMtree/issues/4#issuecomment-1141607415, or unsubscribe https://github.com/notifications/unsubscribe-auth/AH2P2ZIUMZZRL756X4YH7G3VMV3EJANCNFSM5N4GSVIQ . You are receiving this because you are subscribed to this thread.Message ID: @.***>

GENG126 commented 2 years ago

Hi Ying, - maxIter is the "Maximum number of EMtree iterations at each sub-sampling". It controls the number of iterations of the algorithm, so setting it to 21 can be too low and stopping the algorithm before it gets a chance to reach convergence. Il would be better to set this parameter to at least 50. - Your S parameter is very high, not that it is a problem but I think it is unnecessary. S is the "Total number of sub-samples.", and as you have 21 samples maximum setting S to 500 is enough. - cond.tol should be very low as it is a precision threshold for the computations. Default value might be too low as you have little number of samples, but I do not recommend to set it above 1e-6. Setting this higher creates a risk that the algorithm does not converge properly. You see that it is mostly about the algorithm's convergence, that you can check by running the simple EMtree() function with plot=TRUE (as a first check for maxiter, before ResampleEMtree() for example. Best, Raphaëlle Le mar. 31 mai 2022 à 04:34, GENG Ying @.***> a écrit :

Hi Momal: Thank you very much for the reply!

Best wishes! Geng.