biometry / bipartite

repository for the bipartite R-package for network analysis
35 stars 17 forks source link

Why does the order of the species in the adjacency matrix makes a difference with some indices? #14

Closed valentinitnelav closed 3 years ago

valentinitnelav commented 3 years ago

Thank you for this great work on bipartite!

I observed that the order of the species in the adjacency matrix makes a difference with some indices when using the function networklevel (Not shown below, but this also happens with specieslevel). Below is an example using the Safariland data.

The species' names come in a particular order. I noticed that if I alter that order, I can get differences in some network indices and this seems unexpected for me. Would you help me understand why is it so?

library(bipartite)

# Compute all network indices
set.seed(1) # to be sure is not because of the random number generation, I set its state here
net_1 <- bipartite::networklevel(web = Safariland)

# Shuffle column names
set.seed(1)
Safariland_2 <- Safariland[, sample(colnames(Safariland))]

set.seed(1)
net_2 <- networklevel(web = Safariland_2)

# The two results are different
identical(net_1, net_2)
## FALSE

# Not all metrics change, but here are the ones where differences appear:
net_df <- data.frame(net_1, net_2)
net_df$dif <- net_df$net_1 - net_df$net_2
net_df[net_df$dif != 0,]

##                          net_1      net_2         dif
## nestedness          19.9081024 19.8819392  0.02616325
## weighted nestedness  0.3852073  0.4056451 -0.02043776
## discrepancy.HL      21.0000000 19.0000000  2.00000000
## extinction.slope.HL  2.6650877  2.8054999 -0.14041217
## extinction.slope.LL  1.3179904  1.3745174 -0.05652698
## robustness.HL        0.7186344  0.7302413 -0.01160692
## robustness.LL        0.5640681  0.5760280 -0.01195993

Also if I order alphabetically the column names, I get different results:

Safariland_3 <- Safariland[, sort(colnames(Safariland))]

set.seed(1)
net_3 <- networklevel(web = Safariland_3)

net_df <- data.frame(net_1, net_3)
net_df$dif <- net_df$net_1 - net_df$net_3
net_df[net_df$dif != 0,]

##                                      net_1        net_4           dif
## nestedness                      19.9081024   19.7548961  0.1532062840
## weighted nestedness              0.3852073    0.4095266 -0.0243192413
## togetherness.HL                  0.1606838    0.1785375 -0.0178537512
## discrepancy.HL                  21.0000000   22.0000000 -1.0000000000
## extinction.slope.HL              2.6650877    2.6336544  0.0314333647
## extinction.slope.LL              1.3179904    1.3208604 -0.0028699917
## robustness.HL                    0.7186344    0.7202266 -0.0015922022
## robustness.LL                    0.5640681    0.5643816 -0.0003134797
## functional.complementarity.HL 1683.8042072 1683.7006539  0.1035533906

Information About the R Session:

sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=de_DE.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] bipartite_2.16       sna_2.6              network_1.17.1      
## [4] statnet.common_4.5.0 vegan_2.5-7          lattice_0.20-44     
## [7] permute_0.9-5       
## 
## loaded via a namespace (and not attached):
##  [1] spam_2.7-0        tidyselect_1.1.1  xfun_0.24         purrr_0.3.4      
##  [5] splines_4.1.0     colorspace_2.0-2  vctrs_0.3.8       generics_0.1.0   
##  [9] htmltools_0.5.1.1 viridisLite_0.4.0 yaml_2.2.1        mgcv_1.8-36      
## [13] utf8_1.2.1        rlang_0.4.11      pillar_1.6.1      glue_1.4.2       
## [17] DBI_1.1.1         lifecycle_1.0.0   stringr_1.4.0     fields_12.5      
## [21] dotCall64_1.0-1   munsell_0.5.0     gtable_0.3.0      coda_0.19-4      
## [25] evaluate_0.14     knitr_1.33        parallel_4.1.0    fansi_0.5.0      
## [29] highr_0.9         scales_1.1.1      gridExtra_2.3     ggplot2_3.3.5    
## [33] digest_0.6.27     stringi_1.7.3     dplyr_1.0.7       grid_4.1.0       
## [37] tools_4.1.0       magrittr_2.0.1    maps_3.3.0        tibble_3.1.2     
## [41] cluster_2.1.2     crayon_1.4.1      pkgconfig_2.0.3   MASS_7.3-54      
## [45] ellipsis_0.3.2    Matrix_1.3-4      assertthat_0.2.1  rmarkdown_2.9    
## [49] viridis_0.6.1     R6_2.5.0          igraph_1.2.6      nlme_3.1-152     
## [53] compiler_4.1.0
cdormann commented 3 years ago

Many thanks for the question and extensive documentation! The question "Why are some indices dependent on the sequence of species in the matrix?" is a simple: "Ties."

Let's take discrepancy: This uses only the binary information, i.e. link or no link. Many species now have the same number of links, and there is no natural sequence in which these should be sorted. bipartite::discrepancy does not sort within ties, so when you shuffle the rows/columns, you get different sequences and results. vegan::nesteddisc goes through (up to 200) iterations to re-arrange the matrix for maximal discrepancy, i.e. find some "natural" sequence within ties. Often that works, but not necessarily (see the functions help page). I think vegan::nesteddisc is better than bipartite::discrepancy and have just changed "networklevel" accordingly.

The same holds for nestedness and the simulation of extinction sequences (where I did not change anything in the code).

The ties-problem arises particularly in networks with many singleton observations, or low number of observations overall. There is nothing we can do about it (I think), but I will add some notes in networklevel and specieslevel to make this clearer.

My experience is that in comparison to null models or among different networks, the tie-induced variability of indices typically is unproblematic. But for any specific situation, it is worth knowing why this variability arises.