ChiLiubio / microeco

An R package for data analysis in microbial community ecology
GNU General Public License v3.0

trans_network: in function module_roles the values of z and p may be calculated incorrectly #274

Closed trx296554555 closed 9 months ago

trx296554555 commented 9 months ago

Hi Liu,

Thank you for this awesome package! It has been an invaluable tool for me in the field of metagenomic analysis, saving me a significant amount of time and effort.

However, while using trans_network, I found an issue that I would like to bring to your attention.


I needed to use relative abundance or RLE data to calculate the correlation coefficients, which was difficult to do during trans_network initialization, so I used NetCoMi to build the network and then calculated the z-score and participation coefficient of each node myself in R.

During the analysis, I noticed an inconsistency between the z and p values from my own calculation and those generated by the "module_roles" function in trans_network. After careful investigation, it appears there might be an error in the calculation within the package.

Here are the formulas for calculating z and p, taken from Deng, Y., Jiang, Y.-H., Yang, Y. et al. Molecular ecological network analyses. BMC Bioinformatics 13, 113 (2012). https://doi.org/10.1186/1471-2105-13-113.
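As I read the paper, the two definitions are

$$z_i = \frac{k_{ib} - \bar{k}_b}{\sigma_{k_b}}, \qquad P_i = 1 - \sum_{b=1}^{N_M} \left(\frac{k_{ib}}{k_i}\right)^2$$

where $k_{ib}$ is the number of links of node $i$ to nodes in module $b$ (its own module in the case of $z_i$), $\bar{k}_b$ and $\sigma_{k_b}$ are the mean and standard deviation of the within-module degree over all nodes in module $b$, $k_i$ is the total degree of node $i$, and $N_M$ is the number of modules.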

Here's the code and result in trans_network:

# build the network: Pearson correlation (computed via NetCoMi) after filtering low-abundance taxa
t1 <- trans_network$new(dataset = dataset, cor_method = "pearson", use_NetCoMi_pearson_spearman = TRUE, filter_thres = 0.0005)
# keep correlations with p < 0.05 and |r| > 0.5 as edges
t1$cal_network(COR_p_thres = 0.05, COR_cut = 0.5)
# detect modules and extract the node table with topological roles
t1$cal_module(method = "cluster_fast_greedy")
t1$get_node_table(node_roles = TRUE)
node_res <- t1$res_node_table

t1$get_edge_table()
edge_res <- t1$res_edge_table

t1$get_adjacency_matrix()
adja_res <- t1$res_adjacency_matrix

In node_res (screenshot of the node table):

For "node 853", I find that the reported z-score of -0.61075946 can be reproduced by the following code:

# 20 is the 'degree' value reported for node 853 in node_res
(20 - mean(subset(node_res, module == 'M1')$degree)) / sd(subset(node_res, module == 'M1')$degree)

I also checked edge_res and found that the 'degree' column is the node's degree in the whole network:

edge_res %>% subset(node1=='853'|node2=='853') %>% nrow()   # all edges incident to node 853 in the whole network

But according to the definition ("where kib is the number of links of node i to other nodes in its module b"), the degree used for z should count only the links within the node's own module.
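As a quick check with the objects above (node IDs and module labels are taken from the name and module columns of node_res, as in my script below), the true within-module degree of node 853 can be obtained by counting only the edges whose other endpoint is also in M1:

m1_nodes <- subset(node_res, module == 'M1')$name
# edges of node 853 whose other endpoint is also in module M1
edge_res %>%
  subset((node1 == '853' & node2 %in% m1_nodes) | (node2 == '853' & node1 %in% m1_nodes)) %>%
  nrow()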

As for p, I do not fully understand how the participation coefficient is calculated internally; the formula itself seems correct: p[i] <- 1 - (sum((taxa.mod.degree)**2)/ki**2)

However, P should be 0 for nodes that are connected only to nodes within their own module.

like "node 1301", only connected with "node 1386", "node 851", "node 5820", which all in module 'M2', but p is 0.31327160.

Both problems seem to be caused by not distinguishing between a node's degree within its own module and its degree to nodes in other modules.


Below I give my own implementation. I am not sure it is correct, but its results are consistent with the definitions described above.

library(reshape2)   # melt()
library(dplyr)      # %>%, select(), group_by(), summarize(), mutate()
library(magrittr)   # %<>%

# Convert an adjacency matrix into long-format edge data
convertCorrelationMatrix <- function(cor_matrix) {
  cor_matrix[lower.tri(cor_matrix)] <- 0
  df <- melt(cor_matrix, varnames = c("node1", "node2"), value.name = "weight")
  df <- subset(df, weight != 0 & node1 != node2)
  return(df)
}

# For each node, count its links to every module (i.e. split the node's degree by the module of the neighbouring node)
calculate_degree_to_each_module <- function(node_table, adja_mtx) {
  connectivity_edge <- convertCorrelationMatrix(adja_mtx)
  df_merged <- merge(connectivity_edge, node_table, by.x = "node1", by.y = "name", all.x = TRUE)
  df_merged <- df_merged[, c("node1", "node2", "module")]
  colnames(df_merged)[3] <- "node1_module"
  df_merged <- merge(df_merged, node_table, by.x = "node2", by.y = "name", all.x = TRUE)
  df_merged <- df_merged[, c("node1", "node2", "node1_module", "module")]
  colnames(df_merged)[4] <- "node2_module"
  edge_module <- df_merged[c("node1", "node2", "node1_module", "node2_module")]
  node_to_module <- rbind(edge_module %>% select(node = node1, to_mod = node2_module), edge_module %>% select(node = node2, to_mod = node1_module))
  count_node_to_module <- node_to_module %>%
    group_by(node, to_mod) %>%
    summarize(count = n())
  degree_df <- matrix(0, nrow = length(unique(count_node_to_module$node)),
                      ncol = length(unique(count_node_to_module$to_mod)),
                      dimnames = list(unique(count_node_to_module$node), sort(unique(count_node_to_module$to_mod)))) %>% as.data.frame()
  for (i in 1:nrow(count_node_to_module)) {
    row_name <- as.character(count_node_to_module$node[i])
    col_name <- as.character(count_node_to_module$to_mod[i])
    value <- count_node_to_module$count[i]
    degree_df[row_name, col_name] <- value
  }
  degree_df$name <- rownames(degree_df)
  degree_df
}

# Calculate the z-score and participation coefficient of each node
calculate_z_p <- function(node_table, adja_mtx) {
  degree_df <- calculate_degree_to_each_module(node_table, adja_mtx)
  tmp_df <- merge(node_table, degree_df, by = 'name', all.x = T)
  tmp_df[is.na(tmp_df)] <- 0
  tmp_df$within_degree <- 0
  tmp_df$total_degree <- 0
  tmp_df$p_coeff <- 0

  if (0 %in% unique(tmp_df$module)) {
    module_num <- length(unique(tmp_df$module)) - 1
  } else {
    module_num <- length(unique(tmp_df$module))
  }

  for (i in 1:nrow(tmp_df)) {
    this_module <- tmp_df$module[i] %>% as.character()
    if (this_module != '0') {
      value <- tmp_df[[this_module]][i]
    } else {
      value <- 0
    }
    tmp_df$within_degree[i] <- value
    # p
    mod.degree <- tmp_df[i, paste0('M',1:module_num)]
    tmp_df$total_degree[i] <- mod.degree %>% sum()
    tmp_df$p_coeff[i] <- 1 - (sum((mod.degree)**2) / tmp_df$total_degree[i]**2)
  }

  # z
  tmp_df %<>%
    group_by(module) %>%
    mutate(z_score = (within_degree - mean(within_degree)) / sd(within_degree)) %>%
    ungroup()
  # In some cases z is NaN and is set to 0 (e.g. a module with only one node, or a standard deviation of 0).
  # p is NaN when ki = 0.
  tmp_df$z_score[is.na(tmp_df$z_score)] <- 0
  tmp_df$p_coeff[is.na(tmp_df$p_coeff)] <- 0
  tmp_df %>% select(-paste0('M',1:module_num))
}

# node_res and adja_res can be used to recalculate z and p
new_node_res <- calculate_z_p(node_res, adja_res)
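As a quick sanity check, the recalculated values for the two nodes discussed above can be inspected directly (the column names are those created by calculate_z_p()):

# inspect node 853 and node 1301 after recalculation
new_node_res %>%
  subset(name %in% c('853', '1301')) %>%
  select(name, module, within_degree, total_degree, z_score, p_coeff)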

Once again, thank you for your incredible work and dedication to developing such a powerful and time-saving tool. I look forward to your response and the opportunity to contribute towards enhancing microeco.

Best regards, Ruixiang

ChiLiubio commented 9 months ago

Hi Ruixiang,

Thanks very much for your report! Yes, it is a bug in zi. This part of the code has existed for a long time, and I no longer remember how zi came to be calculated incorrectly. I have updated the package on GitHub and will release it on CRAN in the coming days. Please feel free to tell me about anything else in this area. I appreciate your care in checking the code and look forward to your participation in this project.

Best, Chi

trx296554555 commented 9 months ago

Hi Chi,

Thank you sincerely for your prompt response and quick resolution of the bug I found.

I see that the bug in z has been fixed in the latest commit. However, I am uncertain whether the calculation of p has also been updated, as I previously encountered issues with the p results as well.

Specifically, I observed that nodes connected only within their own module, which should have a p of 0 by definition, were yielding non-zero p values. Could you please review the calculation of p as well, if possible?

I greatly appreciate your attention to detail and continuous efforts to improve the accuracy and reliability!

Thank you for your time and dedication to maintaining the excellence of this project. I look forward to your response.

Best regards, Ruixiang

ChiLiubio commented 9 months ago

Hi Ruixiang,

It was late yesterday, so I only checked the z value part. I have just confirmed that you are right: there is also an issue with the p value. It comes from a mismatch of data frame names introduced during a previous version update; the calculation itself is fine, but the mismatch puts the wrong values into the final table. Such bugs should be caught by the test code I run myself, but I have found it extremely hard to write complete self-tests given how many complicated methods the framework now contains. I will pay more attention to this. I have updated the package again on GitHub. Thank you so much for your finding and careful verification!
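For anyone reading this later, here is a minimal sketch of that kind of name/order mismatch (just an illustration, not the actual microeco code): the per-node values are computed correctly but attached back to the node table by position instead of by name.

# illustration only: values computed in a different order than the node table
node_table <- data.frame(name = c("a", "b", "c"))
z <- c(b = 1.2, c = -0.4, a = 0.7)        # correct values, but in a different order

node_table$z_wrong <- z                   # assigned by position, so values land on the wrong rows
node_table$z_right <- z[node_table$name]  # matched by name, correct
node_table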

Best, Chi

ChiLiubio commented 9 months ago

One more thing: if you have time, could you test the updated method again as a double check, even though I think it is fine now ◠‿◠. Thanks again! Could you please tell me your email for further communication?

trx296554555 commented 9 months ago

Of course! I'd be happy to provide my email for further communication. Please reach out to me at rbio_tang@foxmail.com.