BRANCHlab / metasnf

Scalable subtyping with similarity network fusion
https://branchlab.github.io/metasnf/

comparison to ab-snf #8

Closed Denisesabac closed 8 months ago

Denisesabac commented 8 months ago

Hello! I was wondering whether adding weights as a parameter in the batch_snf function calculates the weighted distance matrices in the same way as the ab-SNF method developed by Ruan et al.

pvelayudhan commented 8 months ago

If the ab-SNF method is to use this distance function:

https://github.com/pfruan/abSNF/blob/master/R/dist2_w.R

You should be able to replicate this effect exactly by providing weights and using the sew_euclidean_distance metric, not the default one.

If you use the default euclidean distance metric, the logic of what you are doing will match abSNF but the specific numbers may not align due to some slight scaling differences.

Also note that metasnf by default will redistribute your features across matrices in a variety of ways, whereas ab-SNF I believe adheres to the original "individual" SNF scheme. So make sure you are sticking to snf_scheme 1 too.

Some example code:

my_distance_metrics <- generate_distance_metrics_list(
    continuous_distances = list(
        "standard_norm_euclidean" = sew_euclidean_distance
    ),
    discrete_distances = list(
        "standard_norm_euclidean" = sew_euclidean_distance
    )
)

settings_matrix <- generate_settings_matrix(
    data_list,
    nrow = 10,
    distance_metrics_list = my_distance_metrics
)

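# Placeholder, not runnable as written: replace the right-hand side with your
# feature weights, in the order the features show up
weights_matrix <- (add your feature weights as they show up)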

solutions_matrix <- batch_snf(
    data_list,
    settings_matrix,
    distance_metrics_list = my_distance_metrics,
    weights_matrix = weights_matrix
)

Some relevant pages:

And the source code of the "sew_euclidean_distance" method: https://github.com/BRANCHlab/metasnf/blob/7d89bbe23e34999e733dee66734a83afba9a1c05/R/distance_metrics.R#L82

Denisesabac commented 8 months ago

Thank you for the suggestions! I tried using snf_scheme 1 and the sew_euclidean_distance metric and am still getting different clustering results between ab-SNF and weighted metasnf. I also tried standard-normalizing the data before applying metasnf, and I am still getting different results. Is there anything else that might account for this difference?

pvelayudhan commented 8 months ago

It may be easier to see if you can post the code you are using with abSNF and metasnf. There are a lot of ways in which things can diverge. Some things to consider:

If you are sure the settings are the same, I would start by checking to see if the affinity matrix produced by metasnf is the same or different.

To debug beyond that I think I would need your data and code for both the abSNF and metasnf implementations.
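As a starting point for that check, here is a minimal base R sketch for comparing two affinity (or distance) matrices. `compare_matrices` is a hypothetical helper, not part of metasnf or SNFtool, and `W_a` and `W_b` are made-up stand-ins for the matrices each package produces. It assumes both matrices carry subject IDs as row and column names, and it aligns on those names first so that a different subject order is not mistaken for a numerical difference.

```r
# Hypothetical helper (not part of metasnf or SNFtool): compare two square
# matrices after aligning them on their shared row/column names.
compare_matrices <- function(a, b, tolerance = 1e-8) {
    stopifnot(is.matrix(a), is.matrix(b))
    shared <- intersect(rownames(a), rownames(b))
    if (length(shared) == 0) {
        stop("No shared row names; cannot align the matrices.")
    }
    # Reorder both matrices so subjects line up regardless of input order
    a <- a[shared, shared, drop = FALSE]
    b <- b[shared, shared, drop = FALSE]
    abs_diff <- abs(a - b)
    # First cell (in column-major order) holding the largest difference
    worst <- which(abs_diff == max(abs_diff), arr.ind = TRUE)[1, ]
    list(
        n_shared = length(shared),
        max_abs_diff = max(abs_diff),
        worst_pair = c(shared[worst[["row"]]], shared[worst[["col"]]]),
        equal = max(abs_diff) <= tolerance
    )
}

# Made-up example: same values in a different subject order, one pair changed
ids <- c("s1", "s2", "s3")
W_a <- matrix(
    c(1, 0.2, 0.3,
      0.2, 1, 0.4,
      0.3, 0.4, 1),
    nrow = 3, dimnames = list(ids, ids)
)
W_b <- W_a[c(3, 1, 2), c(3, 1, 2)]
W_b["s1", "s2"] <- W_b["s2", "s1"] <- 0.25

result <- compare_matrices(W_a, W_b)
print(result)
```

If the distance matrices already disagree, the difference is upstream of fusion (normalization, weighting, or the distance metric); if they agree but the affinity matrices do not, the kernel settings are the more likely place to look.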

pvelayudhan commented 8 months ago

My mistake, it looks like sew_euclidean_distance and dist2_w don't actually produce the same results. Looking into this now.

pvelayudhan commented 8 months ago

Never mind, it does look like sew_euclidean_distance and dist2_w do the same thing:

dist2_w=function (X, C, weight)
{
    for (i in 1:dim(X)[1]){
        X[i,]=sqrt(weight)*X[i,]
    }
    for (i in 1:dim(C)[1]) {
        C[i, ] = sqrt(weight) * C[i, ]
    }
    ndata = nrow(X)
    ncentres = nrow(C)
    sumsqX = rowSums(X^2)
    sumsqC = rowSums(C^2)
    XC = 2 * (X %*% t(C))
    res = matrix(rep(sumsqX, times = ncentres), ndata, ncentres) +
        t(matrix(rep(sumsqC, times = ndata), ncentres, ndata)) -
        XC
    res[res < 0] = 0
    return(res)
}

sew_euclidean_distance <- function(df, weights_row) {
    weights <- format_weights_row(weights_row)
    weights <- sqrt(weights)
    weighted_df <- as.matrix(df) %*% weights
    distance_matrix <- weighted_df |>
        stats::dist(method = "euclidean") |>
        as.matrix()
    distance_matrix <- distance_matrix^2
    return(distance_matrix)
}

weights <- 1:ncol(mtcars)

# returns TRUE
all.equal(
    sew_euclidean_distance(mtcars, weights), 
    dist2_w(as.matrix(mtcars), as.matrix(mtcars), weights)
)

So unfortunately further debugging is needed to identify what is causing the difference in your situation.

I would recommend creating and working through a minimal example: make up some small mock dataframes and see whether you can get the two approaches to line up on them. That would be very helpful for identifying what differs between the two scenarios. Once you have a minimal example that doesn't match between the two cases, it'll also be much easier for me to help find what is going wrong.
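A minimal example of that kind might look like the following base R sketch. The mock data, the weights, and the helper names `weighted_sq_dist_scaled` and `weighted_sq_dist_loop` are all made up for illustration; neither function is metasnf or abSNF code. It computes the weighted squared Euclidean distance in two independent ways, so any disagreement can be traced to a single pair of subjects.

```r
# Made-up mock data: 4 subjects, 3 features
mock_df <- data.frame(
    f1 = c(1, 2, 3, 4),
    f2 = c(0, 1, 0, 1),
    f3 = c(5, 3, 1, 2),
    row.names = c("s1", "s2", "s3", "s4")
)
feature_weights <- c(1, 2, 0.5)

# Approach 1: scale each column by sqrt(weight), then square the ordinary
# Euclidean distance
weighted_sq_dist_scaled <- function(df, w) {
    scaled <- sweep(as.matrix(df), 2, sqrt(w), `*`)
    as.matrix(stats::dist(scaled, method = "euclidean"))^2
}

# Approach 2: the explicit definition, sum over j of w_j * (x_ij - x_kj)^2
weighted_sq_dist_loop <- function(df, w) {
    x <- as.matrix(df)
    n <- nrow(x)
    out <- matrix(0, n, n, dimnames = list(rownames(x), rownames(x)))
    for (i in seq_len(n)) {
        for (k in seq_len(n)) {
            out[i, k] <- sum(w * (x[i, ] - x[k, ])^2)
        }
    }
    out
}

d_scaled <- weighted_sq_dist_scaled(mock_df, feature_weights)
d_loop <- weighted_sq_dist_loop(mock_df, feature_weights)

# Check that the two approaches agree on the mock data
print(all.equal(d_scaled, d_loop))
```

Swapping the real sew_euclidean_distance or dist2_w call in for one of these helpers, one at a time, should show which step first diverges from the hand-computed values.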

Denisesabac commented 8 months ago

Thank you for the continued feedback! I have now realized that, even without applying the weights, I get slightly different results from the standard approach outlined in the "SNFtool" package than from meta clustering. I copied my code for meta clustering below. Please let me know if you have any suggestions on what is causing this discrepancy.

I am using alpha=0.5 and k=20 for both versions. Also, the optimal number of clusters is the same using both eigen gap and rotation cost metrics.

# add column with patient number to each df
SD_SNF <- rownames_to_column(SD_SNF2, var = "subject")
Cog_SNF <- rownames_to_column(Cog_SNF2, var = "subject")
WM_SNF <- rownames_to_column(WM_SNF2, var = "subject")
CTV_SNF <- rownames_to_column(CTV_SNF2, var = "subject")

# create list of all dfs used for meta clustering
data_list <- generate_data_list(
    list(SD_SNF, "sociodemographics", "demo", "continuous"),
    list(Cog_SNF, "cognitive", "cog", "continuous"),
    list(WM_SNF, "whitematter", "wm", "continuous"),
    list(CTV_SNF, "corticalthickness", "ctv", "continuous"),
    uid = "subject"
)

# summary of the data list
summarize_dl(data_list)

# generate custom distance matrices
my_distance_metrics <- generate_distance_metrics_list(
    continuous_distances = list(
        "standard_norm_euclidean" = sn_euclidean_distance
    )
)

summarize_distance_metrics_list(my_distance_metrics)

# create a matrix for the meta clustering settings
settings_matrix <- generate_settings_matrix(
    data_list,
    nrow = 1,
    min_k = 20,
    max_k = 20,
    distance_metrics_list = my_distance_metrics
)

# adjust parameters as needed
settings_matrix$alpha <- 0.5
settings_matrix$inc_cognitive <- 1
settings_matrix$clust_alg <- 2
settings_matrix$snf_scheme <- 1
settings_matrix$cont_dist <- 2

# generate solutions matrix
solutions_matrix <- batch_snf(
    data_list,
    settings_matrix,
    distance_metrics_list = my_distance_metrics
)

pvelayudhan commented 8 months ago

I don't notice anything jumping out as a reason for the discrepancy.

I may end up needing a toy version of your data by email to help troubleshoot further. Sorry!