dannyconrad opened this issue 9 months ago
Hi @dannyconrad! Thanks for using ArchR! Please make sure that your post belongs in the Issues section. Only bugs and error reports belong in the Issues section. Usage questions and feature requests should be posted in the Discussions section, not in Issues.
It is worth noting that there are very few actual bugs in ArchR. If you are getting an error, it is probably something specific to your dataset, usage, or computational environment, all of which are extremely challenging to troubleshoot. As such, we require reproducible examples (preferably using the tutorial dataset) from users who want assistance. If you cannot reproduce your error, we will not be able to help.
Before going through the work of making a reproducible example, search the previous Issues, Discussions, function definitions, or the ArchR manual and you will likely find the answers you are looking for.
If your post does not contain a reproducible example, it is unlikely to receive a response.
In addition to a reproducible example, you must do the following things before we help you, unless your original post already contained this information:
1. If you've encountered an error, have you already searched previous Issues to make sure that this hasn't already been solved?
2. Did you post your log file? If not, add it now.
3. Remove any screenshots that contain text and instead copy and paste the text using markdown's codeblock syntax (three consecutive backticks). You can do this by editing your original post.
I didn't realize the Discussions section was separate, and I just noticed that someone brought up a similar concern in #1935, but it hasn't been answered yet. Since I think this may be a "bug" even though it doesn't throw an error, I'll leave this post in this section for now.
For additional context, here's an example of the scaled vs non-scaled corToDepth vectors, which would be identical if the reduced embedding were being scaled by component instead of by cell:
```r
> lapply(proj_2@reducedDims$IterativeLSI$corToDepth, head)
$scaled
      LSI1       LSI2       LSI3       LSI4       LSI5       LSI6 
0.03715766 0.23018184 0.09827107 0.10180009 0.11730853 0.22932629 

$none
     LSI1      LSI2      LSI3      LSI4      LSI5      LSI6 
0.9068793 0.3038990 0.1302208 0.1026349 0.1169979 0.2494179 
```
To add to this, it seems (in my hands) that the only serious change `scaleDims` brings about is whether or not that first dimension is excluded by the `corCutOff` parameter. I won't pretend to have a full grasp of the underlying math, but the row-scaling of the LSI matrix does not seem to have much of an impact on the final UMAP embedding. Based on my testing below, my best guess is that this is a result of the cosine distance metric used when running `uwot::umap()`.
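One way to rationalize that (a minimal sketch, not from ArchR or uwot): cosine similarity is invariant to multiplying a row vector by a positive constant, and dividing each cell's row by its own SD is exactly that kind of per-cell rescaling. The mean-centering half of the row z-score does change each row's direction, just apparently not by much in practice.

```r
# Minimal demonstration that cosine similarity ignores per-row positive
# rescaling (the divide-by-SD part of a row z-score).
cosineSim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
set.seed(1)
x <- rnorm(30)
y <- rnorm(30)
all.equal(cosineSim(x, y), cosineSim(5 * x, y))  # TRUE
```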
To try things out on a dataset I'm working on, I implemented my own column-based scaling and applied it to the non-scaled LSI embedding like so:
```r
# Column-wise z-scoring: center and scale each LSI component (column)
# by its own mean and SD across all cells.
colZscores <- function(m = NULL) {
  z <- sweep(t(t(m) - colMeans(m)), 2, matrixStats::colSds(m), `/`)
  return(z)
}

proj@reducedDims$LSI_ATAC_ColScale <- proj@reducedDims$LSI_ATAC_Raw
proj@reducedDims$LSI_ATAC_ColScale$matSVD <- colZscores(proj@reducedDims$LSI_ATAC_ColScale$matSVD)
```
Here are the 2nd and 3rd LSI dimensions plotted raw, with row-scaling, and with column-scaling.
Here are the resulting UMAPs computed using cosine distance (default). I believe this shows that only column-scaling transforms the UMAP embedding in a meaningful way compared to using the raw LSI values.
And finally, just out of curiosity, I computed UMAPs using euclidean distance. Interestingly, the euclidean row-scaled LSI UMAP looks very similar to the cosine-computed raw and row-scaled UMAPs. I'm not really sure why, but I thought it was interesting.
All UMAPs were generated with `dimsToUse = 2:30`.
Since the goal of the Z-scaling is to reduce bias from dominant early PCs, I think this demonstrates that the current implementation is not actually accomplishing that.
I read the issue threads #323 & #447 because I've been working a lot with the LSI components of my datasets lately. I think I found the source of the confusion that led to those posts.
When `scaleDims` is set to `TRUE` (the default), the `rowZscores` function is invoked, which scales the LSI component values of each individual cell by row, i.e. using the mean and SD of that cell's N components. Given the function's name, this seems to be by design.
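For reference, the row-wise operation amounts to something like the following sketch (my own paraphrase of the behavior, not ArchR's verbatim source):

```r
# Row-wise z-scoring (sketch): each cell (row) is centered and scaled by
# the mean and SD of its own component values.
rowZscoresSketch <- function(m) {
  sweep(m - rowMeans(m), 1, matrixStats::rowSds(m), `/`)
}
```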
However, I also saw in the documentation of `getReducedDims()` that the idea was based on the scaling Tim Stuart introduced in `Signac::RunSVD()`. I dug into the code of `RunSVD()`, and there the scaling is done not by row but by column, i.e. using the mean and SD of each component instead.
The relevant operation within `RunSVD()` (paraphrased below rather than quoted verbatim, so treat the variable names as approximate):
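```r
# Paraphrase of the column-wise scaling in Signac::RunSVD(): each embedding
# dimension (column) is centered and scaled by its own mean and SD across cells.
embed.mean <- apply(cell.embeddings, MARGIN = 2, FUN = mean)
embed.sd   <- apply(cell.embeddings, MARGIN = 2, FUN = sd)
cell.embeddings <- t((t(cell.embeddings) - embed.mean) / embed.sd)
```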
As far as I can tell, in both cases the input matrix has components as columns and cells as rows.
This discrepancy is why the corToDepth vector of the scaled embedding is so different and why the scaled dimensions no longer really correlate with nFrags. Since the scaled values are used by default, LSI1 is almost never filtered out by corCutOff even when it should be.
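To make that concrete with the corToDepth values shown above (an illustrative sketch using the default cutoff of 0.75):

```r
# Under scaleDims = TRUE, LSI1's reported correlation to depth is ~0.04,
# so it survives corCutOff = 0.75; under scaleDims = FALSE it is ~0.91
# and would be dropped.
dimsScaled <- getReducedDims(proj_2, reducedDims = "IterativeLSI",
                             scaleDims = TRUE,  corCutOff = 0.75)  # LSI1 kept
dimsRaw    <- getReducedDims(proj_2, reducedDims = "IterativeLSI",
                             scaleDims = FALSE, corCutOff = 0.75)  # LSI1 dropped
```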
To reproduce/verify this, you can check the ranked order of the values along each axis before and after the scaling is performed; a sketch of that check (object names taken from the examples above) follows:
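```r
# Rank order of cells along each dimension is preserved by a column-wise
# scaling but not by the row-wise one. corCutOff = 1 keeps all dimensions
# so the columns of the two matrices line up.
raw    <- getReducedDims(proj_2, reducedDims = "IterativeLSI",
                         scaleDims = FALSE, corCutOff = 1)
scaled <- getReducedDims(proj_2, reducedDims = "IterativeLSI",
                         scaleDims = TRUE,  corCutOff = 1)

sapply(seq_len(ncol(raw)), function(i) {
  identical(order(raw[, i]), order(scaled[, i]))
})
# Expect FALSE under the row-wise scaling; colZscores(raw) would return
# TRUE for every dimension.
```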
Because of the way it distorts the depth correlation and artificially rearranges the cells relative to one another in the lower-dimensional space, I'm guessing this is not the correct way to scale the LSI dimensions, but maybe I'm wrong and this was intentionally done differently from Signac? Or have I missed some key detail here?