@auesro I believe Seurat is using the transfer anchors as pseudo-features for the clustering space. I believe the error arises because, when you run PCA, you cannot have more principal components than features (if you are performing dimension reduction, you want fewer features, not more). It's been a while since the paper, but I think we ran it the same way as the integration vignette (I think we ran it in Seurat 2.x, so things have likely changed). So find the integration anchors, then essentially repeat the analysis on the integrated data (make sure you set the default assay).
label_transfer.R
was written by @rbpatt2019. I'm not sure why he set npcs to NULL. I use the default value; see my code below, which is essentially from the Seurat vignette.
@rbpatt2019 may respond also.
library(Seurat)
library(purrr)  # map() and the %>% pipe

data.list <- data.list %>%
  map(NormalizeData) %>%
  map(FindVariableFeatures, selection.method = "vst", nfeatures = 2000)

# select features that are repeatedly variable across datasets for integration
features <- SelectIntegrationFeatures(object.list = data.list)

# this went quick up until the last object merged; the whole thing took over 30 min, the last object alone 26 min 41 sec
int.anchors <- FindIntegrationAnchors(object.list = data.list, anchor.features = features)
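From there it is essentially the vignette again: integrate, set the default assay, and redo the dimensionality reduction. A rough sketch (the object names here are illustrative, not copied from my actual script):
data.integrated <- IntegrateData(anchorset = int.anchors)
DefaultAssay(data.integrated) <- "integrated"   # make sure the default assay is set
data.integrated <- ScaleData(data.integrated)
data.integrated <- RunPCA(data.integrated, npcs = 30)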
Hi @danielruss, thanks for the explanation.
I am more familiar with Scanpy than Seurat, but in label_transfer.R
you have:
anchors <- FindTransferAnchors(
  reference = reference,
  query = query,
  normalization.method = "LogNormalize",
  reference.assay = "integrated",
  query.assay = "RNA",
  reduction = "pcaproject",
  features = VariableFeatures(object = reference),
  npcs = NULL,
  dims = 1:28
)
And there is no mention of FindIntegrationAnchors.
...as you say, it might have changed since version 2.x, but I cannot find the old Seurat integration vignette online.
Hopefully @rbpatt2019 can share some insights!
Those Seurat guys are really good... I was able to find the original vignette
@auesro see last comment
Ahh, but you were referring to v3.0 then, that's why I did not find it! :P So basically you (and @rbpatt2019) adapted this part from there:
pancreas.query <- pancreas.list[["fluidigmc1"]]
pancreas.anchors <- FindTransferAnchors(reference = pancreas.integrated, query = pancreas.query,
    dims = 1:30)
predictions <- TransferData(anchorset = pancreas.anchors, refdata = pancreas.integrated$celltype,
    dims = 1:30)
pancreas.query <- AddMetaData(pancreas.query, metadata = predictions)
Right?
Then I wonder why npcs=NULL
instead of the default, which (at least today) is 30.
Ok, I have done some more digging.
The function FindTransferAnchors
changed between Seurat v3.1.5 (the one used in the paper) and the version I am using today inside the renv environment (v4.1.1):
- In v3.1.5: @param npcs Number of PCs to compute on reference. If null, then use an existing PCA structure in the reference object
- In v4.1.1: @param npcs Number of PCs to compute on reference if reference.reduction is not provided
And they added:
@param reference.reduction Name of dimensional reduction to use from the reference if running the pcaproject workflow. Optionally enables reuse of precomputed reference dimensional reduction. If NULL (default), use a PCA computed on the reference object.
So today, I have to specify reference.reduction="pca" and set npcs=NULL in order to replicate the intended analysis.
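In other words, something like this (a sketch: all other arguments are carried over from label_transfer.R, and it assumes the reference object already stores a PCA reduction named "pca" computed on the integrated assay):
anchors <- FindTransferAnchors(
  reference = reference,
  query = query,
  normalization.method = "LogNormalize",
  reference.assay = "integrated",
  query.assay = "RNA",
  reduction = "pcaproject",
  reference.reduction = "pca",  # reuse the precomputed PCA on the reference
  features = VariableFeatures(object = reference),
  npcs = NULL,                  # only used when reference.reduction is not provided
  dims = 1:28
)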
Do we agree?
Yes, this seems to mean that if you want the same "NULL" behavior, use reference.reduction="pca". This makes sense, as they now allow multiple reductions (think UMAP, t-SNE, CCA).
For future users: using either npcs=28 with reference.reduction=NULL, or npcs=NULL together with reference.reduction="pca", on the Russ NeuronsDoublets dataset gives almost exactly the same predictions (only 7 of the ~28,438 cells are classified differently).
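In case it is useful, this is roughly how I compared the two runs (a sketch: pred_npcs28 and pred_refpca are hypothetical names for the two TransferData outputs):
# number of cells assigned a different label by the two settings
sum(pred_npcs28$predicted.id != pred_refpca$predicted.id)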
@danielruss another question: in this repository you mention that the query dataset requires normalized counts and variable features computed. However, in label_transfer.R
you have:
counts <- GetAssayData(query_neural, assay = "RNA", slot = "counts")
Which means that for the neural network part you extract raw counts, right? That gets confusing, because in the example notebook for running only the neural network you mention "If you are running on Seurat data, your Seurat object has the matrix saved in seuratObject.data",
which normally contains normalized data... so what's the correct way to go?
Ok, I just checked more carefully.
You log-normalize the counts in the neural_network.py
script.
So if I understood correctly, we need both raw counts (for the neural network part) and normalized counts (for label transfer in Seurat) in the query dataset, right?
In that case, it might be helpful to clarify somewhere that the pipeline requires both normalized and raw counts in the query dataset.
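For anyone else reading along, both matrices live in the same Seurat object, just in different slots (a sketch: query_neural is the object name used in label_transfer.R, and it assumes NormalizeData has already been run):
raw_counts  <- GetAssayData(query_neural, assay = "RNA", slot = "counts")  # raw counts, used for the neural network
norm_counts <- GetAssayData(query_neural, assay = "RNA", slot = "data")    # log-normalized counts, used for label transfer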
@auesro The neural network is designed to run on the raw counts. I believe I actually take the log of the counts and then divide by the max log count so the maximum equals one. I tried to avoid taking the log (see O'Hara and Kotze), but it just worked better after log normalization. This of course makes sense: should the 20,001st count of a feature carry the same weight as the second count?
The point of using raw counts was to simplify the analysis by letting the network learn the patterns of counts, i.e., learning a function that approximates the complete analysis in one (very deep) step.
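Roughly, the transformation looks like this (a sketch of the idea, not the exact code in neural_network.py; log1p is my assumption for handling zeros):
log_counts <- log1p(raw_counts)             # log of the counts
scaled     <- log_counts / max(log_counts)  # divide by the max log count so values top out at 1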
Of course @danielruss, all good on that front! Just trying to get the hang of it and getting confused by references to several types of count "formats" for different parts of the pipeline.
Dear @danielruss,
I have successfully installed the DVC pipeline and started to run the analysis, but:
I checked the label_transfer.R file and I see that npcs=NULL (which is what you also mention in the Methods of the paper) and mdim=28, which now becomes a problem because the comparison NULL < 28 is tricky... When I change npcs=28 the pipeline runs without error, but I'm not sure whether that is entirely correct.
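For context, this is what makes that comparison tricky in R:
NULL < 28            # logical(0)
if (NULL < 28) TRUE  # Error: argument is of length zero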