@auesro I believe Seurat is using the transfer anchors as pseudo-features for the clustering space. I believe the error arises because, when you run PCA, you cannot have more principal components than features (if you are performing dimension reduction, you want fewer features, not more). It's been a while since the paper, but I think we ran it the same way as the integration vignette (I think we ran it in Seurat 2.x, so things have likely changed). So find the integration anchors, then essentially repeat the analysis on the integrated data (make sure you set the default assay).
label_transfer.R
was written by @rbpatt2019. I'm not sure why he set npcs to NULL. I use the default value; see my code below, which is essentially from the Seurat vignette.
@rbpatt2019 may respond also.
library(Seurat)
library(purrr)  # map() and the %>% pipe

data.list <- data.list %>%
  map(NormalizeData) %>%
  map(FindVariableFeatures, selection.method = "vst", nfeatures = 2000)

# select features that are repeatedly variable across datasets for integration
features <- SelectIntegrationFeatures(object.list = data.list)

# this went quick up until the last object merged; the whole thing took over 30 min, the last object alone 26 min 41 sec
int.anchors <- FindIntegrationAnchors(object.list = data.list, anchor.features = features)
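From there it is essentially the vignette again: integrate, set the default assay, and redo the dimensionality reduction. A rough sketch (the object names here are illustrative, not copied from my actual script):
data.integrated <- IntegrateData(anchorset = int.anchors)
DefaultAssay(data.integrated) <- "integrated"   # make sure the default assay is set
data.integrated <- ScaleData(data.integrated)
data.integrated <- RunPCA(data.integrated, npcs = 30)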
Hi @danielruss, thanks for the explanation.
I am more familiar with Scanpy than Seurat, but in label_transfer.R
you have:
anchors <- FindTransferAnchors(
  reference = reference,
  query = query,
  normalization.method = "LogNormalize",
  reference.assay = "integrated",
  query.assay = "RNA",
  reduction = "pcaproject",
  features = VariableFeatures(object = reference),
  npcs = NULL,
  dims = 1:28
)
And there is no mention of FindIntegrationAnchors.
...as you say, it might have changed since version 2.x, but I cannot find the old Seurat integration vignette online.
Hopefully @rbpatt2019 can share some insights!
Those Seurat guys are really good... I was able to find the original vignette
@auesro see last comment
Ahh, but you were referring to v3.0 then, that's why I did not find it! :P So basically you (and @rbpatt2019) adapted this part from there:
pancreas.query <- pancreas.list[["fluidigmc1"]]
pancreas.anchors <- FindTransferAnchors(reference = pancreas.integrated, query = pancreas.query,
    dims = 1:30)
predictions <- TransferData(anchorset = pancreas.anchors, refdata = pancreas.integrated$celltype,
    dims = 1:30)
pancreas.query <- AddMetaData(pancreas.query, metadata = predictions)
Right?
Then I wonder why npcs=NULL
instead of the default, which (at least today) is 30.
Ok, I have done some more digging.
The function FindTransferAnchors
changed between Seurat v3.1.5 (the one used in the paper) and the version I am using today inside the renv environment (v4.1.1):
- In v3.1.5: @param npcs Number of PCs to compute on reference. If null, then use an existing PCA structure in the reference object
- In v4.1.1: @param npcs Number of PCs to compute on reference if reference.reduction is not provided
And they added:
@param reference.reduction Name of dimensional reduction to use from the reference if running the pcaproject workflow. Optionally enables reuse of precomputed reference dimensional reduction. If NULL (default), use a PCA computed on the reference object.
So today, I have to specify reference.reduction="pca" and set npcs=NULL in order to replicate the intended analysis.
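In other words, something like this (a sketch: all other arguments are carried over from label_transfer.R, and it assumes the reference object already stores a PCA reduction named "pca" computed on the integrated assay):
anchors <- FindTransferAnchors(
  reference = reference,
  query = query,
  normalization.method = "LogNormalize",
  reference.assay = "integrated",
  query.assay = "RNA",
  reduction = "pcaproject",
  reference.reduction = "pca",  # reuse the precomputed PCA on the reference
  features = VariableFeatures(object = reference),
  npcs = NULL,                  # only used when reference.reduction is not provided
  dims = 1:28
)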
Do we agree?
Yes, this seems to mean that if you want the same "NULL" behavior, use reference.reduction="pca". This makes sense, as they now allow multiple reductions (think UMAP, t-SNE, CCA).
For future users: using either npcs=28 with reference.reduction=NULL, or npcs=NULL together with reference.reduction="pca", on the Russ NeuronsDoublets dataset gives almost exactly the same predictions (only 7 of the ~28,438 cells are classified differently).
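In case it is useful, this is roughly how I compared the two runs (a sketch: pred_npcs28 and pred_refpca are hypothetical names for the two TransferData outputs):
# number of cells assigned a different label by the two settings
sum(pred_npcs28$predicted.id != pred_refpca$predicted.id)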
@danielruss another question: in this repository you mention that the query dataset requires normalized counts and variable features computed. However, in label_transfer.R
you have:
counts <- GetAssayData(query_neural, assay = "RNA", slot = "counts")
Which means that for the neural network part you extract raw counts, right? That gets confusing, because in the example notebook for running only the neural network you mention "If you are running on Seurat data, your Seurat object has the matrix saved in seuratObject.data",
which normally contains normalized data... so what's the correct way to go?
Ok, I just checked more carefully.
You log-normalize the counts in the neural_network.py
script.
So if I understood correctly, we need both raw counts (for the neural network part) and normalized counts (for label transfer in Seurat) in the query dataset, right?
In that case, it might be helpful to clarify somewhere that the pipeline requires both normalized and raw counts in the query dataset.
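For anyone else reading along, both matrices live in the same Seurat object, just in different slots (a sketch: query_neural is the object name used in label_transfer.R, and it assumes NormalizeData has already been run):
raw_counts  <- GetAssayData(query_neural, assay = "RNA", slot = "counts")  # raw counts, used for the neural network
norm_counts <- GetAssayData(query_neural, assay = "RNA", slot = "data")    # log-normalized counts, used for label transfer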
@auesro The neural network is designed to run on the raw counts. I believe I actually take the log of the counts and then divide by the max log count so the maximum equals one. I tried to avoid taking the log (see O'Hara and Kotze), but it just worked better after log normalization. This of course makes sense: should the 20,001st count of a feature carry the same weight as the second count?
The point of using raw counts was to simplify the analysis by letting the network learn the patterns of counts, i.e., learning a function that approximates the complete analysis in one (very deep) step.
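Roughly, the transformation looks like this (a sketch of the idea, not the exact code in neural_network.py; log1p is my assumption for handling zeros):
log_counts <- log1p(raw_counts)             # log of the counts
scaled     <- log_counts / max(log_counts)  # divide by the max log count so values top out at 1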
Of course @danielruss, all good on that front! Just trying to get the hang of it and getting confused by references to several types of count "formats" for different parts of the pipeline.
Dear @danielruss,
I have successfully installed the DVC pipeline and started to run the analysis, but:
I checked the label_transfer.R file and I see that npcs=NULL (which is what you also mention in the Methods of the paper) and mdim=28, which now becomes a problem because the comparison NULL < 28 is tricky... When I change npcs=28 the pipeline runs without error, but I'm not sure whether that is entirely correct.
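For context, this is what makes that comparison tricky in R:
NULL < 28            # logical(0)
if (NULL < 28) TRUE  # Error: argument is of length zero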