Danko-Lab / TED

a fully Bayesian approach to deconvolve tumor microenvironment

Suggestions on building own single cell reference #29

Closed Tianqi-Ma closed 2 years ago

Tianqi-Ma commented 2 years ago

Hi, Tinyi

I have been playing around with this package recently. It's great work, but I have a few questions.

My goal is to deconvolve bulk RNA-seq blood data to identify potential cancer cells in it, and I would like to build my own single-cell reference. I downloaded public single-cell data for adult peripheral blood (with annotated cell types) and for SKBR3 (cancer cell line), and made each into a separate Seurat object. I then merged the two Seurat objects with the merge function and extracted the raw count matrix with the GetAssayData function. I used Seurat to combine the two datasets because I couldn't find a way to combine two matrices with different numbers of rows and columns (forgive my limited coding experience).

Then I read this issue: https://github.com/Danko-Lab/TED/issues/26. I decided to follow the second piece of advice as well, but I don't know how to collapse the data.

So my questions are:

  1. Does my plan for building a single-cell reference sound valid?
  2. Is there a better way to combine two matrices than going through Seurat objects as I did?
  3. The gene names in single-cell data are usually gene symbols, but the gene names in the query data are ENSEMBL IDs. Do they need to be consistent between the reference and query datasets?
  4. Are there any general suggestions or principles for building a reference, such as cell type proportions or number of cells?
  5. Would you please show me the code for collapsing the single-cell data? Let's say the rows are genes, the columns are cell barcodes, and annotated cell types are provided. (Please ignore this one if it's too rude.)

Thanks in advance for your reply.

Tianqi-Ma commented 2 years ago

BTW, this error keeps showing up.

> tcga.ted <- run.Ted (ref.dat= ref.dat.filtered,
+                      X=testData,
+                      cell.type.labels= cell.type.labels,
+                      cell.subtype.labels= cell.subtype.labels,
+                      tum.key="SKBR3",
+                      input.type="scRNA",
+                      n.cores=20,
+                      first.gibbs.only=T,
+                      pdf.name='scRNA_ref')
[1] "removing non-numeric genes..."
[1] "removing outlier genes..."
Number of outlier genes filtered= 19
[1] "aligning reference and mixture..."
[1] "run first sampling"
Start run... This may take a while
R Version:  R version 4.1.2 (2021-11-01)

snowfall 1.84-6.1 initialized (using snow 0.4-4): parallel execution on 20 CPUs.

Error in checkForRemoteErrors(val) :
  20 nodes produced errors; first error: subscript out of bounds

Is it from my ref data?

tinyi commented 2 years ago

Hi Tianqi,

Thank you for your interest in our work.

You may try the new version at https://github.com/Danko-Lab/BayesPrism.

To answer your questions:

Does my plan for building single cell reference sound valid?

Naively concatenating two scRNA-seq datasets from different batches or platforms (e.g. 10x and Smart-seq) may introduce unwanted noise due to batch effects, so it is not recommended.
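
On the purely mechanical side of question 2, Seurat is not strictly needed to put two count matrices side by side. The following is a minimal sketch in base R, assuming both matrices are genes x cells with gene identifiers as row names. The toy matrices, gene names, and cell type labels are made up for illustration, and whether merging is appropriate at all still depends on the batch-effect caveat above.

```r
# Toy genes-x-cells count matrices; all names and values are illustrative.
blood <- matrix(c(5, 0, 2,
                  1, 3, 0,
                  0, 4, 7),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("GeneA", "GeneB", "GeneC"),
                                c("blood_1", "blood_2", "blood_3")))
skbr3 <- matrix(c(9, 1,
                  0, 6,
                  2, 2),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("GeneB", "GeneC", "GeneD"),
                                c("skbr3_1", "skbr3_2")))

# Hypothetical annotations for the blood cells, one per column.
blood_labels <- c("T cell", "B cell", "T cell")

# Keep only genes present in both datasets, in a common order.
shared_genes <- intersect(rownames(blood), rownames(skbr3))

# Column-bind the two matrices restricted to the shared genes.
combined <- cbind(blood[shared_genes, , drop = FALSE],
                  skbr3[shared_genes, , drop = FALSE])

# One cell type label per column of the combined matrix, in column order.
cell_type_labels <- c(blood_labels, rep("SKBR3", ncol(skbr3)))
stopifnot(length(cell_type_labels) == ncol(combined))
```

Genes found in only one dataset are dropped here. As far as I understand the TED and BayesPrism documentation, the reference is expected with cells in rows and genes in columns, so the result would likely need `t(combined)` before use.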

The gene names in single cell data are usually gene symbols but the query data's gene names are ENSEMBL ID. Does it need to be consistent between ref and query dataset?

Yes.
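
One way to make the identifiers consistent is to rename the reference rows through a symbol-to-ENSEMBL lookup table. This is a minimal sketch in base R, assuming a genes x cells matrix with gene symbols as row names. The `id_map` table is a hypothetical three-gene example; in practice it would come from an annotation resource that matches the genome build of the bulk data.

```r
# Toy genes-x-cells matrix with gene symbols as row names (values are made up).
counts <- matrix(1:8, nrow = 4,
                 dimnames = list(c("GAPDH", "ACTB", "TP53", "NOT_A_GENE"),
                                 c("cell_1", "cell_2")))

# Hypothetical lookup table for illustration only.
id_map <- data.frame(
  symbol  = c("GAPDH", "ACTB", "TP53"),
  ensembl = c("ENSG00000111640", "ENSG00000075624", "ENSG00000141510"),
  stringsAsFactors = FALSE
)

# Look up the ENSEMBL ID for each row; unmapped symbols become NA.
ensembl_ids <- id_map$ensembl[match(rownames(counts), id_map$symbol)]

# Drop genes with no mapping, and genes whose ID occurs more than once.
ambiguous <- ensembl_ids %in% ensembl_ids[duplicated(ensembl_ids)]
keep <- !is.na(ensembl_ids) & !ambiguous

counts_ensembl <- counts[keep, , drop = FALSE]
rownames(counts_ensembl) <- ensembl_ids[keep]
```

Dropping ambiguous genes is only one possible policy; summing the counts of rows that map to the same ID is another.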

Are there any suggestions/principles on building reference? Like cell type proportion or number of cells?

Please refer to the vignette in the new package.

Would you please show me the code for collapsing the single cell data (let's say the rows are genes and columns are cell barcodes, and annotated cell types are provided)?

If you start with the raw count matrix, there is no need to collapse it manually. The function new.prism automatically takes care of it.
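
For readers who only want to see what collapsing means, here is a minimal sketch in base R that sums raw counts over all cells of each annotated cell type. It assumes the layout from the question (genes in rows, cell barcodes in columns) and uses made-up counts and labels. It illustrates the idea and is not the code that new.prism runs internally.

```r
# Toy genes-x-cells raw count matrix (values and names are made up).
counts <- matrix(c(1, 2, 3, 4,
                   0, 5, 0, 1,
                   2, 2, 2, 2),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("GeneA", "GeneB", "GeneC"),
                                 paste0("cell_", 1:4)))

# One annotated cell type per column, in column order.
cell_types <- c("T cell", "SKBR3", "T cell", "SKBR3")
stopifnot(length(cell_types) == ncol(counts))

# rowsum() sums rows by group, so transpose to put cells in rows,
# sum per cell type, then transpose back to genes x cell types.
collapsed <- t(rowsum(t(counts), group = cell_types))

print(collapsed)
```

The result has one column per cell type, with columns ordered by the sorted group names.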

Best,

Tinyi

Tianqi-Ma commented 2 years ago

Thank you for replying. I will give that a shot.