broadinstitute / infercnv

Inferring CNV from Single-Cell RNA-Seq
Other
554 stars 162 forks source link

Infercnv and seurat (related to issue #215) #462

Open jjacob12 opened 1 year ago

jjacob12 commented 1 year ago

Hello All, I am excited by infercnv but have some rather basic questions for you, reflecting my lack of a strong bioinformatics/programming background. Please bear with me...

I posted this question under the already closed issue #215, and thought it best to open a new issue. Hope that is ok.

I wish to use infercnv to separate out cancer (Medulloblastoma) cells co-cultured with wild-type cerebellar organoids in a scRNA-seq experiment with thousands of cells. I have the following conditions and controls each with associated Seurat objects:

  1. wild-type cerebellar organoid (generated from iPSC differentiation protocol)
  2. medulloblastoma cell line grown as a monolayer
  3. medulloblastoma cell line grown as a spheroid
  4. medulloblastoma cells co-cultured with wild-type cerebellar organoid - it is for this condition I would like to separate cancer cells from the wild-type cells. The wiki for infercnv mentions several inputs:

raw_counts_matrix: the code to generate this matrix the Seurat object is shown. Would that be for condition 4?

annotations_file: There is an example annotations file on (https://github.com/broadinstitute/inferCNV/wiki/File-Definitions), but it is not clear to me how to generate this file. This question was asked by @jdriver509 in issue #215, but I would be very grateful for some additional help. The simplest option would be to use the data from the Seurat object for one or more of the 4 conditions above - reading further in #215, one of @GeorgescuC answers recommends using the active.ident field to generate the annotation table. The active.ident field has the cell IDs and their cluster allocation. Could you please tell me how?

the gene_order_file input is provided by the Broad Institute.

ref_group_names is the last input. To keep things simple I wish to assign all the WT cerebellar organoid cells as 'normal'. In the 'annotations_file' do I simply populate the 2nd column with 'normal' for all the WT cell IDs? Obviously, I can only do that for the WT cerebellar organoid, but is that the right matrix to use for the 'raw_counts_matrix' input? Thanks in advance, and apologies for the long question.

GeorgescuC commented 1 year ago

Hi @jjacob12 ,

When running infercnv, it is best to have a set of cells of the same cell types you are looking at and that you know are healthy to use them as a reference. In your case, the cells from conditions 1 appear to be exactly that, so they would be perfect for a reference. Infercnv also accepts either files to read from disk or matrix objects already loaded in R as inputs.

For your case, you can take the data from both conditions 1 and 4 and use both combined into a single matrix as input to infercnv. To do that, extract the raw count matrices from both seurat objects as shown and simply combine them. cbind() should be enough as long as you have cells as columns and genes as rows, but you might have some genes only present in one of the matrices, in which case you should add those genes filled with 0s to the missing matrices.

To access the active.ident of the Seurat object you can simply use seurat_obj[["active.ident"]] but if that is not defined or not the information you expected, you can select a different field from seurat_obj@meta.data that contains the information you want. The format of this file/matrix simply needs to have the same cell names that are in your input matrix in the first column or as row names, and the annotation in the (second) column. So yes, you can simply assign "normal" to all WT cells from conditions 1, although you can also be more precise if you have identified different cell types among them.

I am not sure how clear this explanation is so if you have any questions or doubts feel free to ask.

Regards, Christophe.

jjacob12 commented 1 year ago

Hi @GeorgescuC, Many thanks for getting back to me. Yes, your instructions were clear! I think I have prepared all the input files. For the counts matrix I combined the data from conditions 1 and 4 into a single matrix as you recommended, taking into account that some genes are expressed only in one matrix not the other (using merge).

For the sample annotation file, I have extracted the metadata - for the 'reference' condition, condition 1, I have changed orig.ident in Seurat_obj metadata to 'normal' (these are wild type cells). For condition 4, I have changed orig.ident to 'cancer' although there is a mixture of wild type cells (same as condition 1) and cancer cells in this sample. For the input ref_group_names, which I am unsure is definitely needed, I propose: ref_group_names=c("normal", "cancer"). Does that all seem correct? Best, John

jjacob12 commented 1 year ago

@GeorgescuC Hi Christophe, just noticed a mistake in my code:ref_group_names=c("normal") is what I need.

jjacob12 commented 1 year ago

Hi @GeorgescuC,

I'm running into problems with the count matrix. This is the code I wrote:

# create the infercnv object
infercnv_obj <- CreateInfercnvObject(raw_counts_matrix = "merge.cbo.ons76cbo.matrix",
                                    annotations_file = "sample.annot.cbo.ons76cbo.txt",
                                    delim = "\t",
                                    gene_order_file = "gencode_v21_gene_pos")

Although the 'raw_counts_matrix' is loaded in RStudio (as are all the required input files), when I attempt to run the code it throws an error:

INFO [2022-09-22 21:13:49] Parsing matrix: merge.cbo.ons76cbo.matrix
Warning in file(file, "rt") :
  cannot open file 'merge.cbo.ons76cbo.matrix': No such file or directory
Error in file(file, "rt") : cannot open the connectionINFO [2022-09-22 21:13:49] Parsing matrix: merge.cbo.ons76cbo.matrix

This is the code I used to generate the count matrix:

write.table(merge.cbo.ons76cbo.matrix, 
            file = "./outputs50K/infercnv/merge.cbo.ons76cbo.matrix",
            row.names = TRUE,
            col.names = TRUE,
            sep = "\t")

To load the count matrix I used:

# load the 'merge.cbo.ons76cbo.matrix' file -
# this is the INPUT COUNT MATRIX FOR INFERCNV.
merge.cbo.ons76cbo.matrix <- read.table("outputs50K/infercnv/merge.cbo.ons76cbo.matrix",
                                        header = TRUE)

This is an image of the count matrix ('merge.cbo.ons76cbo.matrix'):

Screenshot 2022-09-22 at 21 17 11

I'm struggling to see where I have gone wrong. Can you spot the error(s)? Thanks, John

jjacob12 commented 1 year ago

Hi @GeorgescuC, Sorry to bombard you... Better news though. I made some progress on the inputs to the CreateInfercnvObject command. The files are being read in with some tweaks to the earlier code:

infercnv_obj <- CreateInfercnvObject(raw_counts_matrix = "outputs50K/infercnv/merge.cbo.ons76cbo.txt",
                                    annotations_file = "outputs50K/infercnv/sample.annot.cbo.ons76cbo.txt",
                                    delim = "\t",
                                    gene_order_file = "working_data/infercnv_data_public/gencode_v21_gen_pos.complete.txt",
                                    ref_group_names = c("normal"))

However, there is another problem now. Although the code runs, I get this output in RStudio from running the above code:

INFO [2022-09-23 18:10:48] Parsing matrix: outputs50K/infercnv/merge.cbo.ons76cbo.txt
INFO [2022-09-23 18:11:11] Parsing gene order file: working_data/infercnv_data_public/gencode_v21_gen_pos.complete.txt
INFO [2022-09-23 18:11:11] Parsing cell annotations file: outputs50K/infercnv/sample.annot.cbo.ons76cbo.txt
 Show Traceback
Error in CreateInfercnvObject(raw_counts_matrix = "outputs50K/infercnv/merge.cbo.ons76cbo.txt", : Please make sure that all the annotated cell names match a sample in your data matrix. Attention to: 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249

I wonder if this error is telling me that I need to add labels (e.g. "normal", "cancer") to the cells (i.e. column names) in the matrix? Looking at the infercnv Github on inputs to infercnv, the patient sample details seem to be encoded in the count matrix column (i.e. cell) name. However, I am not sure exactly what change needs to be made to the matrix and whether any changes need to be made also to the sample annotation file. I attach a couple of snapshots of the files so you can get a clearer idea.

First, the matrix snapshot (the column names refer only to the cells - recall this is the 10X platform):

Screenshot 2022-09-23 at 18 48 22

Next, the sample annotation file:

Screenshot 2022-09-23 at 18 51 28

What should I do about this error message? Many thanks for your help!

John

jjacob12 commented 1 year ago

Hi @GeorgescuC , Please don't worry about answering the last couple of posts. I have fixed the issues mentioned, but there is a new one - the genomic position file I downloaded from the Broad (https://data.broadinstitute.org/Trinity/CTAT/cnv/) does not match the genes in my count matrix. I am now trying to fix this issue with help from issue #22. Thanks, John

jjacob12 commented 1 year ago

Hi @GeorgescuC, I have made progress with the data wrangling but am still stuck with the inputs to CreateInfercnvObject. When I run this command using the below code:

# create the infercnv object
infercnv_obj <- CreateInfercnvObject(raw_counts_matrix = "outputs50K/infercnv/merge.cbo.ons76cbo.txt",
                                    annotations_file = "outputs50K/infercnv/sample.annot.cbo.ons76cbo.txt",
                                    delim = "\t",
                                    gene_order_file = "working_data/infercnv_data_public/gene_position_file.txt",
                                    ref_group_names = c("normal"))

I get the following output:

INFO [2022-09-25 17:18:23] Parsing matrix: outputs50K/infercnv/merge.cbo.ons76cbo.txt
INFO [2022-09-25 17:18:49] Parsing gene order file: working_data/infercnv_data_public/gene_position_file.txt
Error in read.table(gene_order_file, header = FALSE, row.names = 1, sep = "\t", : duplicate 'row.names' are not allowed

To create the gene_position_file.txt I used the biomaRt package, which took as input ('values=' in getBM function) a vector of the genes in the count matrix. Recall there were 21,649 genes in the raw count matrix. When I searched for the genome positions of those genes, biomaRt outputted a table with ~18,000 genes (rows), which I am a bit confused about, but apparently this discrepancy can happen looking at posts on Bioconductor. I can send the code I used if necessary.

It might be quicker to fix if you can see the input files. I am happy to send these by email to you if that can be arranged.

Thanks, John

GeorgescuC commented 1 year ago

Hi @jjacob12 ,

From the error message, it appears there are duplicate gene names, so I would first check if that is the case in the data you write to file or an issue in the settings used to write the file. Infercnv also accepts directly the gene order as an R object in either "dgCMatrix", "matrix" or "data.frame" formats, so you could try that instead to bypass the file step.

If the issue persists, my email can be found in the DESCRIPTION file.

Regards, Christophe.

jjacob12 commented 1 year ago

Hi @GeorgescuC I think I'm finally getting somewhere. Got rid of all the duplicates, which turned out to be weird genes (e.g. predicted ncRNA genes) without proper HUGO names. In the chromosome ID field of the gene_position file (generated using 'biomaRt') there was no "chr" prefix before the chromosome number. Once I added this, the 'CreateInfercnvObject()' worked. Fingers crossed for the next step!

jjacob12 commented 1 year ago

Hi @GeorgescuC, I've now got the cnv plot, with a infercnv::run() time of around 2.5 hours. I think that is expected. I'm a little confused about the 'cluster_by_groups' argument. From reading 'Interpreting the figure' on the website, I think that argument should be set to FALSE (please see my conditions in the first post). The 'observations' dendrogram/heatmap in my case is made up of a mix of normal cells and some cancer cells, in a single sample, so 'cluster_by_groups' = FALSE seems the right choice. Is that correct? Thanks, John

jjacob12 commented 1 year ago

Hi @GeorgescuC, I am attaching the cnv plot (the final one not the preliminary) using cluster_by_groups= FALSE (visually, I could not see much of a difference whether the latter argument was set to TRUE or FALSE). There is a clear difference between the reference cells and the observation cells. However, I can see small-ish regions in the reference cells that could represent CNVs?? Looking at the GitHub page on 'interpreting the figure', the reference cells in my figure do not look as 'clean' (i.e. free of CNVs) as in your Github figure. Is this put down to, e.g. noise in the reference sample, or is there a high probability of (inferred) CNVs in the reference sample?

To my eye, the observation cells which might be construed to be cancer cells are different for different inferred CNVs. I would expect a group of adjacent cells in the dendrogram to share a common set of CNVs, and did not expect the heterogeneous pattern that is seen, where different observation cells have non-matching CNV patterns. Could that have something to do with the parameters of the ‘CreateInfercnvObject()’ function?

In deciding what are cancer cells vs normal cells in the 'observations' sample, it looks like 'infercnv_obj@observation_grouped_cell_indices' allows the extraction of the cell IDs of the presumed cancer cells. Can the cancer cluster(s) be superimposed visually on a Seurat UMAP plot, and annotated as such? One of the problems I foresee is how to match the CNV plot of the cancer cells (actually ONS-76 medulloblastoma cell line) to either the CCLE or the Sanger collection of cell lines. On these websites, I have not found CNV heat maps for cell lines, with genes in chromosomal order in columns, which would make comparison so much easier. Do you know if either of those collections provide this functionality? Many thanks, John

infercnv

jjacob12 commented 1 year ago

Hi Christophe,

I am attaching the final cnv plot using cluster_by_groups= FALSE (visually, I could not see much of a difference whether the latter argument was set to TRUE or FALSE). There is a difference between the reference cells and the observation cells. However, I can see small-ish regions in the reference cells that could represent CNVs?? Looking at the GitHub page on 'interpreting the figure', the reference cells in my figure do not look as 'clean' (i.e. free of CNVs) as in your Github figure. Is this put down to, e.g. noise in the reference sample, or is there a high probability of (inferred) CNVs in the reference sample?

To my eye, the observation cells which might be construed to be cancer cells are different for different inferred CNVs. I would expect a group of adjacent cells in the dendrogram to share a common set of CNVs, and did not expect the heterogeneous pattern that is seen, where different observation cells have non-matching CNV patterns. Could that have something to do with the parameters of the ‘CreateInfercnvObject()’ function?

In deciding what are cancer cells vs normal cells in the 'observations' sample, it looks like @.***_grouped_cell_indices' allows the extraction of the cell IDs of the presumed cancer cells. Can the cancer cluster(s) be superimposed visually on a Seurat UMAP plot, and annotated as such? One of the problems I foresee is how to match the CNV plot of the cancer cells (actually ONS-76 medulloblastoma cell line) to either the CCLE or the Sanger collection of cell lines. On those websites, I have not found CNV heat maps for cell lines, with genes in chromosomal order in columns, which would make comparison so much easier. Do you know if either of those collections provide this functionality?

Many thanks, John

P.S. I have also posted this question in the GitHub issue page, as not sure which one you will see first. Hope you don’t mind.

On 27 Sep 2022, at 22:50, GeorgescuC @.***> wrote:

Hi @jjacob12 https://github.com/jjacob12 ,

From the error message, it appears there are duplicate gene names, so I would first check if that is the case in the data you write to file or an issue in the settings used to write the file. Infercnv also accepts directly the gene order as an R object in either "dgCMatrix", "matrix" or "data.frame" formats, so you could try that instead to bypass the file step.

If the issue persists, my email can be found in the DESCRIPTION file.

Regards, Christophe.

— Reply to this email directly, view it on GitHub https://github.com/broadinstitute/infercnv/issues/462#issuecomment-1260090981, or unsubscribe https://github.com/notifications/unsubscribe-auth/APEFENFNPCUL563G36B7EP3WANT2HANCNFSM6AAAAAAQRLMBUM. You are receiving this because you were mentioned.

GeorgescuC commented 1 year ago

Hi @jjacob12 ,

The cluster_by_groups option has no impact in your run as there is a single annotation group in the observations (called “cancer”). The option is there for cases where you have multiple annotations groups in your observations (could be patients or cell types for example) to indicate whether they should be treated each on their own or together for subclustering and HMM groups purposes.

From your final plot, it is clear that most of the patterns seen in the observations are not unique to them but are also present in a subset of your references. This can mean that your references are not of a single cell type, that they are not all healthy/normal, or that a subset of the cells that appear healthy/normal already have some of the CNVs present in the cancer cells. Based on the description of your inputs with combining conditions 1 and 4, with condition 1 as the reference, it is likely to be the first case, where you have a few different cell types in your sample, but I would double check the inputs.

To take this into account, you can either run some form of cell clustering in Seurat and export the results as the annotation names, or use infercnv’s num_ref_groups option to separate your references into a given number of clusters, likely at least 3 in your case (although the few very noisy/dissimilar cells in the middle of the reference plot might throw it off slightly). This should result in references with significantly less signal (as each group of references has its average taken as baseline expression) and the matching signal being removed from the observations.

There is however signal seen in your observations that is also seen in small subsets of cells in your reference, such as that gain signal in chr12 or the dual signal seen in chr20, that might not be identified properly when splitting the references based on the hclust alone as num_ref_groups uses.

Regards, Christophe.

jjacob12 commented 1 year ago

Hi @GeorgescuC Thanks very much. I think I reached similar conclusions to you on looking at my CNV plot. I used an alternative strategy to find the cancer cells in condition 4 (based on expression of other markers in a subtype of healthy vs presumed cancer cells, and a correlation analysis comparing gene expression in that cluster of differentiated wild type cells from condition 1 vs. a counterpart cohort of differentiating cancer cells in condition 4 that have a somewhat similar gene expression profile). My idea now is to just focus on the subset of cells which the analyses described above strongly suggest are cancer cells, and use just those as the 'observation' group. For the 'reference group', I will use your suggestions to choose those cells. Best, John