Missing metadata columns in scenicplus object created for non multiome data

mervedede commented 11 months ago

Hello,

I have an experiment with scRNA-seq and scATAC-seq data from different cells of the same samples, so I have been following the "Tutorial: Mix of melanoma cell lines". For both datasets I have metadata with three columns: "sample ID" (individual sample IDs, matched between RNA and ATAC), "cell type" (the cell type annotations), and "response" (yes or no). When creating the scenicplus object, I use the sample ID column for the "key_to_group_by" parameter so that all of my samples are represented in the object. However, once the scenicplus object is created, it no longer has the "cell type" or "response" columns and only has "sample ID" in metadata_cell.

This is a big issue for me since I want to perform downstream comparisons, such as comparing eRegulon activity scores, based on the different cell types or the response classifications in my data. Is there a way to make sure that the other metadata columns from scRNA or scATAC are kept during the creation of the scenicplus object, just like in the 10x multiome:pbmc tutorial, in which, as you indicate, "_Cell metadata coming from the cistopicobj will be prefixed with the string ACC and metadata coming from the adata object will be prefixed with the string GEX_"? Or is there another way to link the metacells back to the original metadata annotations?

Thanks

SeppeDeWinter commented 11 months ago

Hi

These columns are missing because, after sampling cells from both modalities, the metadata for the metacells might be mixed.

Let me illustrate with an example:

Let's say your metadata looks like this:

  sample_id cell_type
0         a         x
1         a         y
2         a         x
3         b         z
4         b         x
5         b         x
6         c         w
7         c         j
8         c         j

Generating metacells based on the sample id will result in three metacells (a, b and c) with:

a having cell type annotations to: x and y
b having cell type annotations to: z and x
c having cell type annotations to: w and j

For this reason it is up to the user to assign the metadata to the metacells afterwards. Therefore, it is very important to choose the variable on which to generate metacells carefully so you are able to assign each metacell to the proper annotations.
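One way to check whether a metadata column would be ambiguous for the metacells is to count its distinct values per group. The following is a minimal pandas sketch on the toy table above; it does not use any SCENIC+ function, and the variable names are only illustrative.

```python
import pandas as pd

# Toy per-cell metadata, copied from the example table.
metadata = pd.DataFrame({
    "sample_id": list("aaabbbccc"),
    "cell_type": list("xyxzxxwjj"),
})

# Number of distinct cell types within each sample_id group.
n_cell_types = metadata.groupby("sample_id")["cell_type"].nunique()
print(n_cell_types)

# The column is unambiguous for the metacells only if every group
# contains exactly one value.
is_unambiguous = bool((n_cell_types == 1).all())
print(is_unambiguous)
```

Here every sample contains two cell types, so "cell_type" cannot be assigned to metacells built per sample without further choices by the user.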

In your case I would suggest including the "response" variable in the metacell generation (i.e. generate separate metacells for cells labeled as responders and non-responders).

I hope this clarifies things, if not feel free to ping me again.

Best,

Seppe

mervedede commented 11 months ago

Hi Seppe,

Thank you very much for your response, very helpful explanation as well as suggestions. I have some followup questions:

1) I understand that in the case of cell type, one sample will map to several cell types. But, for my "response" column for example, each sample ID will map to a single response. So if the sample ID is on the metadata_cell info, why can we not have the response column that matches to it?

2) You mentioned: "it is up to the user to assign the metadata to the metacells afterwards". Does this mean I can add metadata to metacells after the scenicplus object is created? If so, would using the scplus_obj.add_cell_data function be the best way?

3) You suggested including the "response" variable in the metacell generation, I guess instead of sample ID, if I am not misunderstanding you, because I don't think we can include both, right? If this is the case, does this mean the user would have to re-generate separate scenicplus objects, and therefore run separate downstream scenicplus analyses, for each metadata comparison they want to investigate?

Thanks

SeppeDeWinter commented 11 months ago

> I understand that in the case of cell type, one sample will map to several cell types. But, for my "response" column for example, each sample ID will map to a single response. So if the sample ID is on the metadata_cell info, why can we not have the response column that matches to it?

It is programmatically possible to do this, but unfortunately this is not how the function is implemented at the moment.

> You mentioned: "it is up to the user to assign the metadata to the metacells afterwards". Does this mean I can add metadata to metacells after the scenicplus object is created? If so, would using the scplus_obj.add_cell_data function be the best way?

Yes, you can add metadata afterwards, either by using the scplus_obj.add_cell_data function or by directly adding the data to the scplus_obj.metadata_cell DataFrame.
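As a rough illustration of the second option, the sketch below maps a per-sample "response" value onto a metacell metadata table with pandas. The `metacell_metadata` DataFrame is only a stand-in for the metadata table of a real scenicplus object, its index labels are hypothetical, and it assumes the table keeps the sample ID column, as reported above.

```python
import pandas as pd

# Original per-cell metadata (toy data); each sample has exactly one response.
cell_metadata = pd.DataFrame({
    "sample_id": ["a", "a", "b", "b", "c"],
    "response": ["yes", "yes", "no", "no", "yes"],
})

# Stand-in for the metacell metadata table of the scenicplus object.
# The index labels are hypothetical.
metacell_metadata = pd.DataFrame(
    {"sample_id": ["a", "a", "b", "c"]},
    index=["a_0", "a_1", "b_0", "c_0"],
)

# Build a sample -> response lookup and check that it is one-to-one
# before using it.
lookup = cell_metadata.drop_duplicates().set_index("sample_id")["response"]
assert lookup.index.is_unique, "a sample maps to more than one response"

# Attach the response to every metacell through its sample ID.
metacell_metadata["response"] = metacell_metadata["sample_id"].map(lookup)
print(metacell_metadata)
```

The uniqueness check matters: if a sample had more than one response, the mapping would be ambiguous for the same reason the cell type column is.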

> You suggested including the "response" variable in the metacell generation, I guess, instead of sample ID if I am not misunderstanding you because I don't think we can include both, right? If this is the case, does this mean the user would have to re-generate separate scenicplus objects and therefore separate downstream scenicplus analyses for each metadata comparison they want to investigate?

No, I'm not suggesting generating separate scenicplus objects. You could generate a third metadata variable that contains both the sample ID and the response, for example:


  sample_id response sample_id_response
0         a      yes              a_yes
1         a      yes              a_yes
2         a       no               a_no
3         b      yes              b_yes
4         b      yes              b_yes
5         b      yes              b_yes
6         c      yes              c_yes
7         c       no               c_no
8         c       no               c_no
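A combined column like this can be built with plain pandas string concatenation. This is a minimal sketch on the toy values from the table; the DataFrame name is illustrative.

```python
import pandas as pd

# Toy per-cell metadata with the sample ID and response columns.
metadata = pd.DataFrame({
    "sample_id": list("aaabbbccc"),
    "response": ["yes", "yes", "no", "yes", "yes", "yes", "yes", "no", "no"],
})

# Join both columns into a single grouping variable.
metadata["sample_id_response"] = (
    metadata["sample_id"] + "_" + metadata["response"]
)
print(metadata)
```

The new column name would then be given as the "key_to_group_by" value when creating the scenicplus object, so that each metacell belongs to exactly one sample and one response.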

best,

Seppe

aertslab / scenicplus

Missing metadata columns in scenicplus object created for non multiome data #190