AlexsLemonade / sc-data-integration


Add cell type information to SCE objects for HCA data #51

Closed allyhawkins closed 2 years ago

allyhawkins commented 2 years ago

In order to use scANVI in #17, we will need to obtain the cell type annotations for each of the HCA datasets. Although the HCA datasets all have cell type information, it is in no way uniform, so gathering it is not going to be a simple task. Each project submitted cell type information in its own format: some are available as h5 files, some as TSVs, and others as Excel files with a supplemental table in each sheet. The projects we are working with were all processed using the HCA pipeline, so the output we are currently working with is uniformly processed; however, the cell type annotations are not uniformly available.

One thought I had was that the cell type labels might be in the larger loom files that contain the entire integrated dataset available on the portal, but I read one of those in to check and the cell type data is not there either. I think what we will want to do is gather the annotations from HCA and make a reference TSV file that has the cell barcodes, cell types, library ID, and project name, and then use that to create a cell type column for all of the SCE objects. Cell type labels will also be helpful for metrics, so this is something we probably want to do for benchmarking regardless of integration type.
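As a rough illustration of the reference TSV idea above, here is a minimal sketch in Python. It is not the project's implementation: the SCE objects are R objects, so plain dictionaries stand in for the per-cell `colData` rows, and the column names (`project`, `library_id`, `barcode`, `celltype`) and all values are assumptions made up for the example. The lookup is keyed on library ID plus barcode because the same barcode can appear in more than one library.

```python
import csv
import io

# Hypothetical reference TSV; column names and values are illustrative only.
REFERENCE_TSV = (
    "project\tlibrary_id\tbarcode\tcelltype\n"
    "project_a\tlib1\tAAAC\tT cell\n"
    "project_a\tlib1\tAAAG\tB cell\n"
    "project_a\tlib2\tAAAC\tmonocyte\n"
)


def load_reference(handle):
    """Map (library_id, barcode) -> cell type from a reference TSV."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {(row["library_id"], row["barcode"]): row["celltype"] for row in reader}


def add_celltypes(cells, reference, missing="NA"):
    """Return copies of per-cell metadata rows with a 'celltype' field added.

    Cells with no entry in the reference get the `missing` placeholder.
    """
    return [
        {**cell, "celltype": reference.get((cell["library_id"], cell["barcode"]), missing)}
        for cell in cells
    ]


reference = load_reference(io.StringIO(REFERENCE_TSV))

# Stand-in for the per-cell metadata of one or more objects.
cells = [
    {"library_id": "lib1", "barcode": "AAAC"},
    {"library_id": "lib2", "barcode": "AAAC"},
    {"library_id": "lib2", "barcode": "TTTT"},  # not in the reference
]
annotated = add_celltypes(cells, reference)
```

Cells missing from the reference keep an explicit placeholder rather than being dropped, so the annotated table stays the same length as the original metadata.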

allyhawkins commented 2 years ago

Since #84 was filed, we now want to address this and add a column with the ground truth cell types to the colData slot of the SCE objects. I updated the title of this issue to show that this is independent of scANVI.

allyhawkins commented 2 years ago

Also, we may want to start by tackling one project first and creating a reference cell type TSV from the supplementary information available on HCA. From there we can create a script that reads in the reference TSV and merges the cell type information with the SCE object. Once we establish a way to systematically assign cell types, we could gather the cell type information from the rest of the projects; that will have to be done project by project, since how the information is stored differs for each one. Definitely open to other ideas if others have opinions!

jaclyn-taroni commented 2 years ago

What do you mean by "systematically assign cell types" here @allyhawkins? Since we expect that it will be stored differently for each project in HCA, I'm not sure I follow.

allyhawkins commented 2 years ago

Just noting here what we discussed in DSTM. We first plan to make sure that the cell type information is all in the same format, converting each project's annotations from the format they are in on HCA to a standard format. Then we can use the same function or script to grab the cell type information from that standardized data and incorporate it into the SCE object.
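To make the "convert to a standard format first" step concrete, here is a small sketch in Python, offered only as one possible shape for it. Everything named here is hypothetical: the project names, the source column names in `COLUMN_MAPS`, and the standard columns themselves are assumptions for illustration, and the rows are assumed to have already been read from whatever file type each project provided.

```python
import csv
import io

# Assumed standard layout for the reference file.
STANDARD_COLUMNS = ["project", "library_id", "barcode", "celltype"]

# Hypothetical per-project mappings from source column names to standard ones.
COLUMN_MAPS = {
    "project_a": {"sample": "library_id", "cell_barcode": "barcode", "annotation": "celltype"},
    "project_b": {"library": "library_id", "bc": "barcode", "cluster_label": "celltype"},
}


def standardize(rows, project):
    """Rename one project's columns to the standard names and tag the project."""
    column_map = COLUMN_MAPS[project]
    records = []
    for row in rows:
        record = {"project": project}
        for source_name, standard_name in column_map.items():
            # Strip stray whitespace left over from spreadsheets or TSVs.
            record[standard_name] = row[source_name].strip()
        records.append(record)
    return records


def write_reference(records, handle):
    """Write standardized records as a tab-separated reference file."""
    writer = csv.DictWriter(
        handle, fieldnames=STANDARD_COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)


# Rows as they might look after reading each project's own file.
project_a_rows = [{"sample": "lib1", "cell_barcode": "AAAC", "annotation": "T cell "}]
project_b_rows = [{"library": "lib9", "bc": "GGGT", "cluster_label": "neuron"}]

records = standardize(project_a_rows, "project_a") + standardize(project_b_rows, "project_b")
buffer = io.StringIO()
write_reference(records, buffer)
reference_text = buffer.getvalue()
```

With this split, only the per-project mapping (and the file reader) changes from project to project, while the downstream step that adds cell types to the objects can rely on a single layout.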

allyhawkins commented 2 years ago

Closed by #99