Create a menu of scpca projects to be considered for integration

allyhawkins commented 2 years ago

As we evaluate integration methods and test subsets of datasets, it would be helpful to see the exact breakdown of ScPCA projects and how much of each disease, 10X platform, etc. is included. Right now, we plan to integrate only data from the same project, disease type, sequencing unit, and 10X platform, so we want to see what that breakdown looks like. From there we may pick a few projects to test the integration workflow on, removing the Python methods before testing.

We can use https://github.com/AlexsLemonade/sc-data-integration/issues/3#issuecomment-1130239566 as a reference for building this table.

allyhawkins commented 2 years ago

I've gone ahead and summarized the ScPCA projects based on the libraries that we have received thus far and that are currently accounted for in scpca-library-metadata.tsv and scpca-sample-metadata.tsv. I grouped all of the libraries by submitter, project ID, diagnosis, technology, and sequencing unit, and totaled how many libraries were found in each group. I attached a table with the entire summary, which shows the distribution of libraries for each project. For the most part, each project has one main group, where the majority of samples share a single diagnosis, technology, and seq unit, plus a handful of other libraries that are a mix of other types.

I took the larger table and filtered it to show only the top two groups of libraries per project, where a group is one diagnosis/tech/seq unit combination, and dropped any group with two or fewer libraries. That is the table shown here, and I think it is what we would want to start with. I did not include libraries that were used in multiplexing and only considered single-cell/single-nuclei samples. Additionally, a project with NA for the ID is one we have not processed yet.

submitter	scpca_project_id	diagnosis	seq_unit	n	technology
collins	NA	Osteosarcoma	nucleus	10	10Xv2_5prime
dyer_chen	SCPCP000004	Neuroblastoma	nucleus	14	10Xv3.1
dyer_chen	SCPCP000004	Neuroblastoma	cell	5	10Xv2
dyer_chen	SCPCP000005	Rhabdomyosarcoma	nucleus	27	10Xv3
dyer_chen	SCPCP000005	Rhabdomyosarcoma	cell	13	10Xv2
dyer_chen	NA	Retinoblastoma	cell	28	10Xv2
dyer_chen	NA	Retinoblastoma	nucleus	7	10Xv3.1
gawad	SCPCP000007	Acute myeloid leukemia	cell	26	10Xv2_5prime, CITEseq_10Xv2
green_mulcahy_levy	SCPCP000001	Glioblastoma	cell	16	10Xv3
green_mulcahy_levy	SCPCP000002	Pilocytic astrocytoma	cell	18	10Xv3
green_mulcahy_levy	SCPCP000002	Ganglioglioma	cell	5	10Xv3
mullighan	SCPCP000008	B-cell acute lymphoblastic leukemia	cell	94	10Xv2_5prime
mullighan	SCPCP000008	Mixed phenotype acute leukemia	cell	6	10Xv2_5prime
murphy_chen	SCPCP000006	Wilms tumor	nucleus	40	10Xv3.1
pugh	NA	Low-grade glioma/astrocytoma (WHO grade I/II)	nucleus	22	10Xv2_5prime
pugh	NA	Ganglioglioma	nucleus	9	10Xv2_5prime
teachey_tan	SCPCP000003	Early T-cell precursor T-cell acute lymphoblastic leukemia	cell	31	10Xv3, CITEseq_10Xv3
teachey_tan	SCPCP000003	Non-early T-cell precursor T-cell acute lymphoblastic leukemia	cell	11	10Xv3, CITEseq_10Xv3

Below is the code used to generate both this abbreviated table and the attached complete table:

# read in sample and library metadata and filter to relevant columns
library_df <- readr::read_tsv(library_metadata_file) |>
  dplyr::select(scpca_sample_id, scpca_library_id, technology, seq_unit)
sample_df <- readr::read_tsv(sample_metadata_file) |> 
  dplyr::select(scpca_sample_id, scpca_project_id, submitter, diagnosis)

# get list of library ids that are hashed
cell_hash_libraries <- library_df |>
  dplyr::filter(technology %in% "cellhash_10Xv3.1") |>
  dplyr::pull(scpca_library_id)

# combine library info with sample data and filter to only single-cell/nucleus
all_info_df <- library_df |>
  dplyr::left_join(sample_df, by = "scpca_sample_id") |>
  dplyr::filter(seq_unit %in% c("cell", "nucleus"),
                !scpca_library_id %in% cell_hash_libraries)

summarized_info <- all_info_df |> 
  # count total number of libraries
  dplyr::count(submitter, scpca_project_id, diagnosis, technology, seq_unit) |>
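  # collapse CITE-seq rows into their counterpart RNA rows so they don't show up as duplicate rows
  # (grouping on n assumes both technologies have the same library count within a group)
  dplyr::group_by(submitter, scpca_project_id, diagnosis, seq_unit, n) |>
  dplyr::summarise(
    technology = paste(unique(technology), collapse = ", "),
    .groups = "drop"
  ) |>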
  # sort by submitter and then total libraries
  dplyr::group_by(submitter, scpca_project_id) |>
  dplyr::arrange(dplyr::desc(n), .by_group = TRUE)

# keep only the top 2 groups per submitter/project, then drop groups with 2 or fewer libraries
filtered_summarized_info <- summarized_info |> 
  dplyr::slice(1:2) |> 
  dplyr::filter(n > 2)

# write out full table 
readr::write_tsv(summarized_info, "scpca-project-summary.tsv")

# write out filtered table
readr::write_tsv(filtered_summarized_info, "scpca-filtered-project-summary.tsv")
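One caveat with the summary above: merging CITE-seq rows by grouping on n only works when the CITE-seq and RNA rows have identical counts. A possible alternative, sketched here on made-up toy data, is to collapse technologies within each library first and then count libraries. This assumes that a CITE-seq run shares its scpca_library_id with the matching RNA run (as the cell hashing filter above suggests); the submitter, project, and library values below are hypothetical.

```r
# Toy stand-in for all_info_df; every value here is made up for illustration
toy_info_df <- tibble::tribble(
  ~submitter, ~scpca_project_id, ~diagnosis, ~seq_unit, ~scpca_library_id, ~technology,
  "lab_a",    "P1",              "AML",      "cell",    "L1",              "10Xv2_5prime",
  "lab_a",    "P1",              "AML",      "cell",    "L1",              "CITEseq_10Xv2",
  "lab_a",    "P1",              "AML",      "cell",    "L2",              "10Xv2_5prime",
  "lab_a",    "P1",              "AML",      "cell",    "L2",              "CITEseq_10Xv2",
  "lab_a",    "P1",              "AML",      "cell",    "L3",              "10Xv2_5prime"
)

# Step 1: one row per library, with all of its technologies pasted together
per_library <- toy_info_df |>
  dplyr::group_by(submitter, scpca_project_id, diagnosis, seq_unit, scpca_library_id) |>
  dplyr::summarise(
    technology = paste(sort(unique(technology)), collapse = ", "),
    .groups = "drop"
  )

# Step 2: count distinct libraries per group, largest groups first
summary_by_library <- per_library |>
  dplyr::count(submitter, scpca_project_id, diagnosis, seq_unit, technology) |>
  dplyr::arrange(dplyr::desc(n))

print(summary_by_library)
```

With this approach n counts distinct libraries rather than metadata rows, and a library that lacks a CITE-seq run (L3 in the toy data) lands in its own group instead of preventing the merge.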

Some initial thoughts when looking at this and based on our conversation yesterday... I think we should consider testing one or more datasets where we may be able to validate integration with cell type assignments. We could probably pick any of the blood datasets to do this, but we have already started doing a lot with the Gawad data, and it looks to be about the average size, in number of libraries, of what we are going to have. So we might consider starting with that dataset: run automatic cell type annotation before and after integration and test integration there. If we did this, we would probably also want to try multiple projects that are blood datasets to make sure results are consistent. But I know that cell type annotation comes with its own set of challenges.
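As a rough illustration of how annotations could be compared, here is a minimal base R sketch that scores agreement between two label vectors for the same cells, for example cell types assigned before integration versus cell types (or clusters) assigned after. It uses the adjusted Rand index, which is just one possible metric and not something we have settled on; the function name and toy labels are hypothetical.

```r
# Adjusted Rand index between two labelings of the same cells.
# 1 means the two partitions are identical; values near 0 mean chance-level agreement.
# Returns NaN in the degenerate case where both labelings put every cell in one group.
adjusted_rand_index <- function(labels_a, labels_b) {
  stopifnot(length(labels_a) == length(labels_b))
  contingency <- table(labels_a, labels_b)
  # number of cell pairs grouped together in both labelings
  sum_cells <- sum(choose(contingency, 2))
  # number of cell pairs grouped together within each labeling
  sum_rows <- sum(choose(rowSums(contingency), 2))
  sum_cols <- sum(choose(colSums(contingency), 2))
  expected <- sum_rows * sum_cols / choose(length(labels_a), 2)
  max_index <- (sum_rows + sum_cols) / 2
  (sum_cells - expected) / (max_index - expected)
}

# Toy labels for six cells; made up for illustration
before_integration <- c("T", "T", "B", "B", "Mono", "Mono")
after_integration  <- c("T", "T", "B", "Mono", "Mono", "Mono")

ari <- adjusted_rand_index(before_integration, after_integration)
print(ari)
```

A high value would suggest integration preserved the cell type structure, but it says nothing about how well batches are mixed, so it would need to be paired with a batch mixing metric.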

Alternatively, we can continue our current approach of testing across different tissue types, but expand it to consider entire datasets rather than a subset of samples. However, I still worry that we would be limited in how we can interpret those results, since we don't know the distribution of overlapping cell types or cell states and so can't say what we expect the integration to look like for each dataset.

Tagging @sjspielman @jashapiro @jaclyn-taroni @cbethell for any thoughts on this.

scpca-project-summary.tsv.zip

jaclyn-taroni commented 2 years ago

What if we also prioritize the one dataset that has a preprint associated with it (rhabdomyosarcoma)? There might be gene sets for the different cell types associated with the preprint itself.

AlexsLemonade / sc-data-integration

Create a menu of scpca projects to be considered for integration #159