AlexsLemonade / sc-data-integration

Create a menu of scpca projects to be considered for integration #159

Open allyhawkins opened 2 years ago

allyhawkins commented 2 years ago

As we are evaluating integration methods and testing subsets of datasets, it would be helpful to see the exact breakdown of ScPCA projects and how many libraries from each disease, 10X platform, etc. are included. Right now, we plan to integrate only data that are from the same project, disease type, sequencing unit, and 10X platform, so we want to see what that breakdown looks like. From there we may pick a few projects to test the integration workflow on, removing the python methods before testing.

We can use https://github.com/AlexsLemonade/sc-data-integration/issues/3#issuecomment-1130239566 as a reference for building this table.

allyhawkins commented 2 years ago

I've gone ahead and summarized the ScPCA projects based on the libraries that we have received thus far and that are currently accounted for in scpca-library-metadata.tsv and scpca-sample-metadata.tsv. I grouped all of the libraries by submitter, project ID, diagnosis, technology, and sequencing unit, and totaled how many libraries were found in each group. I attached a table that includes the entire summary, which shows the distribution of libraries for each project. For the most part, each project included one main group, where the majority of samples could be attributed to a single diagnosis, technology, and seq unit, plus a handful of other libraries that are a mix of other types.

I took the larger table and filtered it to show only the top two diagnosis/tech/seq unit groups per project, keeping only groups with more than two libraries. That is the table shown here, and I think it is what we would want to start with. I did not include libraries that were used in multiplexing and only considered single-cell/single-nuclei samples. Additionally, projects with NA for the ID are projects we have not processed yet.

| submitter | scpca_project_id | diagnosis | seq_unit | n | technology |
|---|---|---|---|---|---|
| collins | NA | Osteosarcoma | nucleus | 10 | 10Xv2_5prime |
| dyer_chen | SCPCP000004 | Neuroblastoma | nucleus | 14 | 10Xv3.1 |
| dyer_chen | SCPCP000004 | Neuroblastoma | cell | 5 | 10Xv2 |
| dyer_chen | SCPCP000005 | Rhabdomyosarcoma | nucleus | 27 | 10Xv3 |
| dyer_chen | SCPCP000005 | Rhabdomyosarcoma | cell | 13 | 10Xv2 |
| dyer_chen | NA | Retinoblastoma | cell | 28 | 10Xv2 |
| dyer_chen | NA | Retinoblastoma | nucleus | 7 | 10Xv3.1 |
| gawad | SCPCP000007 | Acute myeloid leukemia | cell | 26 | 10Xv2_5prime, CITEseq_10Xv2 |
| green_mulcahy_levy | SCPCP000001 | Glioblastoma | cell | 16 | 10Xv3 |
| green_mulcahy_levy | SCPCP000002 | Pilocytic astrocytoma | cell | 18 | 10Xv3 |
| green_mulcahy_levy | SCPCP000002 | Ganglioglioma | cell | 5 | 10Xv3 |
| mullighan | SCPCP000008 | B-cell acute lymphoblastic leukemia | cell | 94 | 10Xv2_5prime |
| mullighan | SCPCP000008 | Mixed phenotype acute leukemia | cell | 6 | 10Xv2_5prime |
| murphy_chen | SCPCP000006 | Wilms tumor | nucleus | 40 | 10Xv3.1 |
| pugh | NA | Low-grade glioma/astrocytoma (WHO grade I/II) | nucleus | 22 | 10Xv2_5prime |
| pugh | NA | Ganglioglioma | nucleus | 9 | 10Xv2_5prime |
| teachey_tan | SCPCP000003 | Early T-cell precursor T-cell acute lymphoblastic leukemia | cell | 31 | 10Xv3, CITEseq_10Xv3 |
| teachey_tan | SCPCP000003 | Non-early T-cell precursor T-cell acute lymphoblastic leukemia | cell | 11 | 10Xv3, CITEseq_10Xv3 |

Below is the code used to generate both this abbreviated table and the attached complete table:

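# paths to the metadata files; the file names come from the summary above,
# but their location is an assumption, so adjust as needed
library_metadata_file <- "scpca-library-metadata.tsv"
sample_metadata_file <- "scpca-sample-metadata.tsv"

# read in sample and library metadata and filter to relevant columns
library_df <- readr::read_tsv(library_metadata_file) |>
  dplyr::select(scpca_sample_id, scpca_library_id, technology, seq_unit)
sample_df <- readr::read_tsv(sample_metadata_file) |>
  dplyr::select(scpca_sample_id, scpca_project_id, submitter, diagnosis)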

# get list of library ids that are hashed
cell_hash_libraries <- library_df |>
  dplyr::filter(technology %in% "cellhash_10Xv3.1") |>
  dplyr::pull(scpca_library_id)

# combine library info with sample data and filter to only single-cell/nucleus
all_info_df <- library_df |>
  # join explicitly on the one column shared by both tables
  dplyr::left_join(sample_df, by = "scpca_sample_id") |>
  dplyr::filter(seq_unit %in% c("cell", "nucleus"),
                !scpca_library_id %in% cell_hash_libraries)

summarized_info <- all_info_df |>
  # count total number of libraries
  dplyr::count(submitter, scpca_project_id, diagnosis, technology, seq_unit) |>
  # combine cite-seq with its counterpart library (matched on equal n)
  # so that it does not show up as a duplicate row
  dplyr::group_by(submitter, scpca_project_id, diagnosis, seq_unit, n) |>
  dplyr::summarise(technology = paste(unique(technology), collapse = ", "),
                   .groups = "drop") |>
  # sort by submitter/project and then total libraries
  dplyr::group_by(submitter, scpca_project_id) |>
  dplyr::arrange(dplyr::desc(n), .by_group = TRUE)

# filter to only top 2 groups per submitter/project and number of libraries > 2
# (summarized_info is still grouped by submitter and project, so slice() acts per group)
filtered_summarized_info <- summarized_info |>
  dplyr::slice(1:2) |>
  dplyr::filter(n > 2)

# write out full table 
readr::write_tsv(summarized_info, "scpca-project-summary.tsv")

# write out filtered table
readr::write_tsv(filtered_summarized_info, "scpca-filtered-project-summary.tsv")
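
To make the behavior of the summary steps easier to check, here is a minimal sketch that runs the same count/collapse/top-two logic on a small in-memory table. All values in `toy_groups` (submitter names, project IDs, diagnoses, and library counts) are made up for illustration and are not ScPCA data; `toy_groups`, `toy_libraries`, `toy_summary`, and `toy_top` are hypothetical names.

```r
# hypothetical groups; n_libs is the number of libraries to simulate per group
toy_groups <- tibble::tribble(
  ~submitter,    ~scpca_project_id, ~diagnosis,    ~seq_unit, ~technology,     ~n_libs,
  "submitter_a", "PROJ1",           "diagnosis_x", "cell",    "10Xv3",         3,
  "submitter_a", "PROJ1",           "diagnosis_x", "cell",    "CITEseq_10Xv3", 3,
  "submitter_a", "PROJ1",           "diagnosis_y", "cell",    "10Xv3",         4,
  "submitter_a", "PROJ1",           "diagnosis_z", "nucleus", "10Xv3.1",       1,
  "submitter_b", "PROJ2",           "diagnosis_x", "nucleus", "10Xv2",         5
)

# expand to one row per library, mimicking the joined library/sample table
toy_libraries <- toy_groups[rep(seq_len(nrow(toy_groups)), toy_groups$n_libs), ] |>
  dplyr::select(-n_libs)

toy_summary <- toy_libraries |>
  # count libraries in each group
  dplyr::count(submitter, scpca_project_id, diagnosis, technology, seq_unit) |>
  # collapse technologies that share the same group and the same library count
  dplyr::group_by(submitter, scpca_project_id, diagnosis, seq_unit, n) |>
  dplyr::summarise(technology = paste(unique(technology), collapse = ", "),
                   .groups = "drop") |>
  # sort groups within each submitter/project by number of libraries
  dplyr::group_by(submitter, scpca_project_id) |>
  dplyr::arrange(dplyr::desc(n), .by_group = TRUE)

# keep the top 2 groups per submitter/project with more than 2 libraries
toy_top <- toy_summary |>
  dplyr::slice(1:2) |>
  dplyr::filter(n > 2)

print(toy_top)
```

In this toy input, the CITE-seq row merges with its counterpart only because both have the same number of libraries; if the counts differed, the two technologies would stay on separate rows. The single-library `diagnosis_z` group is dropped from `toy_top`.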

Some initial thoughts when looking at this and based on our conversation yesterday... I think we should consider testing one or more datasets where we may be able to validate integration with cell type assignments. We could probably pick any of the blood datasets to do this, but we have already started doing a lot with the Gawad data, and it looks to represent the average number of libraries we are going to have. So we might consider starting with that dataset and doing automatic cell type annotation before and after integration. If we did this, we would probably also want to try multiple blood projects to make sure results are consistent. But I know that cell type annotation comes with its own set of challenges.

Alternatively, we can continue our current approach of testing across different tissue types, but expand to consider entire datasets rather than a subset of samples. However, I still worry that we will be limited in how we can interpret those results without fully understanding what we expect the integration to look like for each dataset, since we don't know the distribution of overlapping cell types or cell states.

Tagging @sjspielman @jashapiro @jaclyn-taroni @cbethell for any thoughts on this.

scpca-project-summary.tsv.zip

jaclyn-taroni commented 2 years ago

What if we also prioritize the one dataset that has a preprint associated with it (rhabdomyosarcoma)? There might be gene sets for the different cell types associated with the preprint itself.