Open allyhawkins opened 2 years ago
I've gone ahead and summarized the ScPCA projects based on the libraries that we have received thus far and that are currently accounted for in `scpca-library-metadata.tsv` and `scpca-sample-metadata.tsv`. I grouped all of the libraries by submitter, project ID, diagnosis, technology, and sequencing unit and totaled how many libraries were found in each group. I attached a table with the entire summary, which shows the distribution of libraries for each project. For the most part, each project included one main group where the majority of samples could be attributed to one main diagnosis, technology, and seq unit, plus a handful of other libraries that are a mix of other types.
I took the larger table and filtered it to only show the top two groups of libraries (diagnosis/tech/seq unit combinations) per project, keeping only groups with more than two libraries. That is the table that's shown here and, I think, what we would want to start with. I did not include libraries that were used in multiplexing and only considered single-cell/single-nuclei samples. Additionally, if a project has `NA` for the ID, that is a project we have not processed yet.
submitter | scpca_project_id | diagnosis | seq_unit | n | technology |
---|---|---|---|---|---|
collins | NA | Osteosarcoma | nucleus | 10 | 10Xv2_5prime |
dyer_chen | SCPCP000004 | Neuroblastoma | nucleus | 14 | 10Xv3.1 |
dyer_chen | SCPCP000004 | Neuroblastoma | cell | 5 | 10Xv2 |
dyer_chen | SCPCP000005 | Rhabdomyosarcoma | nucleus | 27 | 10Xv3 |
dyer_chen | SCPCP000005 | Rhabdomyosarcoma | cell | 13 | 10Xv2 |
dyer_chen | NA | Retinoblastoma | cell | 28 | 10Xv2 |
dyer_chen | NA | Retinoblastoma | nucleus | 7 | 10Xv3.1 |
gawad | SCPCP000007 | Acute myeloid leukemia | cell | 26 | 10Xv2_5prime, CITEseq_10Xv2 |
green_mulcahy_levy | SCPCP000001 | Glioblastoma | cell | 16 | 10Xv3 |
green_mulcahy_levy | SCPCP000002 | Pilocytic astrocytoma | cell | 18 | 10Xv3 |
green_mulcahy_levy | SCPCP000002 | Ganglioglioma | cell | 5 | 10Xv3 |
mullighan | SCPCP000008 | B-cell acute lymphoblastic leukemia | cell | 94 | 10Xv2_5prime |
mullighan | SCPCP000008 | Mixed phenotype acute leukemia | cell | 6 | 10Xv2_5prime |
murphy_chen | SCPCP000006 | Wilms tumor | nucleus | 40 | 10Xv3.1 |
pugh | NA | Low-grade glioma/astrocytoma (WHO grade I/II) | nucleus | 22 | 10Xv2_5prime |
pugh | NA | Ganglioglioma | nucleus | 9 | 10Xv2_5prime |
teachey_tan | SCPCP000003 | Early T-cell precursor T-cell acute lymphoblastic leukemia | cell | 31 | 10Xv3, CITEseq_10Xv3 |
teachey_tan | SCPCP000003 | Non-early T-cell precursor T-cell acute lymphoblastic leukemia | cell | 11 | 10Xv3, CITEseq_10Xv3 |
Below is the code used to generate this abbreviated table and then the attached complete table:
```r
# read in sample and library metadata and filter to relevant columns
# (library_metadata_file and sample_metadata_file are the paths to the two metadata TSVs)
library_df <- readr::read_tsv(library_metadata_file) |>
  dplyr::select(scpca_sample_id, scpca_library_id, technology, seq_unit)

sample_df <- readr::read_tsv(sample_metadata_file) |>
  dplyr::select(scpca_sample_id, scpca_project_id, submitter, diagnosis)

# get list of library ids that are hashed
cell_hash_libraries <- library_df |>
  dplyr::filter(technology %in% "cellhash_10Xv3.1") |>
  dplyr::pull(scpca_library_id)

# combine library info with sample data and filter to only single-cell/nucleus
all_info_df <- library_df |>
  dplyr::left_join(sample_df, by = "scpca_sample_id") |>
  dplyr::filter(
    seq_unit %in% c("cell", "nucleus"),
    !scpca_library_id %in% cell_hash_libraries
  )

summarized_info <- all_info_df |>
  # count total number of libraries
  dplyr::count(submitter, scpca_project_id, diagnosis, technology, seq_unit) |>
  # combine cite-seq with counterpart library so as to not show up as duplicate row
  dplyr::group_by(submitter, scpca_project_id, diagnosis, seq_unit, n) |>
  dplyr::summarise(
    technology = paste(unique(technology), collapse = ", "),
    .groups = "drop"
  ) |>
  # sort by submitter and then total libraries
  dplyr::group_by(submitter, scpca_project_id) |>
  dplyr::arrange(dplyr::desc(n), .by_group = TRUE)

# filter to only top 2 groups per submitter/project and number of libraries > 2
filtered_summarized_info <- summarized_info |>
  dplyr::slice(1:2) |>
  dplyr::filter(n > 2)

# write out full table
readr::write_tsv(summarized_info, "scpca-project-summary.tsv")

# write out filtered table
readr::write_tsv(filtered_summarized_info, "scpca-filtered-project-summary.tsv")
```
Some initial thoughts from looking at this and from our conversation yesterday: I think we should consider testing one or more datasets where we may be able to validate integration with cell type assignments. We could probably pick any of the blood datasets to do this, but we have already started doing a lot with the Gawad data, and its number of libraries looks about average for what we are going to have. So we might consider doing automatic cell type annotation before and after integration and testing integration starting with that dataset. If we did this, we would probably also want to try multiple blood projects to make sure results are consistent. But I know that cell type annotation comes with its own set of challenges.
Alternatively, we can continue our current approach of testing across different tissue types but expand to consider entire datasets rather than a subset of samples. However, I still worry about being limited in how we can interpret those results without fully understanding what we expect the integration to look like for each dataset, since we don't know the distribution of overlapping cell types or cell states.
Tagging @sjspielman @jashapiro @jaclyn-taroni @cbethell for any thoughts on this.
What if we also prioritize the one dataset that has a preprint associated with it (rhabdomyosarcoma)? There might be gene sets for the different cell types associated with the preprint itself.
As we are evaluating integration methods and testing subsets of datasets, it would be helpful to see the exact breakdown of ScPCA projects and how many libraries of each disease, 10X platform, etc. are included. Right now we plan to integrate only data that are from the same project, disease type, sequencing unit, and 10X platform, so we want to see what that breakdown looks like. From there we may pick a few projects to test the integration workflow on, removing the Python methods before testing.
We can use https://github.com/AlexsLemonade/sc-data-integration/issues/3#issuecomment-1130239566 as a reference for building this table.
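As a rough sketch of what "same project, disease type, sequencing unit, and 10X platform" could look like in practice, the snippet below assigns each library to a candidate integration group. The toy data frame, its IDs, and the `integration_group` and `n_libraries` column names are all made up for illustration; only the four grouping columns mirror the metadata columns discussed in this thread, and this is not necessarily how the final workflow would define groups.

```r
# toy library-level metadata; all IDs and values here are hypothetical
toy_libraries <- data.frame(
  scpca_library_id = c("lib01", "lib02", "lib03", "lib04", "lib05"),
  scpca_project_id = c("projA", "projA", "projA", "projB", "projB"),
  diagnosis = c("dx1", "dx1", "dx1", "dx2", "dx2"),
  seq_unit = c("cell", "cell", "nucleus", "cell", "cell"),
  technology = c("10Xv3", "10Xv3", "10Xv3", "10Xv2", "10Xv2")
)

# libraries that share all four variables land in the same candidate group
integration_groups <- toy_libraries |>
  dplyr::group_by(scpca_project_id, diagnosis, seq_unit, technology) |>
  dplyr::mutate(
    # arbitrary numeric label for each unique combination (hypothetical column)
    integration_group = dplyr::cur_group_id(),
    # number of libraries available in that combination (hypothetical column)
    n_libraries = dplyr::n()
  ) |>
  dplyr::ungroup()
```

In this toy example `lib01` and `lib02` end up in the same group, while `lib03` is split out because it is single-nucleus. Groups with only one library (like the `lib03` group) would have nothing to integrate with, which is the kind of thing the full breakdown table should make visible.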