Hello,
Yes, I selected only a subset of the data (I will call it the training set for convenience) and held out the cells whose "batch" label is "s2d4" as a test set. The test set was held out for the imputation experiment. If you are using Python and anndata, you can use these two lines to reproduce the subset for each data set (GEX+ADT and GEX+ATAC):
```python
# `gex` is the AnnData object for the GEX modality, loaded beforehand
train_obs_names = gex.obs_names[gex.obs["batch"] != "s2d4"]
test_obs_names = gex.obs_names[gex.obs["batch"] == "s2d4"]
```
I am also confused by the donor numbers in this data set. The data set was originally published for the Openproblems_neurips2021 competition. The competition website says 12 donors, which differs from what is written in the paper (10 donors).
I just checked the GEO page and found this:
So GEX+ADT has 9 donors in total (8 donors without batch == "s2d4"), and GEX+ATAC has 10 donors in total (9 donors without batch == "s2d4").
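If you want to check the donor counts yourself, here is a minimal sketch. It assumes the batch labels follow the pattern "s<site>d<donor>" (so "s2d4" would be site 2, donor 4), which is my reading of the labels and not something stated on the GEO page. The helper name `donors_in` and the toy labels are made up for illustration; they are not the real batches of GSE194122.

```python
import re

# Assumption: batch labels look like "s<site>d<donor>", e.g. "s2d4" = site 2, donor 4.
BATCH_PATTERN = re.compile(r"^s(\d+)d(\d+)$")


def donors_in(batches, exclude=()):
    """Return the set of donor IDs found in `batches`, skipping excluded batch labels."""
    donors = set()
    for label in batches:
        if label in exclude:
            continue
        match = BATCH_PATTERN.match(label)
        if match is None:
            raise ValueError(f"unexpected batch label: {label!r}")
        donors.add(int(match.group(2)))  # group 2 is the donor number
    return donors


# Toy labels for illustration only (hypothetical, not the real data).
toy_batches = ["s1d1", "s1d2", "s2d1", "s2d4", "s3d6"]

all_donors = donors_in(toy_batches)
train_donors = donors_in(toy_batches, exclude={"s2d4"})
print(len(all_donors), len(train_donors))
```

With the real data you could pass `gex.obs["batch"].unique()` instead of `toy_batches`. Note that dropping "s2d4" only lowers the donor count if that donor does not appear in any other batch.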
Hope this is helpful for you!
Thank you very much for your answer. I will look into it further. Thank you!
The GSE194122 dataset

The data sets are mentioned in two places, one says eight donors and the other says nine donors. The GEO website seems to have a total of 10 donors in this data set. Did you make a selection? Could you please provide the complete data set of the experiment?

We validated the usability of scMaui in these regards using single-cell RNA-seq and ATAC-seq from healthy bone marrow samples (GSE194122 [25]). Here, we used 63,138 cells provided by 9 donors and processed in 4 different sites. 10 populations including 22 subpopulations were

The human bone marrow dataset (GSE194122) provides very detailed cell-type labels (45 labels for RNA-seq and ADTs data, 22 labels for RNA-seq and ATAC-seq data). Thus, we further coarsely annotated each cell by grouping given cell-type labels. In this study, the provided cell-type labels and the new group annotations are referred to as subpopulation and population, respectively. Supplementary Tables 1 and 2 show the annotation matches of population and subpopulation in individual datasets. For the benchmarking, we used 84,677 cells in total including 11 populations which comprise 45 subpopulations. These were collected from 8 different donors and processed in 4 sites, creating a multiple batch effect landscape representative of many real-world scenarios.