chenlingantelope / MSscRNAseq2019

Analysis for 2019 submission "Integrated single cell analysis of blood and cerebrospinal fluid leukocytes in multiple sclerosis", Schafflick, Xu, Hartlehnert et al.
MIT License

Datasets.ipynb: The logic of converting raw data to "all_datasets.pkl" #2

Closed qiaochen closed 3 years ago

qiaochen commented 4 years ago

Hi, I am trying to execute the code, but I am stuck on the Datasets.ipynb notebook. Could you please elaborate on how the raw data are arranged in the folder?

Are the CSF datasets merged into one single file? How is the file "/data/yosef2/users/chenling/CSF/CSF_data/celltypes.txt" generated?

Thanks!

LuShuYangMing commented 4 years ago

And where is the cell disease state that is mentioned in the README?

inuritdino commented 4 years ago

I agree that the merging procedure is not described, and it is really hard to reproduce from the raw data (in GEO). Maybe you could provide us with the all_datasets.pkl or all_data.mtx files so that we can get the annotation right. On the other hand, the disease state is fully traceable from the sample names ("MS" vs. "PTC/PST" patterns) and the isMS/isCSF variables (given that the full/merged dataset is provided).

Would you be able to kindly provide us with further instructions? Thanks in advance!
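For readers following along, the labelling rule mentioned in this comment (an "MS" prefix versus a "PTC"/"PST" prefix, and CSF versus blood) might be sketched as below. This is only an illustration: `label_sample` is a hypothetical helper, the assumption that sample names end in the tissue name is taken from the sample names quoted in this thread, and the notebook itself may compute isMS/isCSF differently.

```python
# Hypothetical helper, not taken from the repository.
KNOWN_PREFIXES = ("MS", "PTC", "PST")  # prefixes seen in this thread


def label_sample(sample_name: str) -> dict:
    """Derive isMS / isCSF flags from a sample name such as 'MS19270_CSF'."""
    if not sample_name.startswith(KNOWN_PREFIXES):
        raise ValueError(f"unrecognised sample name: {sample_name!r}")
    return {
        "sample": sample_name,
        # Assumption: every donor whose ID starts with "MS" is an MS patient.
        "isMS": sample_name.startswith("MS"),
        # Assumption: the name ends with the tissue, "CSF" or "PBMCs".
        "isCSF": sample_name.endswith("CSF"),
    }


if __name__ == "__main__":
    for name in ("MS19270_CSF", "PTC32190_PBMCs"):
        print(label_sample(name))
```

Checking only the prefix and suffix keeps the sketch tolerant of small differences in how the donor ID and tissue are joined.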

chenlingantelope commented 3 years ago

├── MS19270
│   ├── CSF
│   │   └── GRCh38
│   │       ├── barcodes.tsv
│   │       ├── genes.tsv
│   │       └── matrix.mtx
│   └── PBMCs
│       └── GRCh38
│           ├── barcodes.tsv
│           ├── genes.tsv
│           └── matrix.mtx
├── MS49131
│   ├── CSF
│   │   └── GRCh38
│   │       ├── barcodes.tsv
│   │       ├── genes.tsv
│   │       └── matrix.mtx
│   └── PBMCs
│       └── GRCh38
│           ├── barcodes.tsv
│           ├── genes.tsv
│           └── matrix.mtx
├── MS58637
│   └── CSF
│       └── GRCh38
│           ├── barcodes.tsv
│           ├── genes.tsv
│           └── matrix.mtx
├── MS60249
│   ├── CSF
│   │   └── GRCh38
│   │       ├── barcodes.tsv
│   │       ├── genes.tsv
│   │       └── matrix.mtx
│   └── PBMCs
│       └── GRCh38
│           ├── barcodes.tsv
│           ├── genes.tsv
│           └── matrix.mtx
├── MS71658
│   ├── CSF
│   │   └── GRCh38
│   │       ├── barcodes.tsv
│   │       ├── genes.tsv
│   │       └── matrix.mtx
│   └── PBMCs
│       └── GRCh38
│           ├── barcodes.tsv
│           ├── genes.tsv
│           └── matrix.mtx
├── MS74594
│   ├── CSF
│   │   └── GRCh38
│   │       ├── barcodes.tsv
│   │       ├── genes.tsv
│   │       └── matrix.mtx
│   └── PBMCs
│       └── GRCh38
│           ├── barcodes.tsv
│           ├── genes.tsv
│           └── matrix.mtx
├── PST45044
│   └── CSF
│       └── GRCh38
│           ├── barcodes.tsv
│           ├── genes.tsv
│           └── matrix.mtx
├── PST83775
│   ├── CSF
│   │   └── GRCh38
│   │       ├── barcodes.tsv
│   │       ├── genes.tsv
│   │       └── matrix.mtx
│   └── PBMCs
│       └── GRCh38
│           ├── barcodes.tsv
│           ├── genes.tsv
│           └── matrix.mtx
├── PST95809
│   ├── CSF
│   │   └── GRCh38
│   │       ├── barcodes.tsv
│   │       ├── genes.tsv
│   │       └── matrix.mtx
│   └── PBMCs
│       └── GRCh38
│           ├── barcodes.tsv
│           ├── genes.tsv
│           └── matrix.mtx
├── PTC32190
│   ├── CSF
│   │   └── GRCh38
│   │       ├── barcodes.tsv
│   │       ├── genes.tsv
│   │       └── matrix.mtx
│   └── PBMCs
│       └── GRCh38
│           ├── barcodes.tsv
│           ├── genes.tsv
│           └── matrix.mtx
├── PTC41540
│   ├── CSF
│   │   └── GRCh38
│   │       ├── barcodes.tsv
│   │       ├── genes.tsv
│   │       └── matrix.mtx
│   └── PBMCs
│       └── GRCh38
│           ├── barcodes.tsv
│           ├── genes.tsv
│           └── matrix.mtx
└── PTC85037
    ├── CSF
    │   └── GRCh38
    │       ├── barcodes.tsv
    │       ├── genes.tsv
    │       └── matrix.mtx
    └── PBMCs
        └── GRCh38
            ├── barcodes.tsv
            ├── genes.tsv
            └── matrix.mtx
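Before running the notebook, it may help to check that a local copy matches the directory layout listed above. The following is a minimal sketch under stated assumptions: `missing_files` and `EXPECTED_FILES` are hypothetical names introduced here, and the root folder is whatever directory holds the per-donor folders.

```python
from pathlib import Path

# The three files that appear under every GRCh38 folder in the listing.
EXPECTED_FILES = ("barcodes.tsv", "genes.tsv", "matrix.mtx")


def missing_files(root, samples):
    """Return the expected files that are absent under `root`.

    `samples` is an iterable of (donor, tissue) pairs,
    e.g. ("MS19270", "CSF") or ("MS19270", "PBMCs").
    """
    missing = []
    for donor, tissue in samples:
        sample_dir = Path(root) / donor / tissue / "GRCh38"
        for name in EXPECTED_FILES:
            path = sample_dir / name
            if not path.is_file():
                missing.append(path)
    return missing
```

Note that, per the listing, MS58637 and PST45044 have a CSF folder only, so the list of pairs passed in should not include PBMCs for those two donors.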

chenlingantelope commented 3 years ago

Above is the directory structure expected by Datasets.ipynb. The barcodes, genes, and matrix files are in the Supplementary Files of the GEO accession. Since GEO does not accept identical file names, the files are named by their full path: essentially, I replaced every "/" with "_".
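Reversing that renaming could look roughly like the sketch below, which maps a flattened file name back to its nested path. The exact GEO file names are not given in this thread, so the pattern is an assumption: it expects `<donor>_<tissue>_GRCh38_<file>`, optionally preceded by some prefix and followed by `.gz`. `nested_path` is a hypothetical helper, not part of the repository.

```python
import re
from pathlib import PurePosixPath

# Assumed name shape: [prefix_]<donor>_<tissue>_GRCh38_<file>[.gz]
_FLAT_NAME = re.compile(
    r"(?:^|_)(?P<donor>(?:MS|PST|PTC)\d+)"
    r"_(?P<tissue>CSF|PBMCs)"
    r"_(?P<genome>GRCh38)"
    r"_(?P<file>barcodes\.tsv|genes\.tsv|matrix\.mtx)"
    r"(?P<gz>\.gz)?$"
)


def nested_path(flat_name: str) -> PurePosixPath:
    """Map e.g. 'MS19270_CSF_GRCh38_barcodes.tsv' to its nested path."""
    match = _FLAT_NAME.search(flat_name)
    if match is None:
        raise ValueError(f"unexpected file name: {flat_name!r}")
    # Keep a trailing ".gz" so the result still describes the file honestly;
    # compressed files would need decompressing before the notebook reads them.
    filename = match["file"] + (match["gz"] or "")
    return PurePosixPath(match["donor"], match["tissue"], match["genome"], filename)


if __name__ == "__main__":
    print(nested_path("MS19270_CSF_GRCh38_barcodes.tsv"))
```

Matching on the known donor prefixes and tissue names, rather than blindly replacing every "_" with "/", avoids breaking names if a prefix or suffix was added on upload.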

chenlingantelope commented 3 years ago

'PTC32190_CSF', 'PTC41540_CSF', 'PST45044_CSF', 'PTC85037_CSF', 'MS58637_CSF', 'MS19270CSF', 'MS71658CSF', 'MS49131CSF' are the samples used in the CSF comparisons.