Teichlab / celltypist

A tool for semi-automatic cell type classification
https://www.celltypist.org/
MIT License

raw training dataset #39

Closed mjstrumillo closed 1 year ago

mjstrumillo commented 1 year ago

is it possible to obtain the raw training dataset?

ChuanXu1 commented 1 year ago

@mjstrumillo, the latest raw training data is here

mjstrumillo commented 1 year ago

thank you!

mjstrumillo commented 1 year ago

Hi, me again with more questions. In the raw dataset that you shared with me there is one column for version (1 and 2). Version 1 consists of:

{'Bjorklund et al. 2016': 646,
 'Braga et al. 2019': 8694,
 'HCA Immune 2018': 82864,
 'James et al. 2020': 25232,
 'Madissoon et al. 2020': 37937,
 'Martin et al. 2019': 5892,
 'Miller et al. 2020': 986,
 'Miragaia et al. 2019': 288,
 'Park et al. 2020': 194146,
 'Popescu et al. 2019': 44116,
 'Smillie et al. 2019': 5275,
 'Stewart et al. 2019': 3193,
 'Szabo et al. 2019': 598,
 'Vento-Tormo et al. 2018': 23176,
 'Voigt et al. 2019': 468,
 'Zhang et al. 2018': 1915,
 'Zheng et al. 2017': 6313}

Version 2: {'Dominguez Conde et al. 2022': 233869}. Dominguez Conde et al. is the paper that announces those 300k cells, so I'm assuming this is the latest version of the training dataset. Am I correct?

There's also this file available, https://celltypist.cog.sanger.ac.uk/cellxgene/CellTypist_training_data_for_cellxgene.h5ad, which consists of:

{'Braga et al. 2019': 8928,
 'HCA Immune 2018': 281995,
 'Jaitin et al. 2019': 2562,
 'James et al. 2020': 29289,
 'Li et al. 2019': 1448,
 'Madissoon et al. 2020': 56421,
 'Martin et al. 2019': 12698,
 'Miller et al. 2020': 2615,
 'Miragaia et al. 2019': 1094,
 'Park et al. 2020': 206788,
 'Popescu et al. 2019': 46004,
 'Smillie et al. 2019': 11378,
 'Stewart et al. 2019': 5518,
 'Szabo et al. 2019': 16355,
 'Vento-Tormo et al. 2018': 26244,
 'Voigt et al. 2019': 755,
 'Zhang et al. 2018': 5980,
 'Zheng et al. 2017': 21928}

The main differences are that it doesn't have Dominguez Conde et al., has way more HCA cells, includes Li and Jaitin, and has different numbers for most of the datasets. Is this an older version of the training dataset, e.g. is this what the 300k cells were assigned on?
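A minimal sketch of this comparison, using only the per-dataset counts listed in this comment. The names `raw_v1`, `cellxgene_file` and `compare_counts` are illustrative and not part of CellTypist.

```python
# Per-dataset cell counts copied from the two listings in this comment.
raw_v1 = {
    "Bjorklund et al. 2016": 646, "Braga et al. 2019": 8694,
    "HCA Immune 2018": 82864, "James et al. 2020": 25232,
    "Madissoon et al. 2020": 37937, "Martin et al. 2019": 5892,
    "Miller et al. 2020": 986, "Miragaia et al. 2019": 288,
    "Park et al. 2020": 194146, "Popescu et al. 2019": 44116,
    "Smillie et al. 2019": 5275, "Stewart et al. 2019": 3193,
    "Szabo et al. 2019": 598, "Vento-Tormo et al. 2018": 23176,
    "Voigt et al. 2019": 468, "Zhang et al. 2018": 1915,
    "Zheng et al. 2017": 6313,
}
cellxgene_file = {
    "Braga et al. 2019": 8928, "HCA Immune 2018": 281995,
    "Jaitin et al. 2019": 2562, "James et al. 2020": 29289,
    "Li et al. 2019": 1448, "Madissoon et al. 2020": 56421,
    "Martin et al. 2019": 12698, "Miller et al. 2020": 2615,
    "Miragaia et al. 2019": 1094, "Park et al. 2020": 206788,
    "Popescu et al. 2019": 46004, "Smillie et al. 2019": 11378,
    "Stewart et al. 2019": 5518, "Szabo et al. 2019": 16355,
    "Vento-Tormo et al. 2018": 26244, "Voigt et al. 2019": 755,
    "Zhang et al. 2018": 5980, "Zheng et al. 2017": 21928,
}


def compare_counts(a, b):
    """Return datasets only in a, only in b, and b - a for shared datasets."""
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    delta = {name: b[name] - a[name] for name in sorted(set(a) & set(b))}
    return only_a, only_b, delta


only_raw, only_cellxgene, delta = compare_counts(raw_v1, cellxgene_file)
print("only in raw v1:", only_raw)
print("only in cellxgene file:", only_cellxgene)
for name, diff in delta.items():
    print(f"{name}: {diff:+d}")
```

With these numbers, every dataset present in both listings has more cells in the cellxgene file than in version 1 of the raw file.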

Sorry for the nitpicking, I'm just trying to understand the workflow here.

ChuanXu1 commented 1 year ago

@mjstrumillo, the latter one was the old training dataset. The former (with raw counts) is the most up-to-date training dataset. In the new version, I removed some low-quality or low-confidence cells/datasets, and added Dominguez Conde et al. 2022.

mjstrumillo commented 1 year ago

:) It's me again. Thank you for all your answers, I really appreciate them. OK, so looking at the CellIDs between the previous dataset and the new dataset, in the Popescu subset there are only 23 cells with the same CellID:

['4834STDY7002878_GGTGCGTTCGTACGGC-1-popescu19',
 'FCAImmP7179363_AAACGGGCAATAACGA-1-popescu19',
 '4834STDY7002878_GGCTCGACAATGGTCT-1-popescu19',
 '4834STDY7002878_GTATTCTAGAAGGTTT-1-popescu19',
 '4834STDY7002877_AAACCTGGTATAGGTA-1-popescu19',
 '4834STDY7002877_AAGCCGCAGCACCGTC-1-popescu19',
 '4834STDY7002877_AAATGCCTCTTACCGC-1-popescu19',
 '4834STDY7002877_TGCCAAAGTGGCTCCA-1-popescu19',
 'FCAImmP7179363_AACTGGTAGATCTGCT-1-popescu19',
 '4834STDY7002877_CTTACCGTCATGTCTT-1-popescu19',
 '4834STDY7002882_AGATTGCTCCTCGCAT-1-popescu19',
 '4834STDY7002877_CGGACTGCATGAGCGA-1-popescu19',
 '4834STDY7002877_CATCAGATCACATAGC-1-popescu19',
 '4834STDY7038750_ACGCCAGGTGATGCCC-1-popescu19',
 '4834STDY7002877_AAGCCGCAGCTAAGAT-1-popescu19',
 '4834STDY7002878_AAATGCCCATCCGGGT-1-popescu19',
 '4834STDY7002877_ACATCAGTCACCCGAG-1-popescu19',
 '4834STDY7002878_AAACCTGCAGGAACGT-1-popescu19',
 '4834STDY7002877_CGTAGGCCATGCCTAA-1-popescu19',
 '4834STDY7002878_CAGCTGGAGGCTAGAC-1-popescu19',
 '4834STDY7002878_AAACCTGGTCTGCCAG-1-popescu19',
 '4834STDY7002877_ATCCACCGTGCAACGA-1-popescu19',
 '4834STDY7002877_AGTTGGTTCGGAGGTA-1-popescu19']

What I'm trying to ask is this: I took the Popescu cells from the old dataset, annotated them additionally, and did some further work on them. Now, because the new dataset is better annotated/filtered, I wanted to take those cells instead, but I assumed they should be pretty much the same cells, so I wanted to map my annotations back to V2. The original Popescu dataset is described as having "140,000 liver and 74,000 skin, kidney and yolk sac cells".

In your previous dataset you had 46004 cells from Popescu, in the new one 44116. Only 23 CellIDs match between those two datasets. Does that mean you pretty much exchanged all cells from Popescu, or did the CellIDs change in a specific manner? I intuitively thought you filtered down V1 and added Dominguez-Conde, but it seems like it's not just "filtered down" and more criteria were involved?

THANK YOU SO MUCH IN ADVANCE <3

ChuanXu1 commented 1 year ago

@mjstrumillo, the cell names should match; you only need some manipulation of the cell ID strings. When adding Dominguez-Conde et al. to V1, I did a cross-prediction and filtered out cells that were not confidently predicted, leading to a reduced set of higher-confidence cells.
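One possible form of that string manipulation, as a minimal sketch. It assumes the mismatch comes from stray whitespace around the IDs; the cell IDs, labels and function name below are hypothetical and only illustrate the idea.

```python
def map_labels(old_labels, new_ids):
    """Map labels keyed by old cell IDs onto new cell IDs.

    Both sides are compared after stripping surrounding whitespace.
    New IDs without a counterpart in old_labels map to None.
    """
    cleaned = {cell_id.strip(): label for cell_id, label in old_labels.items()}
    return {cell_id: cleaned.get(cell_id.strip()) for cell_id in new_ids}


# Hypothetical IDs and labels; note the whitespace in the old IDs.
old_labels = {
    "SAMPLE1_AAACCTGA-1-popescu19 ": "label_A",
    " SAMPLE1_AAACGGGT-1-popescu19": "label_B",
}
new_ids = [
    "SAMPLE1_AAACCTGA-1-popescu19",
    "SAMPLE1_AAACGGGT-1-popescu19",
    "SAMPLE2_TTTGCGCA-1-popescu19",
]

# Exact string comparison finds nothing in common.
naive_overlap = set(old_labels) & set(new_ids)

# After stripping, the first two IDs match.
mapped = map_labels(old_labels, new_ids)
matched = {cid: label for cid, label in mapped.items() if label is not None}

print(len(naive_overlap), "exact matches;", len(matched), "after stripping")
```

If the IDs differ in some other way (for example a changed prefix or suffix), the same pattern applies with a different normalization step in place of `strip()`.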

mjstrumillo commented 1 year ago

Thank you so much, yes, that was an easy fix. I forgot to reply to you: I just needed to strip() the names.

But here I go again. In V1 of the Dominguez-Conde data (https://www.tissueimmunecellatlas.org/) the annotations are slightly different from those in the training dataset. What I mean is that there are 4 columns (Predicted Labels CellTypist, Majority Voting CellTypist, Majority Voting CellTypist High and Manually curated celltype). I am trying to work out with the encyclopedia which one is which, but they are often inconsistent. Is the manually curated one the final one? Which one is the actual annotation of the highest quality? I tried to compare the ones common to V1 and V2 and understand what drives the final annotations, but no luck. This is totally a 2023 problem, so no rush on this, thank you!
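A minimal sketch of one way to compare two of these annotation columns: cross-tabulate the label pairs and list the pairs that disagree. The column names are the ones quoted above; the rows, label values and function names are hypothetical, and the sketch makes no claim about which column is the final annotation.

```python
from collections import Counter


def cross_tabulate(rows, col_a, col_b):
    """Count how often each (col_a label, col_b label) pair occurs."""
    return Counter((row[col_a], row[col_b]) for row in rows)


def disagreements(table):
    """Keep only the label pairs where the two columns differ."""
    return {pair: n for pair, n in table.items() if pair[0] != pair[1]}


COL_A = "Majority Voting CellTypist"
COL_B = "Manually curated celltype"

# Hypothetical per-cell annotations standing in for the real metadata table.
rows = [
    {COL_A: "label_1", COL_B: "label_1"},
    {COL_A: "label_1", COL_B: "label_2"},
    {COL_A: "label_2", COL_B: "label_2"},
    {COL_A: "label_1", COL_B: "label_1"},
]

table = cross_tabulate(rows, COL_A, COL_B)
mismatched = disagreements(table)
for (a, b), n in sorted(mismatched.items()):
    print(f"{COL_A}={a!r} vs {COL_B}={b!r}: {n} cell(s)")
```

Plain string equality is only meaningful between columns that use the same label vocabulary. For columns at different levels of granularity, the full cross-tabulation is more informative than the disagreement list.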

ChuanXu1 commented 1 year ago

@mjstrumillo, the data you got is the most updated one (v2).

maxim-h commented 1 year ago

@ChuanXu1 Thanks for all the great work. I was also curious about getting the training data for Immune_All models.

Unfortunately, it looks like the link you shared before is no longer working: http://celltypist.cog.sanger.ac.uk/rm/CellTypist_Immune_Reference_v2_count.h5ad

Was it perhaps moved or renamed?

ChuanXu1 commented 1 year ago

@maxim-h, I will put the raw count-related data in a stable location on the CellTypist website soon.

ChuanXu1 commented 1 year ago

@maxim-h, please use the same link to download it for now.

maxim-h commented 1 year ago

@ChuanXu1 Thank you!

wang-qf commented 1 year ago

Hi Xu, I have a question about the immune atlas training dataset. In the paper Dominguez Conde et al., Science 376, eabl5197 (2022), did you use the v1 or v2 reference dataset for training in Fig. 1?

ChuanXu1 commented 1 year ago

@wang-qf, v1