custom reference h5ad - GitHub Issues

Flu09 commented 7 months ago

I downloaded an h5ad reference from cellxgene. How can I use singler with it to annotate my samples?

jkanche commented 7 months ago

Most of the singler methods require a count matrix (in a gene X cell format) and a feature vector containing gene symbols.

As long as you extract these objects from the H5AD file, you should be able to run the annotate methods.

Let us know if you run into any issues.

Flu09 commented 6 months ago

Is there an example or vignette? It is not clear to me.

jkanche commented 6 months ago

I would suggest looking at the docs from anndata to read the H5AD file.

import anndata
data = anndata.read_h5ad(<PATH_TO_FILE>)

Then extract the features and the count matrix:

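features = data.var[<SYMBOL_COLUMN>]
# AnnData stores matrices as cells x genes, so transpose to get the gene x cell layout mentioned above
count_matrix = data.layers[<KEY_TO_COUNTS>].T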

Once you have these two objects, you can follow the instructions in the quick start from the README - https://github.com/BiocPy/singler?tab=readme-ov-file#quick-start.

Flu09 commented 6 months ago

I ran into the following error:

>>> built
<singler.build_single_reference.SinglePrebuiltReference object at 0x14d2253cf9a0>
>>> output = singler.classify_single_reference(
...     test_mat,
...     test_features=test_features,
...     ref_prebuilt=built,
... )
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/.virtualenvs/r-reticulate/lib64/python3.9/site-packages/singler/classify_single_reference.py", line 104, in classify_single_reference
    raise KeyError("failed to find gene '" + str(x) + "' in the test dataset")
KeyError: "failed to find gene 'BTBD11' in the test dataset"

LTLA commented 6 months ago

For starters, it's not clear what commands you actually ran; I suggest providing a minimal reproducible example.

In any case, the error is pretty self-explanatory: the gene BTBD11 is in the reference dataset but not in the test dataset, hence the classification fails. This shouldn't be a problem if you use annotate_single(), which only considers genes that are present in both the test and reference datasets. If you're using build_single_reference() and classify_single_reference() directly, you're responsible for handling the differences in the feature space.
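One way to handle the differences in the feature space yourself is to restrict both datasets to their shared genes before building the reference. The sketch below uses only plain Python lists; `common_gene_indices` is a hypothetical helper, not part of singler, and the gene names other than BTBD11 are made up.

```python
def common_gene_indices(test_features, ref_features):
    """Return the shared genes and their row positions in each dataset."""
    ref_pos = {}
    for i, gene in enumerate(ref_features):
        ref_pos.setdefault(gene, i)  # keep the first occurrence of duplicates

    shared, test_idx, ref_idx = [], [], []
    seen = set()
    for i, gene in enumerate(test_features):
        if gene in ref_pos and gene not in seen:
            seen.add(gene)
            shared.append(gene)
            test_idx.append(i)
            ref_idx.append(ref_pos[gene])
    return shared, test_idx, ref_idx


# Toy gene x cell matrices as lists of rows (one row per gene).
test_features = ["GENE_A", "GENE_C", "GENE_D"]
test_mat = [[1, 0], [2, 5], [0, 3]]
ref_features = ["BTBD11", "GENE_C", "GENE_A"]
ref_mat = [[9, 9, 9], [4, 0, 1], [7, 2, 0]]

shared, test_idx, ref_idx = common_gene_indices(test_features, ref_features)

# Keep only the shared genes, in the same order in both datasets.
test_sub = [test_mat[i] for i in test_idx]
ref_sub = [ref_mat[i] for i in ref_idx]
```

After this step BTBD11 is no longer in the reference, so a lookup for it in the test dataset cannot fail. With NumPy or SciPy matrices, the same index lists can be used to select rows, for example `test_mat[test_idx, :]`.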

Flu09 commented 6 months ago

I see, thank you so much. I used annotate_single() according to the documentation:

result = singler.annotate_single(test_data=test_mat, test_features=test_features, ref_data=matrix, ref_labels=ref_labels, ref_features=ref_features, num_threads=30)

Is there any explanation of the output and how to use it to annotate the cells?

Flu09 commented 6 months ago

result BiocFrame(data={'best': ['Deep-layer intratelencephalic', 'Hippocampal CA4', 'Cerebellar inhibitory', ..., 'Upper-layer intratelencephalic', 'Deep-layer intratelencephalic', 'MGE interneuron'], 'scores': BiocFrame(data={'Upper rhombic lip': array([0.27891899, 0.15571988, 0.19137761, ..., 0.2323916 , 0.29004175, 0.15923283]), 'Splatter': array([0.46681682, 0.20016138, 0.19682275, ..., 0.39786858, 0.47337265, 0.19662641]), 'Lower rhombic lip': array([0.46287853, 0.18958362, 0.16550371, ..., 0.37490354, 0.48360727, 0.17461614]), ..., 'Deep-layer intratelencephalic': array([0.74889269, 0.21071689, 0.18335148, ..., 0.5667374 , 0.73362752, 0.17355748]), 'Hippocampal dentate gyrus': array([0.56422535, 0.1891379 , 0.18225307, ..., 0.46425141, 0.56804178, 0.18210765]), 'Hippocampal CA4': array([0.58995576, 0.21592968, 0.16995776, ..., 0.43359463, 0.5855626 , 0.1722497 ])}, number_of_rows=3510, column_names=['Upper rhombic lip', 'Splatter', 'Lower rhombic lip', ..., 'Deep-layer intratelencephalic', 'Hippocampal dentate gyrus', 'Hippocampal CA4']), 'delta': array([0.10845515, 0.00922214, 0.06185248, ..., 0.00156091, 0.10489623, 0.00346725])}, number_of_rows=3510, column_names=['best', 'scores', 'delta'], metadata={'markers': {'Upper rhombic lip': {'Upper rhombic lip': [], 'Splatter': ['ENSG00000151789', 'ENSG00000081803', 'ENSG00000106069', ..., 'ENSG00000152822', 'ENSG00000154127', 'ENSG00000157890'], 'Lower rhombic lip': ['ENSG00000151789', 'ENSG00000184672', 'ENSG00000081803', ..., 'ENSG00000257242', 'ENSG00000257923', 'ENSG00000275342'], ..., 'Deep-layer intratelencephalic': ['ENSG00000151789', 'ENSG00000184408', 'ENSG00000168843', ..., 'ENSG00000205683', 'ENSG00000228566', 'ENSG00000232046'], 'Hippocampal dentate gyrus': ['ENSG00000151789', 'ENSG00000081803', 'ENSG00000168843', ..., 'ENSG00000152127', 'ENSG00000152270', 'ENSG00000154127'], 'Hippocampal CA4': ['ENSG00000151789', 'ENSG00000145526', 'ENSG00000168843', ..., 'ENSG00000172349', 'ENSG00000172572', 
'ENSG00000179104']}, 'Splatter': {'Upper rhombic lip': ['ENSG00000251562', 'ENSG00000224078', 'ENSG00000174469', ..., 'ENSG00000198712', 'ENSG00000255794', 'ENSG00000075151'], 'Splatter': [], 'Lower rhombic lip': ['ENSG00000184672', 'ENSG00000183117', 'ENSG00000169855', ..., 'ENSG00000130338', 'ENSG00000133083', 'ENSG00000133424'], ..., 'Deep-layer intratelencephalic': ['ENSG00000169855', 'ENSG00000152208', 'ENSG00000255794', ..., 'ENSG00000260232', 'ENSG00000004864', 'ENSG00000006704'], 'Hippocampal dentate gyrus': ['ENSG00000152208', 'ENSG00000174469', 'ENSG00000169855', ..., 'ENSG00000109265', 'ENSG00000109472', 'ENSG00000112232'], 'Hippocampal CA4': ['ENSG00000152208', 'ENSG00000176204', 'ENSG00000175497', ..., 'ENSG00000111640', 'ENSG00000117632', 'ENSG00000120885']}, 'Lower rhombic lip': {'Upper rhombic lip': ['ENSG00000251562', 'ENSG00000157168', 'ENSG00000174469', ..., 'ENSG00000245532', 'ENSG00000048740', 'ENSG00000107518'], 'Splatter': ['ENSG00000251562', 'ENSG00000157168', 'ENSG00000185008', ..., 'ENSG00000182348', 'ENSG00000185053', 'ENSG00000185420'], 'Lower rhombic lip': [], ..., 'Deep-layer intratelencephalic': ['ENSG00000251562', 'ENSG00000157168', 'ENSG00000185008', ..., 'ENSG00000249853', 'ENSG00000276644', 'ENSG00000286637'], 'Hippocampal dentate gyrus': ['ENSG00000251562', 'ENSG00000157168', 'ENSG00000185008', ..., 'ENSG00000184349', 'ENSG00000184611', 'ENSG00000197959'], 'Hippocampal CA4': ['ENSG00000157168', 'ENSG00000185008', 'ENSG00000179399', ..., 'ENSG00000172348', 'ENSG00000181072', 'ENSG00000226320']}, ..., 'Deep-layer intratelencephalic': {'Upper rhombic lip': ['ENSG00000251562', 'ENSG00000185774', 'ENSG00000078328', ..., 'ENSG00000169760', 'ENSG00000091129', 'ENSG00000132639'], 'Splatter': ['ENSG00000251562', 'ENSG00000185774', 'ENSG00000078328', ..., 'ENSG00000168959', 'ENSG00000182901', 'ENSG00000184156'], 'Lower rhombic lip': ['ENSG00000185774', 'ENSG00000183117', 'ENSG00000175497', ..., 'ENSG00000139970', 'ENSG00000145864', 
'ENSG00000149970'], ..., 'Deep-layer intratelencephalic': [], 'Hippocampal dentate gyrus': ['ENSG00000251562', 'ENSG00000153707', 'ENSG00000175497', ..., 'ENSG00000169744', 'ENSG00000182732', 'ENSG00000198010'], 'Hippocampal CA4': ['ENSG00000185774', 'ENSG00000153707', 'ENSG00000175497', ..., 'ENSG00000148516', 'ENSG00000152583', 'ENSG00000154975']}, 'Hippocampal dentate gyrus': {'Upper rhombic lip': ['ENSG00000251562', 'ENSG00000185774', 'ENSG00000183117', ..., 'ENSG00000197555', 'ENSG00000244128', 'ENSG00000273079'], 'Splatter': ['ENSG00000251562', 'ENSG00000185774', 'ENSG00000183117', ..., 'ENSG00000221866', 'ENSG00000115896', 'ENSG00000135298'], 'Lower rhombic lip': ['ENSG00000183117', 'ENSG00000150672', 'ENSG00000185774', ..., 'ENSG00000145934', 'ENSG00000150275', 'ENSG00000153956'], ..., 'Deep-layer intratelencephalic': ['ENSG00000181722', 'ENSG00000139220', 'ENSG00000154654', ..., 'ENSG00000184613', 'ENSG00000196730', 'ENSG00000197555'], 'Hippocampal dentate gyrus': [], 'Hippocampal CA4': ['ENSG00000185774', 'ENSG00000139220', 'ENSG00000176204', ..., 'ENSG00000134343', 'ENSG00000136895', 'ENSG00000139173']}, 'Hippocampal CA4': {'Upper rhombic lip': ['ENSG00000251562', 'ENSG00000185565', 'ENSG00000078328', ..., 'ENSG00000102466', 'ENSG00000182168', 'ENSG00000196628'], 'Splatter': ['ENSG00000251562', 'ENSG00000078328', 'ENSG00000175161', ..., 'ENSG00000116106', 'ENSG00000151474', 'ENSG00000066032'], 'Lower rhombic lip': ['ENSG00000251562', 'ENSG00000183715', 'ENSG00000183117', ..., 'ENSG00000183454', 'ENSG00000185046', 'ENSG00000253553'], ..., 'Deep-layer intratelencephalic': ['ENSG00000251562', 'ENSG00000185565', 'ENSG00000183715', ..., 'ENSG00000155974', 'ENSG00000164176', 'ENSG00000185274'], 'Hippocampal dentate gyrus': ['ENSG00000251562', 'ENSG00000183715', 'ENSG00000113448', ..., 'ENSG00000132639', 'ENSG00000185420', 'ENSG00000099250'], 'Hippocampal CA4': []}}, 'unique_markers': ['ENSG00000001630', 'ENSG00000002746', 'ENSG00000004864', ..., 
'ENSG00000286863', 'ENSG00000287290', 'ENSG00000287292']})

jkanche commented 6 months ago

The result is a BiocFrame (similar to a data frame) with one row per cell in your test matrix and three columns (@LTLA can correct me if I am wrong):

"best": the best label assigned to the cell
"scores": a nested BiocFrame with one column per reference label, holding each cell's score for that label, based on Spearman correlation across markers compared to the reference
"delta": the difference between the score for the "best" label and the second-best score. A low delta might indicate uncertainty in the label assignment
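As a rough illustration of how these columns can be turned into per-cell annotations, the sketch below keeps the "best" label only when "delta" clears a cutoff. It assumes the "best" and "delta" columns have already been pulled out of the result into plain lists; the values are the first three cells from the output posted above. The cutoff of 0.05 and the "unassigned" fallback are arbitrary choices for this example, not singler defaults.

```python
# First three cells from the posted output.
best = ["Deep-layer intratelencephalic", "Hippocampal CA4", "Cerebellar inhibitory"]
delta = [0.10845515, 0.00922214, 0.06185248]

MIN_DELTA = 0.05  # hypothetical cutoff; tune it for your own data


def confident_labels(best, delta, min_delta, fallback="unassigned"):
    """Keep a label only when its margin over the runner-up is large enough."""
    return [
        label if d >= min_delta else fallback
        for label, d in zip(best, delta)
    ]


labels = confident_labels(best, delta, MIN_DELTA)
print(labels)
# ['Deep-layer intratelencephalic', 'unassigned', 'Cerebellar inhibitory']
```

The resulting list has one entry per test cell, in the same order as the columns of the test matrix, so it can be stored alongside the other per-cell metadata.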

The OSCA book is a good place to start and goes into more detail.

LTLA commented 6 months ago

what he said

jkanche commented 3 months ago

Closing this, but do reach out if you have any questions.

BiocPy / singler

custom reference h5ad #22