We will need some way to measure confidence in the label assignments obtained from running SingleR with a given reference. One way of doing this would be to shuffle the sample labels of the reference dataset before training (identifying marker genes) and classifying cell types in the test dataset. Repeating this over a set number of permutations gives a distribution of cell type assignments for each cell in the test dataset. We can then compare the true label, meaning the one assigned with the unshuffled reference, to that distribution to obtain a p-value.
Before we can do this, we will need to figure out the following:
The run time for training and classifying, so that we can estimate how long the full set of permutations will take.
A score or other output value that is comparable across runs of SingleR, to use when comparing the true label to the distribution. For example, if the score computed by SingleR is suitable, we would compare each cell's score for its true label to that cell's score for the same label in each permutation.
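If a comparable per-cell score turns out to be available, the comparison described above reduces to an empirical p-value. The following is a minimal sketch in Python with NumPy, not SingleR code: the function name `empirical_p_values` and the array layout are assumptions made for illustration. It uses the common convention of adding 1 to both the numerator and the denominator so that a p-value is never exactly zero.

```python
import numpy as np


def empirical_p_values(observed, null_scores):
    """One-sided empirical p-value per cell.

    observed:    shape (n_cells,), score of each cell for its true label
                 from the run with the unshuffled reference.
    null_scores: shape (n_perm, n_cells), score of each cell for that same
                 label in each permuted-reference run.
    """
    observed = np.asarray(observed, dtype=float)
    null_scores = np.asarray(null_scores, dtype=float)
    if null_scores.ndim != 2 or null_scores.shape[1] != observed.shape[0]:
        raise ValueError("null_scores must have shape (n_perm, n_cells)")

    n_perm = null_scores.shape[0]
    # Count permutations whose score is at least as high as the observed one.
    n_exceed = (null_scores >= observed).sum(axis=0)
    # The +1 terms count the observed run itself as one draw from the null.
    return (n_exceed + 1) / (n_perm + 1)


# Hypothetical scores for two cells across three permutations.
p = empirical_p_values([0.9, 0.2], [[0.1, 0.3], [0.2, 0.1], [0.95, 0.5]])
print(p)
```

With this convention the smallest attainable p-value is 1 / (n_perm + 1), so the number of permutations bounds the resolution and should be weighed against the run time of each permutation.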
We will probably want a function that takes as input the sce object of interest and the reference data to be used. Within that function, the reference labels will be permuted before each run of SingleR. We should also use parallelization wherever possible.
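The shape of such a function can be sketched as follows. This is a language-agnostic illustration in Python, under loud assumptions: `toy_scores` is a hypothetical stand-in for a SingleR run (it correlates each test cell with the mean reference profile of each label, which is not SingleR's algorithm), and `permutation_scores`, its parameters, and the synthetic data are all invented for the example. Threads are used for portability; a real implementation would use whatever parallel backend suits the actual classifier.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def toy_scores(test, ref, ref_labels, label_order):
    """Hypothetical stand-in for one classifier run (NOT SingleR).

    Rows are cells/samples, columns are genes. Returns an
    (n_cells, n_labels) matrix of Pearson correlations between each test
    cell and the mean reference profile of each label.
    """
    centroids = np.stack([ref[ref_labels == lab].mean(axis=0) for lab in label_order])
    t = test - test.mean(axis=1, keepdims=True)
    c = centroids - centroids.mean(axis=1, keepdims=True)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    return t @ c.T


def permutation_scores(test, ref, ref_labels, n_perm=100, n_workers=4, seed=0):
    """Score the test cells with the true labels and with n_perm shuffles."""
    ref_labels = np.asarray(ref_labels)
    label_order = np.unique(ref_labels)
    observed = toy_scores(test, ref, ref_labels, label_order)

    # One independent seed per permutation keeps results reproducible
    # regardless of how the workers are scheduled.
    child_seeds = np.random.SeedSequence(seed).spawn(n_perm)

    def one_run(child_seed):
        rng = np.random.default_rng(child_seed)
        shuffled = rng.permutation(ref_labels)  # shuffle labels before "training"
        return toy_scores(test, ref, shuffled, label_order)

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        null = np.stack(list(pool.map(one_run, child_seeds)))
    return label_order, observed, null  # null: (n_perm, n_cells, n_labels)


# Synthetic data: two hypothetical cell types, 50 genes.
rng = np.random.default_rng(1)
n_genes = 50
profiles = 3 * rng.normal(size=(2, n_genes))
ref_labels = np.array(["A"] * 20 + ["B"] * 20)
ref = np.vstack([
    profiles[0] + rng.normal(size=(20, n_genes)),
    profiles[1] + rng.normal(size=(20, n_genes)),
])
test = profiles[0] + rng.normal(size=(5, n_genes))  # five test cells of type A

labels, observed, null = permutation_scores(test, ref, ref_labels, n_perm=50)

# Compare each cell's score for its assigned label to the permuted runs.
cells = np.arange(test.shape[0])
assigned = observed.argmax(axis=1)
obs_assigned = observed[cells, assigned]
null_assigned = null[:, cells, assigned]  # (n_perm, n_cells)
p_values = ((null_assigned >= obs_assigned).sum(axis=0) + 1) / (null.shape[0] + 1)
print(labels[assigned], p_values)
```

Each permutation is independent of the others, which is why the loop over permutations is the natural place to parallelize. Timing a single call to the classifier and multiplying by the number of permutations (divided by the number of workers) gives a rough estimate of the total run time.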
After evaluating some of the results from SingleR, we are not immediately planning on using this for the QC report, but we may return to it at a later point.