AllenInstitute / cell_type_mapper

Repository for storing prototype functionality implementations for the BKP

Missing runner-up assignments #13

Closed mdeber closed 5 months ago

mdeber commented 6 months ago

Hi,

So far I haven't been getting runner-up assignments in the output files. We're mapping to the ABC WMB taxonomy 20231215 (using the WMB-10Xv2 and WMB-10Xv3 expression matrices), and our cell_type_mapper source is up to date with the main branch.

python -m cell_type_mapper.cli.from_specified_markers \
    --query_path="/path/to/data.h5ad" \
    --extended_result_path="${out_dir}/extended_result.json" \
    --csv_result_path="${out_dir}/result_table.csv" \
    --precomputed_stats.path="/path/to/precomputed_stats.h5" \
    --query_markers.serialized_lookup="/path/to/abc_marker_genes.json" \
    --drop_level="supertype" \
    --flatten=True \
    --type_assignment.normalization="raw" \
    --type_assignment.bootstrap_iteration 1 \
    --type_assignment.bootstrap_factor 1.0 \
    --type_assignment.n_runners_up 5;

The output looks like:

$ head -28 ${out_dir}/extended_result.json
{
  "results": [
    {
      "CCN20230722_CLUS": {
        "assignment": "CS20230722_CLUS_4347",
        "bootstrapping_probability": 1.0,
        "avg_correlation": 0.6862122410689558,
        "runner_up_assignment": [],
        "runner_up_correlation": [],
        "runner_up_probability": []
      },
      "cell_id": "TGCCTGTTCGTTAGTA-1",
      "CCN20230722_SUPT": {
        "assignment": "CS20230722_SUPT_0974",
        "bootstrapping_probability": 1.0,
        "avg_correlation": 0.6862122410689558
      },
      "CCN20230722_SUBC": {
        "assignment": "CS20230722_SUBC_243",
        "bootstrapping_probability": 1.0,
        "avg_correlation": 0.6862122410689558
      },
      "CCN20230722_CLAS": {
        "assignment": "CS20230722_CLAS_24",
        "bootstrapping_probability": 1.0,
        "avg_correlation": 0.6862122410689558
      }
    },

And the "runner_up_*" fields are always empty lists:

$ grep "runner_up_" ${out_dir}/extended_result.json | sort | uniq
        "runner_up_assignment": [],
        "runner_up_correlation": [],
        "runner_up_probability": []

Our immediate aim is to use the runner-up assignments to blacklist a couple of recurring problematic cell type assignments.

Thanks in advance!

Mike DeBerardine, PhD
Labs of Fenna Krienen and Cate Peña
Princeton Neuroscience Institute

danielsf commented 6 months ago

@mdeber

To get runner-up assignments, you need bootstrap_iteration > 1 and bootstrap_factor < 1.0.

bootstrap_iteration sets how many times each cell is run through the mapping. If you only have one mapping, there will be no runners up, since you are essentially running an election with one voter.

bootstrap_factor sets the fraction of all marker genes that are used in each iteration. For example, if bootstrap_factor = 0.75, each individual iteration will use a randomly selected three quarters of the available marker genes to do its mapping. If bootstrap_factor = 1.0, each iteration will use all of the marker genes and come to the same result (each voter in your election will cast the exact same vote; apparently, I'm sticking with this analogy).
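To make the election analogy concrete, here is a toy simulation. It is not cell_type_mapper's actual code: the helper names (pearson, toy_votes, runners_up), the random "type" profiles, and the plain Pearson correlation are all assumptions made for illustration. It only shows why one iteration, or a factor of 1.0, leaves nothing for a runner-up list.

```python
import random
from collections import Counter


def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def toy_votes(cell, type_means, n_iter, factor, rng):
    """Run n_iter 'voters'; each one sees a random fraction of the genes."""
    n_genes = len(cell)
    n_keep = max(2, round(factor * n_genes))
    votes = Counter()
    for _ in range(n_iter):
        # Sorted so that factor = 1.0 yields the identical gene list each time.
        genes = sorted(rng.sample(range(n_genes), n_keep))
        sub_cell = [cell[g] for g in genes]
        best = max(
            type_means,
            key=lambda name: pearson(sub_cell, [type_means[name][g] for g in genes]),
        )
        votes[best] += 1
    return votes


def runners_up(votes):
    """Everything except the top vote-getter, with its share of the votes."""
    total = sum(votes.values())
    return [(name, count / total) for name, count in votes.most_common()[1:]]


rng = random.Random(0)
n_genes = 40
# Hypothetical mean expression profiles for three made-up cell types.
type_means = {f"type_{i}": [rng.gauss(0, 1) for _ in range(n_genes)] for i in range(3)}
# A made-up cell that sits between type_0 and type_1, plus noise.
cell = [0.5 * a + 0.5 * b + rng.gauss(0, 0.3)
        for a, b in zip(type_means["type_0"], type_means["type_1"])]

for n_iter, factor in [(1, 1.0), (100, 1.0), (100, 0.5)]:
    votes = toy_votes(cell, type_means, n_iter, factor, rng)
    print(n_iter, factor, runners_up(votes))
```

With one iteration there is a single vote, and with factor = 1.0 every iteration sees the same genes, so in both cases only one type can ever receive votes. Only when the gene subsets differ between iterations can a second type pick up some of them.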

Things to be aware of:

With flatten = True, setting bootstrap_iteration = N will slow the mapping down by a factor of N relative to bootstrap_iteration = 1.

I suspect you will get more interesting runner-up assignments with a lower bootstrap_factor (maybe as low as 0.5?). The lower the value of bootstrap_factor, the more each bootstrap iteration's subset of marker genes will differ from the others.
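Once a rerun produces non-empty lists, something like the following could pull the runner-up assignments out of extended_result.json. This is a rough sketch based only on the JSON layout shown earlier in this thread. The function names are made up, the second example record and its values are invented for illustration, and it assumes that runner_up_assignment and runner_up_probability are aligned element by element.

```python
import json


def load_results(path):
    """Read the per-cell records from an extended_result.json file."""
    with open(path) as handle:
        return json.load(handle)["results"]


def collect_runners_up(results, level):
    """Return {cell_id: [(assignment, probability), ...]} for one taxonomy level.

    Levels without runner-up fields simply yield an empty list.
    """
    collected = {}
    for record in results:
        entry = record.get(level, {})
        names = entry.get("runner_up_assignment", [])
        probs = entry.get("runner_up_probability", [])
        # Assumption: the two lists are index-aligned.
        collected[record["cell_id"]] = list(zip(names, probs))
    return collected


# In-memory stand-in for load_results("extended_result.json").
example_results = [
    {   # Shaped like the record in the original post (no runners-up).
        "cell_id": "TGCCTGTTCGTTAGTA-1",
        "CCN20230722_CLUS": {
            "assignment": "CS20230722_CLUS_4347",
            "runner_up_assignment": [],
            "runner_up_correlation": [],
            "runner_up_probability": [],
        },
    },
    {   # Invented record showing what a populated entry might look like.
        "cell_id": "EXAMPLE-CELL-2",
        "CCN20230722_CLUS": {
            "assignment": "EXAMPLE_CLUS_A",
            "runner_up_assignment": ["EXAMPLE_CLUS_B", "EXAMPLE_CLUS_C"],
            "runner_up_correlation": [0.61, 0.55],
            "runner_up_probability": [0.2, 0.1],
        },
    },
]

by_cell = collect_runners_up(example_results, "CCN20230722_CLUS")
for cell_id, runner_list in by_cell.items():
    print(cell_id, runner_list)
```

From there, a blacklist could be applied by walking each cell's list and taking the first runner-up that is not one of the problematic types.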