AllenInstitute / cell_type_mapper

Repository for storing prototype functionality implementations for the BKP

change in output information #3

Closed mkunst23 closed 10 months ago

mkunst23 commented 1 year ago

Hi,

I have a request for the simple correlation-based mapping (flat mapping). In addition to the best-correlated cell type per query cell with its average correlation score, can you also output a list of the 25 next-best clusters with their associated correlation scores?

Thanks, Michael

leechangkyu commented 1 year ago

Michael, the correlation mapping result has a field “map.freq” in addition to best.map.df. “map.freq” reports, for each cell, every cluster that the cell was mapped to across the N (default 100) bootstrap iterations, along with the average correlation.

Please check whether this output serves your purpose.

Thanks, CK

In reply to: Michael Kunst, Friday, June 16, 2023, [AllenInstitute/knowledge_graph_prototypes] change in output information (Issue #3)

danielsf commented 1 year ago

@mkunst23

Given that flat mapping works as follows

1) Randomly select 90% of marker genes
2) Find the most correlated cluster
3) Repeat (1) and (2) 100 times with a different 90% set of marker genes
4) Choose the cluster that came up "most correlated" in the plurality of the 100 iterations

How do you want to define "25 next best clusters"? Is this the 25 clusters that got the 2nd through 26th most votes from bootstrapping?

Or do we need to choose the N most correlated clusters in (2) and come up with a more complicated "vote counting" scheme that accounts for "clusterA was most correlated 15 times and second-most-correlated 10 times..."?
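The four steps above, together with the simpler "runners up by vote count" option, can be sketched roughly as follows. This is only an illustration under stated assumptions, not the mapper's actual code: it assumes Pearson correlation against per-cluster mean expression profiles, and the names `pearson`, `flat_map_cell`, and `cluster_means` are hypothetical.

```python
import random
from collections import Counter, defaultdict


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def flat_map_cell(query, cluster_means, n_iter=100, frac=0.9,
                  n_runners_up=25, seed=0):
    """Bootstrap one query cell against cluster mean profiles.

    query: list of marker gene expression values for one cell.
    cluster_means: dict mapping cluster name -> list of mean values
        (same gene order as query). Both are hypothetical inputs.
    Returns (best, runners_up); each entry is
    (cluster, fraction_of_votes, average_correlation_when_chosen).
    """
    rng = random.Random(seed)
    n_genes = len(query)
    n_keep = min(n_genes, max(2, round(frac * n_genes)))
    votes = Counter()
    corr_sum = defaultdict(float)
    for _ in range(n_iter):
        # Step 1: randomly select a subset of the marker genes.
        genes = rng.sample(range(n_genes), n_keep)
        q = [query[g] for g in genes]
        # Step 2: find the most correlated cluster on that subset.
        best_name, best_corr = None, float("-inf")
        for name, profile in cluster_means.items():
            c = pearson(q, [profile[g] for g in genes])
            if c > best_corr:
                best_name, best_corr = name, c
        # Step 3: tally the vote and remember the winning correlation.
        votes[best_name] += 1
        corr_sum[best_name] += best_corr
    # Step 4: rank clusters by votes; the first entry is the plurality winner.
    ranked = [(name, n / n_iter, corr_sum[name] / n)
              for name, n in votes.most_common()]
    return ranked[0], ranked[1:1 + n_runners_up]
```

In this sketch a cluster only shows up as a runner up if it won at least one bootstrap iteration, so a cell may have fewer than 25 runners up, and the average correlation of a runner up is taken over the iterations it won.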

mkunst23 commented 1 year ago

Hi Scott,

I would pick the first option. That way we can measure mapping quality by how often the mapping confuses the majority cluster with the non-majority clusters.

danielsf commented 1 year ago

So glad you said that: it will be the easiest to implement (once I can focus on this, which will clearly be the middle of next week).

danielsf commented 1 year ago

@mkunst23

I am finally getting around to addressing this issue.

My initial thought was to record the 25 "runner up" clusters and their average correlation coefficients in the extended output JSON file. This, however, would blow up that already large file from 2 GB to 16 GB (for the 4 million cell MERFISH data), so I think I may need to abandon my dream of an output JSON blob and accept the reality that we need to use a pandas dataframe written out to HDF5.

I have two schemes in mind. I've simulated examples here

/allen/aibs/technology/danielsf/knowledge_base/scratch/output_design

many_df.h5 records each level of the taxonomy in a separate dataframe. In Python, you would get the dataframe of cluster assignments with

import pandas
cluster_df = pandas.read_hdf('many_df.h5', key='CCN20230504_CLUS')

Similarly, you would get the dataframe of subclass assignments with

subclass_df = pandas.read_hdf('many_df.h5', key='CCN20230504_SUBC')

etc. Each dataframe has the same columns. The runner up assignments are in columns named runner_up_[0-25] and the corresponding correlation coefficients are in runner_up_[0-25]_cor (please note that this data is all randomly generated; I just wanted to simulate the shape).

single_df.h5 records all of the results at all taxonomic levels in a single dataframe. The columns are prefixed with the name of the taxonomic level, i.e.

CCN20230504_CLAS_assignment,
CCN20230504_CLAS_bootstrapping_probability,
...
CCN20230504_SUBC_assignment,
CCN20230504_SUBC_bootstrapping_probability,
...

The dataframe can be read in with

import pandas
df = pandas.read_hdf('single_df.h5', key='results')

I prefer the many_df.h5 shape. I do not like prefixing the column names with the taxonomic level. I'm not a fan of long column names. Is there a shape you prefer (can either of these be easily accessed in R)?
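To make the relationship between the two shapes concrete, here is a small in-memory sketch that splits a prefixed single dataframe into one dataframe per taxonomic level with unprefixed column names. The helper `split_by_level` and the toy values are hypothetical, and the `CCN20230504_SUBC_bootstrapping_probability` column is assumed by analogy with the CLAS columns.

```python
import pandas as pd


def split_by_level(single_df, levels, sep="_"):
    """Split a level-prefixed dataframe into one dataframe per level.

    Columns named '<level><sep><name>' end up as '<name>' in the
    dataframe stored under '<level>'. Hypothetical helper.
    """
    per_level = {}
    for level in levels:
        prefix = level + sep
        cols = [c for c in single_df.columns if c.startswith(prefix)]
        sub = single_df[cols].copy()
        # Drop the level prefix so every per-level frame shares column names.
        sub.columns = [c[len(prefix):] for c in cols]
        per_level[level] = sub
    return per_level


# Toy data; the values are placeholders, not real mapping output.
single_df = pd.DataFrame(
    {
        "CCN20230504_CLAS_assignment": ["class_a", "class_b"],
        "CCN20230504_CLAS_bootstrapping_probability": [0.9, 0.7],
        "CCN20230504_SUBC_assignment": ["subclass_x", "subclass_y"],
        "CCN20230504_SUBC_bootstrapping_probability": [0.8, 0.6],
    },
    index=["cell_0", "cell_1"],
)
per_level = split_by_level(
    single_df, ["CCN20230504_CLAS", "CCN20230504_SUBC"]
)
```

Each per-level frame could then be written under its own key with `DataFrame.to_hdf(path, key=level)`, which relies on the optional PyTables dependency, so that it is read back by the `pandas.read_hdf(..., key=...)` calls shown earlier.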

danielsf commented 10 months ago

This was addressed a long time ago. The mapping tool now has an n_runners_up config parameter that specifies how many runner up assignments to output.