Closed mkunst23 closed 10 months ago
Michael, the correlation mapping result has a field “map.freq” in addition to best.map.df. For each cell, “map.freq” reports every cluster the cell was mapped to across the N (default 100) bootstrap iterations, along with the average correlation.
Please check whether this output serves your purpose.
Thanks CK
From: Michael Kunst
Sent: Friday, June 16, 2023 7:05:11 AM
To: AllenInstitute/knowledge_graph_prototypes
Subject: [AllenInstitute/knowledge_graph_prototypes] change in output information (Issue #3)
Hi,
I have a request for the simple correlation-based mapping (flat mapping). In addition to the best correlated cell type per query cell with its average correlation score, can you also output a list of the 25 next best clusters with their associated correlation scores?
Thanks, Michael
@mkunst23
Given that flat mapping works as follows
1) Randomly select 90% of marker genes
2) Find the most correlated cluster
3) Repeat (1) and (2) 100 times with a different 90% set of marker genes
4) Choose the cluster that came up "most correlated" in the plurality of the 100 iterations
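The four steps above can be sketched in plain Python. This is only an illustration of the procedure as described in this comment, not the mapping tool's actual code: the function names (`pearson`, `flat_map_cell`), the use of Pearson correlation, and the toy cluster profiles are all assumptions made for the example.

```python
import random
from collections import Counter


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def flat_map_cell(cell, cluster_means, n_iter=100, gene_fraction=0.9, seed=0):
    """
    Hypothetical helper (not the tool's API).

    cell: expression values, one per marker gene.
    cluster_means: dict of cluster name -> mean expression per marker gene.
    Returns (winner, votes); votes counts how often each cluster was
    the most correlated one across the bootstrap iterations.
    """
    rng = random.Random(seed)
    n_genes = len(cell)
    n_keep = max(2, int(round(gene_fraction * n_genes)))
    votes = Counter()
    for _ in range(n_iter):
        # (1) randomly select a subset of the marker genes
        genes = rng.sample(range(n_genes), n_keep)
        sub_cell = [cell[g] for g in genes]
        # (2) find the most correlated cluster on that subset
        best = max(
            cluster_means,
            key=lambda name: pearson(
                sub_cell, [cluster_means[name][g] for g in genes]
            ),
        )
        votes[best] += 1
    # (4) the plurality winner over all iterations
    winner = votes.most_common(1)[0][0]
    return winner, votes


# Toy data: three made-up cluster profiles over 10 marker genes.
cluster_means = {
    "A": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "B": [10, 9, 8, 7, 6, 5, 4, 3, 2, 1],
    "C": [3, 1, 4, 1, 5, 9, 2, 6, 5, 3],
}
cell = [1.2, 1.9, 3.1, 4.0, 5.2, 5.8, 7.1, 8.0, 9.1, 9.9]
winner, votes = flat_map_cell(cell, cluster_means)
```

The `votes` counter is the piece that matters for the question below: it already records how often every cluster won, not just the plurality winner.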
How do you want to define "25 next best clusters"? Is this the 25 clusters that got the 2nd through 26th most votes from bootstrapping?
Or do we need to choose the N most correlated clusters in (2) and come up with a more complicated "vote counting" scheme that accounts for "clusterA was most correlated 15 times and second-most-correlated 10 times..."?
Hi Scott,
I would pick the first option. That way we can measure mapping quality by how often the mapping confuses a cell with a non-majority cluster.
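A minimal sketch of the first option, assuming the bootstrap votes for one cell are available as a counter of cluster name to number of wins. The function `rank_by_votes` and the output keys are hypothetical and chosen for illustration; the thread does not show how the tool implements this.

```python
from collections import Counter


def rank_by_votes(votes, n_runners_up=25):
    """
    Hypothetical helper: turn per-cell bootstrap votes into a winning
    assignment plus the next-best clusters ordered by vote count.
    """
    n_iter = sum(votes.values())
    ranked = votes.most_common()  # highest vote count first
    best, best_votes = ranked[0]
    runners_up = [
        (name, count / n_iter) for name, count in ranked[1:1 + n_runners_up]
    ]
    return {
        "assignment": best,
        "bootstrapping_probability": best_votes / n_iter,
        "runners_up": runners_up,
    }


# Made-up votes for one cell over 100 bootstrap iterations.
votes = Counter({"clusterA": 70, "clusterB": 20, "clusterC": 10})
result = rank_by_votes(votes)

# Fraction of iterations in which a non-majority cluster won.
confusion = sum(fraction for _, fraction in result["runners_up"])
```

In this scheme only clusters that won at least one iteration can appear as runners up, so a cell may have fewer than `n_runners_up` entries.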
So glad you said that: it will be the easiest to implement (once I can focus on this, which will clearly be the middle of next week).
@mkunst23
I am finally getting around to addressing this issue.
My initial thought was to record the 25 "runner up" clusters and their average correlation coefficients in the extended output JSON file. This, however, would blow up that already large file from 2 GB to 16 GB (for the 4 million cell MERFISH data), so I think I may need to abandon my dream of an output JSON blob and accept the reality that we need to use a pandas dataframe written out to HDF5.
I have two schemes in mind. I've simulated examples here
/allen/aibs/technology/danielsf/knowledge_base/scratch/output_design
many_df.h5 records each level of the taxonomy in a separate dataframe. In Python, you would get the dataframe of cluster assignments with
import pandas
cluster_df = pandas.read_hdf('many_df.h5', key='CCN20230504_CLUS')
Similarly, you would get the dataframe of subclass assignments with
subclass_df = pandas.read_hdf('many_df.h5', key='CCN20230504_SUBC')
etc. Each dataframe has the same columns. The runner up assignments are in columns named runner_up_[0-25] and the corresponding correlation coefficients are in runner_up_[0-25]_cor (please note that this data is all randomly generated; I just wanted to simulate the shape).
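As a rough illustration of working with the runner_up_N / runner_up_N_cor column layout, the sketch below reshapes such a dataframe into one row per (cell, rank). It builds a small in-memory stand-in rather than reading many_df.h5; the `cell_id` and `assignment` columns, the values, and the helper `runners_up_long` are assumptions made for the example.

```python
import pandas as pd

# Simulated stand-in for one per-level dataframe (values are made up).
cluster_df = pd.DataFrame({
    "cell_id": ["cell_0", "cell_1"],
    "assignment": ["clusterA", "clusterB"],
    "runner_up_0": ["clusterB", "clusterA"],
    "runner_up_0_cor": [0.81, 0.77],
    "runner_up_1": ["clusterC", "clusterC"],
    "runner_up_1_cor": [0.64, 0.52],
})


def runners_up_long(df, id_col="cell_id"):
    """Hypothetical helper: one row per (cell, runner-up rank)."""
    # Find the rank numbers from the runner_up_N column names.
    ranks = sorted(
        int(col.split("_")[-1])
        for col in df.columns
        if col.startswith("runner_up_") and not col.endswith("_cor")
    )
    pieces = []
    for rank in ranks:
        piece = df[[id_col, f"runner_up_{rank}", f"runner_up_{rank}_cor"]].copy()
        piece.columns = [id_col, "runner_up", "correlation"]
        piece.insert(1, "rank", rank)
        pieces.append(piece)
    long_df = pd.concat(pieces, ignore_index=True)
    return long_df.sort_values([id_col, "rank"]).reset_index(drop=True)


long_df = runners_up_long(cluster_df)
```

The long form makes it easy to filter or aggregate by rank without referring to numbered column names.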
single_df.h5 records all of the results at all taxonomic levels in a single dataframe. The columns are prefixed with the name of the taxonomic level, i.e.
CCN20230504_CLAS_assignment,
CCN20230504_CLAS_bootstrapping_probability,
...
CCN20230504_SUBC_assignment,
CCN20230504_SUBC_bootstrapping_probability,
...
The dataframe can be read in with
import pandas
df = pandas.read_hdf('single_df.h5', key='results')
I prefer the many_df.h5 shape. I do not like prefixing the column names with the taxonomic level. I'm not a fan of long column names. Is there a shape you prefer (can either of these be easily accessed in R)?
This was addressed a long time ago. The mapping tool now has an n_runners_up config parameter that specifies how many runner up assignments to output.