KaczorowskiLab opened this issue 1 month ago
Hi @KaczorowskiLab
This is a use case I did not anticipate. I am assuming that the problem is that CS20230722_CLUS_0001 does not appear in the region for which you are constructing a taxonomy. The way the code is set up, the nodes on the leaf level of the taxonomy tree (in our case the CLUS_NNNN nodes) end up pointing to lists of the cells associated with that leaf node. If you create the taxonomy tree with `cell_metadata_path = None`, this step gets skipped, and each cluster points to an empty list. If you create the taxonomy tree with a non-NULL `cell_metadata_path`, it is expected that every cluster will have some cells associated with it (if not, you would probably run into trouble when you go to compute the mean expression values of cell type clusters, since there will be divide-by-zero errors running around).
Unless you want to re-engineer `cluster_annotation_term.csv` and `cluster_to_cluster_annotation_membership.csv` to only include cell types that occur in the brain region you are working with, I think this is going to require a fix on our end.
What I don't know for sure (and would appreciate your feedback on) is: what is the right way to handle this? My inclination is to add a step to the taxonomy construction code that trims the tree, removing all cell types that have zero cells assigned to them (effectively pretending that cell types which occur outside of your region of interest do not exist). Is this acceptable to you, or do you need/want those cell types to persist in the taxonomy*?
*if they did persist in the taxonomy, I don't think that anything would ever get mapped to them, as they would all have NULL mean expression profiles.
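A rough sketch of what such a trimming step could look like (this is a hypothetical illustration of the idea, assuming the tree is a dict keyed by level, with leaf nodes mapping to cell lists and internal nodes mapping to child lists; it is not the repository's actual code):

```python
import copy

def prune_empty_leaves(tree: dict) -> dict:
    """Hypothetical sketch of the proposed trimming step: drop leaf
    cell types with zero assigned cells, then drop any internal node
    left with no children."""
    tree = copy.deepcopy(tree)
    hierarchy = tree['hierarchy']  # e.g. [..., 'CCN20230722_CLUS']
    leaf_level = hierarchy[-1]

    # remove leaves that point at empty cell lists
    tree[leaf_level] = {
        node: cells for node, cells in tree[leaf_level].items() if cells
    }

    # walk upward: prune children that no longer exist, then any
    # parent left childless
    for parent_level, child_level in zip(hierarchy[-2::-1], hierarchy[::-1]):
        surviving = set(tree[child_level])
        tree[parent_level] = {
            node: [c for c in children if c in surviving]
            for node, children in tree[parent_level].items()
        }
        tree[parent_level] = {
            node: children for node, children in tree[parent_level].items()
            if children
        }
    return tree
```

The inheritance structure of surviving nodes is untouched; only empty branches disappear.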
Let me know what you think of this rough sketch of a solution.
By the way: I am excited to see what comes of this analysis. Though I didn't anticipate this particular use case, that was just a deficiency in my imagination. What you are proposing ought to be possible in a perfect world.
Thank you for the detailed explanation and proposed solutions @danielsf! I appreciate the clarification regarding the root cause of the issue. I will give re-engineering the cluster-specific CSVs a shot.
Since re-engineering the CSV files has constraints and potential problems, I find the second solution, trimming the tree of cell types with zero cells, to be practical and efficient.
However, I have some concerns about the potential implications of trimming the tree in certain scenarios.
To mitigate these concerns, would a hybrid approach be possible, in which cell types with zero counts are flagged and excluded rather than permanently removed? This would retain the complete taxonomy structure for future use while avoiding the immediate computational issues.
Please let me know your thoughts on this approach. Thank you for your assistance!
Flagging and excluding empty cell types would take a lot more coding to implement, since several steps in the pipeline involve iterating over the taxonomy and all of its nodes. Each of those steps would have to be altered to check for the validity of the cell type.
In principle, we could just fill in the leaf level of the taxonomy with empty lists of cells wherever necessary. I think that will bog down the code, though, as the marker gene finder tries to find markers to distinguish between cell types with no data assigned to them. It would also undermine the advantage of doing a region-specific taxonomy (which, I assume, is finding marker genes that are targeted for discriminating only between cell types that exist in the chosen region), since the taxonomy would still appear to have all ~ 5,000 cell type clusters from the full Whole Mouse Brain data release.
This branch implements the solution I originally proposed (it had the virtue of only taking a few hours to code up). If you add `--do_pruning True` to your command line call to `cell_type_mapper.cli.precompute_stats_abc`, it will add a step, before validating the taxonomy, that removes any nodes that are not directly connected to lists of cells. Nothing about the inheritance structure of the existing taxonomy will be changed. It's just that cell types you don't "need" (because they aren't in your reference dataset) will be removed from the tree.
If you want a reference taxonomy that includes all of the nodes, you could do something like

```python
from cell_type_mapper.taxonomy.taxonomy_tree import TaxonomyTree

tree = TaxonomyTree.from_data_release(
    cell_metadata_path='/path/to/full/cell_metadata.csv',
    cluster_annotation_path='/path/to/cluster_annotation_term.csv',
    cluster_membership_path='/path/to/cluster_to_cluster_annotation_membership.csv',
    hierarchy=['CCN20230722_CLAS', 'CCN20230722_SUBC', 'CCN20230722_SUPT', 'CCN20230722_CLUS'],
    do_pruning=False
)

# save the tree to a text file in JSON serialized form
with open('reference_taxonomy.json', 'w') as src:
    src.write(tree.to_str(indent=2, drop_cells=True))
```

The form of the taxonomy tree as serialized to `reference_taxonomy.json` is documented on this page.
Does this solution work for you?
Thanks @danielsf for the quick implementation! I will give this solution a try. I am curious to know if one would expect to see major differences going this route when compared to building a single region-specific h5ad file and using `python -m cell_type_mapper.cli.precompute_stats_scrattch` instead?
Building a region-specific h5ad file and calling `precompute_stats_scrattch` should achieve exactly the same result as the solution in the branch I pointed you towards. `precompute_stats_scrattch` infers the cell type taxonomy tree from the annotations applied to each individual cell in the h5ad file's `obs` dataframe. Therefore, there cannot be any cell types with no cells associated with them. If no cells are annotated to belong to a cell type, that cell type will not appear in the inferred taxonomy tree.
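The inference step can be pictured like this (a minimal sketch with made-up column names and data, not the actual implementation):

```python
import pandas as pd

# Toy obs dataframe: every cell carries its full annotation, so the
# tree can be rebuilt by grouping, and a cell type with zero cells
# simply never appears in the result
obs = pd.DataFrame({
    'class': ['Glut', 'Glut', 'GABA'],
    'subclass': ['L2/3 IT', 'L5 ET', 'Sst'],
    'cluster': ['c0001', 'c0017', 'c0404'],
})

tree = {
    'hierarchy': ['class', 'subclass', 'cluster'],
    'class': {
        cls: sorted(grp['subclass'].unique())
        for cls, grp in obs.groupby('class')
    },
    'subclass': {
        sub: sorted(grp['cluster'].unique())
        for sub, grp in obs.groupby('subclass')
    },
    # leaf level: each cluster points at the cells assigned to it
    'cluster': {
        clus: grp.index.tolist()
        for clus, grp in obs.groupby('cluster')
    },
}
```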
The problem you've run into with `precompute_stats_abc` is that the inheritance structure of the taxonomy tree is defined in `cluster_annotation_term.csv` and `cluster_to_cluster_annotation_membership.csv`, while the assignment of cells to cell types is defined in `cell_metadata.csv`. There is technically nothing keeping those files in sync. At the point where you replaced the full `cell_metadata.csv` with a single-region `cell_metadata.csv`, you brought the cell-to-cluster assignment out of sync with the cluster-to-supertype-to-subclass-to-class assignment. There were now clusters in `cluster_annotation_term.csv` and `cluster_to_cluster_annotation_membership.csv` that had no associated cells.
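One way to catch this drift before building the taxonomy would be a small consistency check along these lines (a sketch; the `cluster_alias` column name is an assumption about the files' layout):

```python
import pandas as pd

def orphan_clusters(membership: pd.DataFrame,
                    cell_metadata: pd.DataFrame,
                    key: str = 'cluster_alias') -> set:
    """Clusters present in the annotation membership file but with no
    cells in a (possibly subsetted) cell_metadata dataframe -- the
    out-of-sync condition described above."""
    return set(membership[key]) - set(cell_metadata[key])

# usage sketch:
# membership = pd.read_csv('cluster_to_cluster_annotation_membership.csv')
# cell_metadata = pd.read_csv('cell_metadata.csv')
# print(orphan_clusters(membership, cell_metadata))
```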
So: if you feel comfortable assembling a region-specific h5ad file in which the cell type annotations are in the `obs` dataframe, then: yes. That would be another way to work around this problem.
Incidentally, this Jupyter notebook constructs an h5ad file with cartoon data and a cartoon taxonomy and then walks through all of the steps in the process of mapping data onto that cartoon taxonomy, starting with `precompute_stats_scrattch`, in case it is helpful.
Thank you so much for the quick clarification. I was able to complete the single-h5ad route using `python -m cell_type_mapper.cli.precompute_stats_scrattch`. However, since the WMB data is distributed across chemistries, I see an option in `cell_type_mapper.cli.map_to_on_the_fly_markers` to pass multiple precomputed files (`--reference_markers.precomputed_path_list`). My current approach collapses all of the data into a single file, but it is possible to modify the code to create multiple precompute_stats files. Does `cell_type_mapper.cli.precompute_stats_abc` with the newly implemented `--do_pruning True` also keep each region and chemistry file separate? The 10Xv2 and 10Xv3 files are easier to manage, but the 10XMulti files need an additional subsetting step since all regions are collapsed into one file.
The new branch (`precompute_stats_abc` with `--do_pruning True`) ought to keep the chemistry files separate.
Hi @danielsf,
After successfully creating region-specific multiple-chemistry files, I am running into an error that I am a bit confused about. I am running `python -m cell_type_mapper.cli.map_to_on_the_fly_markers` using the `--reference_markers.precomputed_path_list` option to supply the list of all of the reference precomputed stats files.
Not supplying `--precomputed_stats.path` in the above command gives me a `ValidationError`. When I do supply the same reference list to this argument, it complains that it is not a file. So based on this, can I only run pairwise mapping between the query and the precomputed files?
There is a subtlety that I obviously failed to capture in the documentation. `--reference_markers.precomputed_path_list` and `--precomputed_stats.path` serve two different purposes. `--reference_markers.precomputed_path_list` points to the set of precomputed stats files which are used for finding marker genes. This is the step at which different chemistries can be kept separate. `--precomputed_stats.path` is used as the source for the mean gene expression profile against which cells are compared during the actual mapping process. At this mapping stage, there is no point in keeping the chemistries separate. The way the algorithm works, a cell type can only have one mean gene expression profile for purposes of mapping.
When you created the multiple precomputed_stats files, you should have seen files like

```
output/directory/precomputed_stats.combined.h5
output/directory/precomputed_stats.WMB-10XMulti.h5
output/directory/precomputed_stats.WMB-10Xv2.h5
output/directory/precomputed_stats.WMB-10Xv3.h5
```
The `WMB-10X*` files are the files that should be passed to `--reference_markers.precomputed_path_list`. The `*combined.h5` file is the file that should be passed to `--precomputed_stats.path`. It is a single file in which each cell type is represented by the chemistry in which it was most prevalent.
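For concreteness, a hypothetical invocation might look like the following (the flag names come from the discussion above, but the list syntax and paths are assumptions; check the module's `--help` output for the exact format):

```shell
# Hypothetical sketch -- verify the list syntax with --help
python -m cell_type_mapper.cli.map_to_on_the_fly_markers \
    --reference_markers.precomputed_path_list \
        '["output/directory/precomputed_stats.WMB-10Xv2.h5",
          "output/directory/precomputed_stats.WMB-10Xv3.h5"]' \
    --precomputed_stats.path output/directory/precomputed_stats.combined.h5
```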
Sorry for the confusion. I'll try to put together an example notebook demonstrating this process (probably after SfN is past).
Quick note: the example notebook I pointed you towards intentionally sets `n_processors` to a very low value. This is because the data being processed is very small, and I want users to be able to run the example on their laptop. Marker gene selection can be very expensive, so I would recommend setting `n_processors` as high as you can on whatever machine you are using. The back-of-the-envelope estimate is that, for the full Whole Mouse Brain taxonomy, marker gene selection took 3 hours with 32 cores and ~ 64 GB of memory. The process scales as N^2, where N is the number of leaf nodes in the taxonomy. I know you have drastically reduced this number from the full 5322, but I'm not sure by how much.
Thank you for the clarification! The errors and the description had me stumped. I am hopeful about the process running without any more errors! If you are open to it, I am also happy to contribute to the documentation to showcase the region specific reference file setup as an example.
@danielsf - Is there a threshold for the minimum number of cells to be included in the reference files to lead to meaningful analysis? Another error I ran into, which took a long time to debug, was due to the sparse number of cells in one of the chemistry files contributing to the region-specific reference file. The `precompute_stats_abc` step ran without issues.
But when I moved on to the mapping step using `map_to_on_the_fly_markers`, I ran into `ValueError: All chunk dimensions must be positive`. The initial error log pointed towards a problem with the query dataset, but I ran into the same problem using the `reference_markers` pipeline. This error stemmed from the `_merge_sparse_by_pair_files` function in markers.py, which has a step that calculates chunk dimensions based on `n_up_indices` or `n_down_indices`, which end up being 0 for the reference file with the sparse data. The problem was solved by excluding this file and re-running the pipeline. In this case, the extremely sparse data in this reference file made it easy to identify the problem, but in cases where the distinction is unclear, it would be helpful to know some metric or way to validate the reference data.
Good debugging.
I don't have an off-the-cuff answer to the question that you have actually asked. I've run this code on taxonomies that contain leaf nodes with a few dozen cells in them, and the code gave somewhat reasonable results. What ultimately matters is the statistical distribution of genes in the cells you do have. Can you quickly assess the number of cells in the least populous cluster? Since a variance needs to be calculated, a cluster with 2 or fewer cells in it will put some zeros in unfortunate (but not necessarily fatal) places. Note: as long as each cluster has a sensible number of cells (again, ~ a dozen) in at least one of the chemistry files, your results should be fine. The final step in mapping only considers the most populous chemistry file for each cluster.
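A quick illustration of the variance problem with tiny clusters (a toy numpy example, not the repository's statistics code):

```python
import numpy as np

# Rows are cells, columns are genes. With ddof=1 (sample variance), a
# single-cell cluster yields nan for every gene, and two identical
# cells yield 0 -- "zeros in unfortunate places"
one_cell = np.array([[3.0, 0.0, 1.2]])
two_cells = np.array([[3.0, 0.0, 1.2],
                      [3.0, 0.0, 1.2]])

print(np.var(one_cell, axis=0, ddof=1))   # nan for every gene
print(np.var(two_cells, axis=0, ddof=1))  # 0.0 for every gene
```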
On to the question you didn't ask: I've made a quick bugfix to the code in the `trim/tree/240918` branch such that `chunks` is set to `None` if there is nothing to chunk in that code block. That ought to allow the code to run to completion, including the offending chemistry file. Do a pull and try again. Let me know if I am wrong and the code still crashes.
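The shape of the fix can be sketched like this (a hypothetical guard, not the actual commit; h5py refuses non-positive chunk dimensions, so an empty marker-index array has to fall back to contiguous storage):

```python
def safe_chunks(n_indices: int, max_chunk: int = 1000):
    """Return a chunk shape for an HDF5 dataset of length n_indices,
    or None (contiguous storage) when there is nothing to chunk --
    avoiding 'ValueError: All chunk dimensions must be positive'."""
    if n_indices <= 0:
        return None
    return (min(n_indices, max_chunk),)
```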
And thanks for the catch.
@KaczorowskiLab
Sorry. Doubling back this morning to add a unit test for yesterday afternoon's bugfix, I discovered that I hadn't actually fixed the problem. I just pushed another commit that actually will allow the code to run over your data. Sorry about that.
One note: in the course of constructing the test, I got a better understanding of this failure mode. The error you saw means that, for the offending `precomputed_stats.CHEMISTRY.h5` file, there were either no up-regulated or no down-regulated marker genes at all across all cell type cluster pairs in the cell type taxonomy tree. This probably means that there were far too few cells represented by that chemistry to be useful, so you really should just leave it out. It might be worthwhile to scan through your `cell_metadata.csv` file and look at how many cells are in each cell type cluster for that particular chemistry.
Thank you for the quick fix and for fixing the fix :)
You are right in your assessment that the reference file with sparse data for this attempt had too few cells to contribute (< 10). But I also anticipate the same problem creeping up if, for example, one of the chemistry files has a skewed representation of cell types. So the solution of combining a pre-check of the cell contribution per chemistry file with the fix for the chunk size seems reasonable to me.
@danielsf - I was able to successfully run both versions of the cell type mapping pipelines, i.e. `from_specified_markers` and `map_to_on_the_fly_markers`. I am using v1.3.6 of the codebase.
However, the results are not accurate. For concordance with a previous attempt, I ran the same sample set through the online MapMyCells, which identified all major neuronal and non-neuronal cell types in the dataset. The online version of the tool uses v1.3.1 of the codebase. My attempt with `map_to_on_the_fly_markers` yielded 14 class assignments, with one class being over-represented (> 50%). The immune class was absent from the list. Examining the marker genes from this run listed 4 non-specific markers across all class and some of the subclass distinctions.
Since the online version uses a pre-calculated marker lookup table, I also ran the `from_specified_markers` pipeline. Although specifying the query dataset was an option in the marker gene discovery, I chose to use only the reference set to emulate the online mapping version. The results from this run fared even worse than `map_to_on_the_fly_markers`, with all cells being assigned to one class and only one gene listed in the marker gene set. This gene is one of the 4 genes from the `map_to_on_the_fly_markers` run. Not surprisingly, these genes are not present in the marker gene list for each class in the results from the online version of the tool.
Both pipelines print a long list of warning messages to the console of the type: `UserWarning: parent node 'CCN20230722_CLAS/CS20230722_CLAS_01' had too few markers in query set; augmenting with markers from ['None']`
I am not sure if the online tool also generates a similar log file, but it is not included in the downloaded file set.
By selecting only a subset of the regions, are any assumptions for marker gene selection and/or mapping being broken? I am not sure if the difference in code version is a contributing factor to the big change in results, but if v1.3.1 of the code is available with the `--do_pruning True` option turned on, I would like to re-run my samples against it. Any insights or suggestions you have will be much appreciated!
My initial thought is that using your downsampled cell_metadata.csv file to find markers means there are too few cells in the various cell types to find statistically significant marker genes. I don't have a hard-and-fast number representing what I mean by "too few." One thing to try would be to construct a cell_metadata.csv file as follows:
1) Find all of the cluster_alias values that exist in your region of interest (the spatial region by which you originally subsetted cell_metadata.csv)
2) Take all the cells across the entire dataset that have a cluster_alias in the set you found in (1) and concatenate them into a cell_metadata.csv file
This way, you are still mapping to a taxonomy that is limited to the cell types represented in your region of interest, but you are using the full statistical power of the whole brain data to find the marker genes for discriminating between them.
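The two steps above could be sketched with pandas along these lines (the `cluster_alias` and `region_of_interest_acronym` column names match the slicing described in this thread; treat this as an illustration, not a vetted script):

```python
import pandas as pd

def roi_cluster_metadata(cell_metadata: pd.DataFrame,
                         roi: str) -> pd.DataFrame:
    """Steps (1) and (2): keep every cell, brain-wide, whose
    cluster_alias occurs at least once in the chosen
    region_of_interest_acronym."""
    # (1) cluster_alias values present in the region of interest
    roi_clusters = set(
        cell_metadata.loc[
            cell_metadata['region_of_interest_acronym'] == roi,
            'cluster_alias'
        ]
    )
    # (2) every cell across the whole brain assigned to those clusters
    return cell_metadata[cell_metadata['cluster_alias'].isin(roi_clusters)]

# usage sketch:
# full = pd.read_csv('cell_metadata.csv')
# roi_cluster_metadata(full, 'HIP').to_csv(
#     'cell_metadata_roi_clusters.csv', index=False)
```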
I guess it is possible that there is no quantitative difference between what I just described and what you are already doing (maybe the cluster_alias values from (1) above are already well-segregated into your region of interest). Is there an appreciable difference between the number of cells in the modified cell_metadata.csv that I am proposing and what you are already using?
I think what I hypothesized above might be correct. Without knowing exactly how you are slicing the data, I sliced the cell_metadata.csv file by `region_of_interest_acronym`. Below is what I found. The "just cells in roi" line lists how many cells have that `region_of_interest_acronym` and, taking only those cells, the median number of cells per cluster represented in that ROI. The "all cells in represented clusters" line shows the same statistics for my proposal (take all of the `cluster_alias` values represented in the ROI, then loop back over the dataframe, grabbing every cell across the whole brain that is in those clusters). In all cases, the difference between the two schemes is one or two orders of magnitude in the number of cells and (more importantly for the statistical power of marker gene selection) one or two orders of magnitude in the median cells per cluster.
```
roi ACA
just cells in roi: 1.03e+05 cells; median 40.0 cells per cluster
all cells in represented clusters: 2.24e+06 cells; median 1254.5 cells per cluster
roi AI
just cells in roi: 9.90e+04 cells; median 20.0 cells per cluster
all cells in represented clusters: 2.38e+06 cells; median 907.0 cells per cluster
roi AUD
just cells in roi: 7.06e+04 cells; median 32.0 cells per cluster
all cells in represented clusters: 2.21e+06 cells; median 1423.5 cells per cluster
roi AUD-TEa-PERI-ECT
just cells in roi: 2.69e+04 cells; median 7.0 cells per cluster
all cells in represented clusters: 2.26e+06 cells; median 1359.0 cells per cluster
roi CB
just cells in roi: 1.82e+05 cells; median 5.0 cells per cluster
all cells in represented clusters: 1.14e+06 cells; median 166.0 cells per cluster
roi CTXsp
just cells in roi: 1.22e+05 cells; median 9.0 cells per cluster
all cells in represented clusters: 2.23e+06 cells; median 424.0 cells per cluster
roi ENT
just cells in roi: 1.10e+05 cells; median 19.5 cells per cluster
all cells in represented clusters: 2.10e+06 cells; median 783.5 cells per cluster
roi HIP
just cells in roi: 1.76e+05 cells; median 25.0 cells per cluster
all cells in represented clusters: 1.62e+06 cells; median 474.0 cells per cluster
roi HY
just cells in roi: 2.62e+05 cells; median 45.0 cells per cluster
all cells in represented clusters: 1.63e+06 cells; median 147.0 cells per cluster
roi LSX
just cells in roi: 5.38e+04 cells; median 6.0 cells per cluster
all cells in represented clusters: 1.57e+06 cells; median 263.5 cells per cluster
roi MB
just cells in roi: 3.67e+05 cells; median 51.0 cells per cluster
all cells in represented clusters: 1.72e+06 cells; median 109.0 cells per cluster
roi MO-FRP
just cells in roi: 7.14e+04 cells; median 24.0 cells per cluster
all cells in represented clusters: 2.38e+06 cells; median 1372.0 cells per cluster
roi MOp
just cells in roi: 2.48e+05 cells; median 56.5 cells per cluster
all cells in represented clusters: 2.41e+06 cells; median 1343.5 cells per cluster
roi MY
just cells in roi: 1.93e+05 cells; median 33.5 cells per cluster
all cells in represented clusters: 1.22e+06 cells; median 60.0 cells per cluster
roi OLF
just cells in roi: 2.81e+05 cells; median 21.0 cells per cluster
all cells in represented clusters: 2.25e+06 cells; median 424.0 cells per cluster
roi P
just cells in roi: 1.44e+05 cells; median 16.0 cells per cluster
all cells in represented clusters: 1.49e+06 cells; median 71.0 cells per cluster
roi PAL
just cells in roi: 1.08e+05 cells; median 11.0 cells per cluster
all cells in represented clusters: 1.59e+06 cells; median 159.0 cells per cluster
roi PL-ILA-ORB
just cells in roi: 1.06e+05 cells; median 29.0 cells per cluster
all cells in represented clusters: 2.40e+06 cells; median 991.0 cells per cluster
roi RHP
just cells in roi: 1.02e+05 cells; median 19.0 cells per cluster
all cells in represented clusters: 2.25e+06 cells; median 730.5 cells per cluster
roi RSP
just cells in roi: 1.12e+05 cells; median 31.0 cells per cluster
all cells in represented clusters: 2.35e+06 cells; median 1299.0 cells per cluster
roi SS-GU-VISC
just cells in roi: 1.03e+05 cells; median 27.5 cells per cluster
all cells in represented clusters: 2.47e+06 cells; median 1217.5 cells per cluster
roi SSp
just cells in roi: 7.59e+04 cells; median 20.0 cells per cluster
all cells in represented clusters: 2.26e+06 cells; median 1928.0 cells per cluster
roi STRd
just cells in roi: 5.56e+04 cells; median 6.0 cells per cluster
all cells in represented clusters: 1.51e+06 cells; median 772.5 cells per cluster
roi STRv
just cells in roi: 5.36e+04 cells; median 5.0 cells per cluster
all cells in represented clusters: 1.52e+06 cells; median 340.0 cells per cluster
roi TEa-PERI-ECT
just cells in roi: 5.92e+04 cells; median 19.0 cells per cluster
all cells in represented clusters: 2.28e+06 cells; median 1101.0 cells per cluster
roi TH
just cells in roi: 2.61e+05 cells; median 9.0 cells per cluster
all cells in represented clusters: 1.61e+06 cells; median 190.0 cells per cluster
roi VIS
just cells in roi: 3.22e+05 cells; median 70.0 cells per cluster
all cells in represented clusters: 2.42e+06 cells; median 1271.0 cells per cluster
roi VIS-PTLp
just cells in roi: 5.46e+04 cells; median 18.0 cells per cluster
all cells in represented clusters: 2.32e+06 cells; median 1423.5 cells per cluster
roi sAMY
just cells in roi: 1.21e+05 cells; median 21.0 cells per cluster
all cells in represented clusters: 1.53e+06 cells; median 194.5 cells per cluster
```
This seems very plausible. I went back to check the unique values for cluster_alias and the cell count distribution. In one of my region files, ~28% of cluster_alias values have fewer than 10 cells assigned to them, which is why a large number of them came back empty. This leads to two questions:
For the cluster_alias values that did have a sufficient cell count in the reference file, shouldn't these have come out with correct assignments in the query set, i.e. didn't they have enough statistical power to identify the correct marker genes?
Would the cluster_alias values with a low cell count in the region file be considered not representative of the region and so be filtered out? I looked at the cluster_alias names for some of these low-cell-count values, and some of them, based on anatomical location and marker genes, should not exist in the ROI.
If I move forward with your suggestion to add power by borrowing cells from the whole-brain set for the identified cell types, would I essentially subset all the expression matrices to only contain cells of interest and re-run the entire pipeline? Or would it be easier to just rebuild a single expression matrix for each chemistry type and then run the pipeline as usual?
To answer your questions:
> For the cluster_alias values that did have a sufficient cell count in the reference file, shouldn't these have come out with correct assignments in the query set, i.e. didn't they have enough statistical power to identify the correct marker genes?
Marker selection in this code repository is not really concerned with individual cell types. It is concerned with pairs of cell types. We are not asking "what genes uniquely define cell type A in a vacuum." We are asking "what genes are good at distinguishing between cell type A and cell type B" for all possible values of A and B. Thus, if cell type A is well sampled (say, a few thousand cells) but cell type B is poorly sampled (only 5 cells), the code likely won't be able to find markers for that specific (A, B) pair, which will cause problems in the mapping. You can imagine an extreme case in which your taxonomy has 6 cell types in it, but only two of them are well-sampled. The only marker genes you will find will be markers that are useful at distinguishing between those two types and the mapping will fail hopelessly for cells of any other type (and will probably try to assign all of the cells to one of those well sampled types... I haven't really thought about how this failure would play out).
The specific marker-finding algorithm used by the codebase is documented on this page.
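To make the pairwise point concrete, here is a small illustrative sketch (the cluster names, sizes, and threshold are made up) that counts how many (A, B) pairs are handicapped because one member is too thin:

```python
from itertools import combinations

# Hypothetical cluster sizes; flag every pair whose smaller member is
# too sparsely sampled to support pairwise marker selection
# (MIN_CELLS is purely illustrative)
cluster_sizes = {'A': 4200, 'B': 5, 'C': 3100, 'D': 7, 'E': 2, 'F': 2600}
MIN_CELLS = 12

weak_pairs = [
    (a, b) for a, b in combinations(sorted(cluster_sizes), 2)
    if min(cluster_sizes[a], cluster_sizes[b]) < MIN_CELLS
]
n_total = len(list(combinations(cluster_sizes, 2)))
print(f"{len(weak_pairs)} of {n_total} pairs lack a well-sampled member")
```

Even though only three of the six clusters are under-sampled here, most of the pairs are compromised, which is why a few thin clusters can degrade the whole marker set.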
> Would the cluster_alias values with a low cell count in the region file be considered not representative of the region and so be filtered out? I looked at the cluster_alias names for some of these low-cell-count values, and some of them, based on anatomical location and marker genes, should not exist in the ROI.
I am not really qualified to answer this question. My role is principally that of a combination data scientist and software engineer. The question you are asking should probably be asked of the scientists who derived the cell type taxonomy. You can ask on the Allen Community Forum (probably under this topic). Warning: I suspect most of the people who can answer are about to travel to the annual Society for Neuroscience meeting, so the response may not come until next week.
> If I move forward with your suggestion to add power by borrowing cells from the whole-brain set for the identified cell types, would I essentially subset all the expression matrices to only contain cells of interest and re-run the entire pipeline? Or would it be easier to just rebuild a single expression matrix for each chemistry type and then run the pipeline as usual?
You will need to re-run the whole pipeline (because you will need a new `precomputed_stats.h5` file that includes all of the new cells you are including), but you should only need to edit the `cell_metadata.csv` file. The pipeline will only grab cells from the h5ad files that are listed in `cell_metadata.csv`, so editing that file is sufficient to change the population of cells you are working with.
Thank you @danielsf for the detailed answers! I think I was specifically thrown off by the lack of distinguishing marker genes for taxonomy levels that had ample data to support them, as well as by the actual marker genes that were listed, which were all non-specific. When I run the same dataset via the online tool, I get a rather believable list of markers for each of the class and subclass levels. I also verified the expression levels of the top markers in my query dataset. I will follow up with a post on the Allen Community Forum about the actual taxonomy values. I hesitate to change the reference set without proper context about why some taxonomy values were left in regions that are typically not associated with them.
Hello @danielsf,
Thank you for putting together this wonderful tool. For our analysis workflow, we would like to be able to create a region-specific reference based on the ABC WMB taxonomy. I followed the instructions to create the `precomputed_stats.h5` file from an ABC release as documented here. The inputs for the function include all of the original taxonomy CSV files, the region-specific cell metadata file, and a region-specific list of h5ad files. However, I am running into a RuntimeError raised by the `validate_taxonomy_tree` function.

```
RuntimeError: node CS20230722_CLUS_0001 (child of CCN20230722_SUPT:CS20230722_SUPT_0001 -- type <class 'str'>) is not present in the keys at level CCN20230722_CLUS e.g. of keys ['CS20230722_CLUS_0326', 'CS20230722_CLUS_0328', 'CS20230722_CLUS_0033', 'CS20230722_CLUS_0038', 'CS20230722_CLUS_0044']
```

I separately confirmed that the CSV file includes all the parent-child relationships and that level CCN20230722_CLUS does include the node CS20230722_CLUS_0001. I am a bit confused about the setup, since the above function is getting the taxonomy tree input from the h5ad file, which for the ABC releases is devoid of taxonomy information. Is there a workaround for this error? Thank you for your help!