cellatlas / mx

7 stars 2 forks source link

mx_assign.py assigns cell types to cells that do not express the associated marker genes #3

Closed harmn closed 2 months ago

harmn commented 5 months ago

Running a pipeline based on the mx.sh script, I found a number of cell type assignments where the cells did not express any of the associated marker genes, but did express marker genes for other cell types.

I found the same issue for the human commons cell atlas. For instance blood sample CRX102285 has six cells labelled with “Smooth muscle cell” which do not express the corresponding markers CNN1 or TAGLN. https://github.com/cellatlas/human/tree/main/data/blood/CRX102285

Debugging the mx_assign.py script it seems that the cell type GMM components get substantial covariance values for marker genes that belong to other cell types (the means for these marker genes do remain 0). This gives cell types responsibility for unrelated marker genes, which seems unwanted?

To addresses this, I added an expression after line 782 of the mx_assign.py script: diff[:,means[k] == 0] = 0 https://github.com/sbooeshaghi/mx/blob/master/mx/mx_assign.py#L782

In context:

    for k in range(n_components):
        diff = X - means[k]
        diff[:,means[k] == 0] = 0
        covariances[k] = np.dot(resp[:, k] * diff.T, diff) / nk[k]
        covariances[k].flat[:: n_features + 1] += reg_covar

But I am sure there are better ways to address this.

sbooeshaghi commented 5 months ago

Hu @harmn thank you for the feedback. This is a great point- I have implemented the change in mx assign and we will be testing and benchmarking it and making changes as necessary.

harmn commented 5 months ago

minor related issue: if the marker file contains an empty line (just a newline), mx assign will treat this as a separate (nameless) cell type and even assigns some cells to this (even though it does not have marker genes). Easy to avoid, but might indicate that my proposed fix is not sufficient.

sbooeshaghi commented 3 months ago

Hi @harmn, thank you again for raising concern of mx assign performance. We've tested the new implementation of mx assign and the results from our simulations do not change giving us confidence that this improvement is well suited to real world data. We are updating the manuscript and will resubmit to the biorxiv soon. I am curious to know how it performs on your data.

I also sincerely appreciate you providing the snakemake file. I've built off of it so that the whole atlas can be built from the reads (when available). Please take a look as it may be useful for you.

https://github.com/cellatlas/human/blob/main/Snakefile

harmn commented 2 months ago

Great, thanks!