cole-trapnell-lab / garnett

Automated cell type classification
MIT License

No constraints on cell type hierarchy during assignment #30

Closed VPetukhov closed 5 years ago

VPetukhov commented 5 years ago

Hi again,

Using the new options, I have been debugging garnett on my data (human brain) and I see some weird behaviour. Here is the simplified annotation:

>Inhibitory
expressed: GAD1, GAD2
not expressed: SLC17A7, SATB2

>Excitatory
expressed: SLC17A7, SATB2
not expressed: GAD1, GAD2

>VIP
expressed: VIP, TAC3
subtype of: Inhibitory

>VIP SEMA3
expressed: SEMA3E, SEMA3C
subtype of: VIP

And here is a t-SNE of these markers in the dataset: genes

With the following clustering: clusters

Here, clusters 10, 17, and 27 must be "VIP", while 16 and 32 are "VIP SEMA3".

When I run the annotation, garnett assigns most of the Excitatory neurons to "VIP SEMA3", which is an Inhibitory subtype (see the annotation above): annotation

Looking at the code, I see no place where the inheritance of cell types is validated. So I changed a few lines in fill_in_assignments. From:

curr_assignments[Matrix::which(type_res == TRUE)] <- cell_type
level_table[[curr_level]][Matrix::which(new_assignment_mask)] <- cell_type

To:

new_assignment_mask <- (type_res == 1)
if (length(parents) > 1) {
  new_assignment_mask <- new_assignment_mask & (curr_assignments != "Unknown")
}

curr_assignments[Matrix::which(new_assignment_mask)] <- cell_type
level_table[[curr_level]][Matrix::which(new_assignment_mask)] <- cell_type

It did the job, so the new assignment is: annotation_fixed
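To see what the extra mask does, the same logic can be run on toy vectors. This is only a sketch: the data are made up, the variable names mirror the snippet above, and what `parents` actually holds inside garnett is an assumption here (only its length matters for the check).

```r
# Toy stand-ins for the objects used in fill_in_assignments (data are made up).
curr_assignments <- c("Inhibitory", "Unknown", "Inhibitory", "Unknown")
type_res <- c(1, 1, 0, 0)            # 1 = the "VIP" classifier fired on this cell
parents <- c("root", "Inhibitory")   # hypothetical contents; length > 1 triggers the check
cell_type <- "VIP"

new_assignment_mask <- (type_res == 1)
if (length(parents) > 1) {
  # Below the top level: only relabel cells that already carry a label.
  new_assignment_mask <- new_assignment_mask & (curr_assignments != "Unknown")
}

curr_assignments[which(new_assignment_mask)] <- cell_type
print(curr_assignments)
```

In this toy case the second cell triggers the classifier but stays "Unknown", because it had no label from the level above.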

I don't feel that it's the correct fix though, as I'd expect garnett not to apply the classifier for VIP subtypes to Excitatory neurons in the first place. What's your opinion on that?

hpliner commented 5 years ago

Thanks for this example. I had it set up this way because cell subtypes tended to also be marked by their parent types, so I got similar results either way (in fact, adding the above does not alter the results on my test set). However, I think you're right that there are cases where this will be important, and that classification should mimic the conditions under which training occurred (i.e. only on the subset). For the moment, I'm going to implement it as above with slight tweaks, because that takes less overall reworking, but I'll add a rework for better performance to my to-do list for the future.
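For illustration, the "only on the subset" idea could look roughly like the sketch below. `classify_within_parent` is a hypothetical helper written in base R, not a garnett function, and the expression values and threshold classifier are made up.

```r
# Hypothetical helper (not part of garnett): run a child-type classifier only
# on the cells currently assigned to the parent type.
classify_within_parent <- function(assignments, parent, child, classify_fn, features) {
  idx <- which(assignments == parent)       # cells eligible for the subtype
  if (length(idx) == 0) return(assignments)
  hits <- classify_fn(features[idx])        # classifier sees the subset only
  assignments[idx[hits]] <- child
  assignments
}

labels <- c("Inhibitory", "Excitatory", "Inhibitory", "Excitatory")
vip_expr <- c(5, 4, 0, 0)                   # made-up VIP expression per cell

labels <- classify_within_parent(
  labels, "Inhibitory", "VIP",
  classify_fn = function(x) x > 1,          # toy stand-in for a trained classifier
  features = vip_expr
)
print(labels)
```

Here the second cell has high VIP expression but keeps its "Excitatory" label, since the subtype classifier is never applied to it.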

I'm going to do a bit more testing and then I'll push the change.

hpliner commented 5 years ago

Currently implemented