@bdpedigo et al. (I'm afraid I don't have the others' GitHub handles)
Just to quickly follow up on our short discussion just now.
I'm afraid I was a bit scatter-brained and didn't do a good job explaining the question I'm (personally) most keen on, but you actually summed it up quite nicely in a single word: "stability". Basically: what's the highest granularity we can reach while making sure that groups/clusters can still be reliably recovered across data sets? So it's both a matching and a grouping/clustering problem.
Let's say, for example, you have 5 neurons `A`, `B`, `C`, `D` and `E` on FAFB left that fall into two obvious clusters, `(A, B, C)` and `(D, E)`. In a first step you would try to find matches for these 5 neurons in FAFB right and hemibrain (keeping in mind that it might not be a 1:1:1 matching). Once we have that, we can ask whether we see the same clusters in the other two data sets, or whether we see e.g. `(A, B)` and `(C, D, E)` in the hemibrain and `(A, D)` and `(B, C, E)` in FAFB right. Conservation of the clusters/groups supports the view that `(A, B, C)` and `(D, E)` likely represent two cell types. If instead the grouping changes from data set to data set, the neurons are more likely to represent a single cell type `(A, B, C, D, E)`. To my mind, it's critical to include all three data sets in that comparison (e.g. to have a tie-breaker).
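To make the toy example concrete, here is a minimal sketch of one way to quantify "do we see the same clusters?": the plain Rand index, i.e. the fraction of neuron pairs on which two groupings agree (both together or both apart). It is only an illustration, not a proposal for the actual method. It assumes a clean 1:1:1 matching so that neurons in hemibrain and FAFB right can simply be relabelled by their FAFB left match; the function names are made up for this example.

```python
from itertools import combinations


def co_clustered_pairs(clusters):
    """Return the set of unordered neuron pairs that share a cluster."""
    pairs = set()
    for cluster in clusters:
        pairs.update(frozenset(p) for p in combinations(sorted(cluster), 2))
    return pairs


def rand_index(clusters_a, clusters_b):
    """Fraction of neuron pairs on which two groupings agree.

    Assumes both groupings cover the same neurons (1:1 matching).
    """
    neurons = sorted(set().union(*clusters_a))
    all_pairs = [frozenset(p) for p in combinations(neurons, 2)]
    together_a = co_clustered_pairs(clusters_a)
    together_b = co_clustered_pairs(clusters_b)
    agree = sum((p in together_a) == (p in together_b) for p in all_pairs)
    return agree / len(all_pairs)


# Groupings from the example, with neurons named after their FAFB left match.
fafb_left = [{"A", "B", "C"}, {"D", "E"}]
hemibrain = [{"A", "B"}, {"C", "D", "E"}]
fafb_right = [{"A", "D"}, {"B", "C", "E"}]

print(rand_index(fafb_left, hemibrain))   # 0.6
print(rand_index(fafb_left, fafb_right))  # 0.4
```

With only 5 neurons these numbers are of course not meaningful on their own; the point is just that agreement well below 1 in both comparisons would argue against the `(A, B, C)` / `(D, E)` split.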
In practical terms for potential next steps: I'd be very curious to see how the highly granular hemibrain `labels` behave after matching the hemibrain neurons to FAFB left and right. To illustrate with another example: let's say you have five hemibrain mPNs falling into two types (labels) - 2 x `M_lvPNm25` and 3 x `M_lvPNm26` - and you find matches for all 5 in FAFB left and FAFB right. When you then look at the 2 `M_lvPNm25` and 3 `M_lvPNm26` candidates in FAFB left and FAFB right: are they more similar to each other within type (i.e. `M_lvPNm25 <-> M_lvPNm25` and `M_lvPNm26 <-> M_lvPNm26`) than across type, or do you see cases where a putative `M_lvPNm25` match is actually more similar to an `M_lvPNm26` candidate?
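As a rough sketch of the within-type vs. across-type check: given a symmetric similarity matrix for the matched candidates in one data set and the labels they inherit from their hemibrain matches, compare the mean similarity within and across types, and flag candidates whose most similar neighbour carries a different label. The neuron ids and the similarity values below are invented purely for illustration, the similarity measure itself is left open, and the function names are hypothetical.

```python
from statistics import mean


def within_vs_across(ids, labels, sim):
    """Mean pairwise similarity within a type and across types."""
    within, across = [], []
    for i in range(len(ids)):
        for j in range(i + 1, len(ids)):
            same_type = labels[ids[i]] == labels[ids[j]]
            (within if same_type else across).append(sim[i][j])
    return mean(within), mean(across)


def cross_type_nearest(ids, labels, sim):
    """Pairs (neuron, nearest neighbour) where the two labels differ."""
    flagged = []
    for i, nid in enumerate(ids):
        others = [j for j in range(len(ids)) if j != i]
        best = max(others, key=lambda j: sim[i][j])
        if labels[ids[best]] != labels[nid]:
            flagged.append((nid, ids[best]))
    return flagged


# Made-up candidates in one data set, labelled via their hemibrain match.
ids = ["n1", "n2", "n3", "n4", "n5"]
labels = {"n1": "M_lvPNm25", "n2": "M_lvPNm25",
          "n3": "M_lvPNm26", "n4": "M_lvPNm26", "n5": "M_lvPNm26"}
# Made-up symmetric similarity scores (rows/columns follow `ids`).
sim = [
    [1.0, 0.6, 0.2, 0.1, 0.2],
    [0.6, 1.0, 0.7, 0.2, 0.1],
    [0.2, 0.7, 1.0, 0.5, 0.4],
    [0.1, 0.2, 0.5, 1.0, 0.8],
    [0.2, 0.1, 0.4, 0.8, 1.0],
]

print(within_vs_across(ids, labels, sim))
print(cross_type_nearest(ids, labels, sim))  # [('n2', 'n3'), ('n3', 'n2')]
```

In this invented example the types look well separated on average (roughly 0.58 within vs. 0.25 across), yet `n2` and `n3` are each other's nearest neighbour despite carrying different labels, which is exactly the kind of case I'd like to know about.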
Rephrasing the above: leveraging not just one but three data sets, do you see any indication that e.g. `M_lvPNm25` and `M_lvPNm26` should really have the same label? Or conversely: maybe `M_lvPNm25` actually breaks into multiple groups in FAFB left and right.
I hope this makes some sense. As Greg mentioned, in our recent preprint I used a rather naive approach with only across- but not within-dataset matches to try and address this, but you guys are obviously much more experienced with that kind of thing. I also imagine that it will be difficult to get clear-cut answers to the above questions, but maybe you can think of a way to get something like a "stability score" - i.e. something that describes how well a given group can be recovered in another dataset.
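Just to illustrate what I mean by a "stability score", here is one possible (and admittedly naive) definition, not something we have actually used: for a given group, take the best Jaccard overlap with any cluster in each other dataset and average over datasets. It again assumes neurons in the other datasets have already been mapped onto the reference neurons via 1:1 matches; `group_stability` is a made-up name.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    return len(a & b) / len(a | b)


def group_stability(group, other_clusterings):
    """Mean, over the other datasets, of the best Jaccard overlap
    between `group` and any cluster found in that dataset."""
    best_per_dataset = [
        max(jaccard(group, cluster) for cluster in clusters)
        for clusters in other_clusterings
    ]
    return sum(best_per_dataset) / len(best_per_dataset)


# Toy groupings from the A-E example, named after the FAFB left neurons.
hemibrain = [{"A", "B"}, {"C", "D", "E"}]
fafb_right = [{"A", "D"}, {"B", "C", "E"}]

print(round(group_stability({"A", "B", "C"}, [hemibrain, fafb_right]), 2))  # 0.58
print(round(group_stability({"D", "E"}, [hemibrain, fafb_right]), 2))       # 0.5
```

A score of 1 would mean the group is recovered exactly in every other dataset. How to handle many:1 matches and what counts as "low" are open questions in this sketch.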