Closed rbutleriii closed 9 months ago
Bit of a follow-up, prior to adding the metadata, I aggregated the expression matrix for my data as instructed in the Lung CosMx tutorial, and the columns of the matrix are already sorted:
└──Spatial unit "cell"
├──Feature type "rna"
│ └──Expression data "raw" values:
│ An object of class exprObj : "raw"
│ spat_unit : "cell"
│ feat_type : "rna"
│ provenance: cell
│
│ contains:
│ 950 x 8557 sparse Matrix of class "dgCMatrix"
│
│ 6330403K07Rik . 1 . . . . . . . . 2 . . ......
│ Abca2 . . . . 1 . . . . . . . . ......
│ Abi1 . . . . 1 . . . . . 1 . 2 ......
│
│ ........suppressing 8544 columns and 944 rows in show(); maybe adjust options(max.print=, width=)
│
│ Zeb1 . . . . . . . . . . 1 . . ......
│ Zfyve16 . . . . . . . . . . 2 . . ......
│ Zwint . . 1 1 . . . . . . 11 . . ......
│
│ First four colnames:
│ fov011-cell_1 fov011-cell_10
│ fov011-cell_100 fov011-cell_101
│
└──Feature type "neg_probe"
└──Expression data "raw" values:
An object of class exprObj : "raw"
spat_unit : "cell"
feat_type : "neg_probe"
provenance: cell
contains:
10 x 8557 sparse Matrix of class "dgCMatrix"
NegPrb1 . . . . . . . . . . . . . ......
NegPrb10 . . . . . . . . . . . . . ......
NegPrb2 . . . . . . . . . . . . . ......
........suppressing 8544 columns and 4 rows in show(); maybe adjust options(max.print=, width=)
NegPrb7 . . . . . . . . . . . . . ......
NegPrb8 . . . . . . . . . . . . . ......
NegPrb9 . . . . . . . . . . . . . ......
First four colnames:
fov011-cell_1 fov011-cell_10
fov011-cell_100 fov011-cell_101
If I grab the cosmx mini dataset, the expression matrix columns there are not sorted, so it may be down to overlapToMatrix()?
Hi @rbutleriii
Thank you for reporting and looking into this issue in detail!
From what I have tracked down, the initial sorting happens after a call to data.table::dcast() (location) when converting from the table of overlapped features to a matrix.
There is an additional sort() that does not do much, but happens in the giotto method for overlapToMatrix() right before adding the matrix to the giotto object. (location)
These are likely causing the issue by propagating the ordering change into the metadata after addCellMetadata().
A sort() during overlapToMatrix() was intended because of the way that aggregating works: the results are ordered based on which feature points in the giottoPolygon are overlapped first, so the row and col orders can be expected to look random. There is no template for ordering to follow when the expression matrix is generated de novo, so sorting was the next best option.
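As a rough illustration of the dcast() step (a toy sketch, not the Giotto code; the `overlaps` table and its contents are made up for this example), casting a long table of overlaps to a wide table returns rows and columns in sorted ID order rather than in order of first appearance:

```r
library(data.table)

# Hypothetical long-format overlap table: one row per detected feature point.
overlaps <- data.table(
  feat_ID = c("Zeb1", "Abi1", "Zeb1", "Abca2"),
  cell_ID = c("fov011-cell_2", "fov011-cell_10", "fov011-cell_1", "fov011-cell_2")
)

# Count points per feature/cell pair, then cast to a feature x cell table.
counts <- overlaps[, .N, by = .(feat_ID, cell_ID)]
wide <- dcast(counts, feat_ID ~ cell_ID, value.var = "N", fill = 0L)

# Convert to a matrix, using the feat_ID column as row names.
expr <- as.matrix(wide, rownames = "feat_ID")

rownames(expr)  # sorted, not in order of appearance
colnames(expr)  # sorted as plain strings, so cell_10 comes before cell_2
```

Note that the column sort is a plain string sort, which matches the "fov011-cell_1, fov011-cell_10, fov011-cell_100" ordering seen in the printout above.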
Normally the metadata is created by a call to initialize() based on the paired expression information when the matrix is first added to the giotto object, but something in the convenience function may have delayed the sorting update until you called addCellMetadata().
Assuming the above is the issue:
An immediate solution would be to manually apply the desired col and row order from the metadata to the matching expression matrix and then run initialize() on the giotto object.
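The reordering part of that workaround could look roughly like the sketch below. It uses plain base R on toy stand-ins (`expr` and `meta` are hypothetical names, not objects pulled from a giotto object); extracting the matrix, putting it back, and running initialize() are not shown.

```r
# Toy stand-ins (hypothetical): `expr` for the expression matrix,
# `meta` for the cell metadata table.
expr <- matrix(
  1:6, nrow = 2,
  dimnames = list(
    c("Abca2", "Zeb1"),
    c("fov011-cell_1", "fov011-cell_10", "fov011-cell_2")
  )
)
meta <- data.frame(
  cell_ID = c("fov011-cell_2", "fov011-cell_1", "fov011-cell_10"),
  cluster = c("a", "b", "c")
)

# Fail early if the two ID sets differ, then reorder columns to follow meta.
stopifnot(setequal(colnames(expr), meta$cell_ID))
expr_reordered <- expr[, match(meta$cell_ID, colnames(expr)), drop = FALSE]

# TRUE when the column order now follows the metadata order.
identical(colnames(expr_reordered), meta$cell_ID)
```

Indexing by match() rather than by position keeps each column's values attached to its cell ID, so only the order changes.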
The actual fix could be implementing a check to match the expression row and col order to the expression information if they differ, and/or adding a replacement method for spatIDs() to intentionally apply it across the object. Or maybe just make it so that the ordering of metadata and expression are independent. Any suggestions would be welcome, and thank you again for catching this.
Best, George
Hmm, having it sorted isn't particularly problematic. I would almost always prefer to add metadata with a cell_ID key column, as it is more explicit. The only other option that comes to mind would be for it to do a natural sort by default via mixedsort(), since the cell_IDs will almost certainly be some combination of fov123-cell456.
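For reference, a small comparison of the two orderings on made-up IDs (this assumes the gtools package is installed; the ID strings are illustrative only):

```r
library(gtools)

# Made-up cell IDs in the fov/cell naming pattern.
ids <- c(
  "fov011-cell_10", "fov011-cell_1", "fov011-cell_101",
  "fov011-cell_2", "fov011-cell_100"
)

# Plain string sort: cell_10 and cell_100 land before cell_2.
plain <- sort(ids)

# Natural sort: embedded numbers are compared as numbers.
natural <- mixedsort(ids)

print(plain)
print(natural)
```

With the natural sort, cell_2 comes directly after cell_1 instead of after cell_101.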
That is a good idea, thank you.
I am pushing GiottoClass v0.1.3 and GiottoUtils v0.1.3 that should have the following fixes:
gtools::mixedsort() is now used for cell_ID and feat_ID sorts in GiottoClass.
addCellMetadata() and addFeatMetadata() now check for the starting ID ordering, so no more unexpected sorts should happen.
Importantly, this also means that the ordering of metadata may be different from the col/row ordering of the expression matrix (and the outputs from the giotto object method for spatIDs() and featIDs() that indirectly pull from the expression matrix).
Our current plan moving forward with spatIDs() and featIDs() is that they should be regarded as a minimum set of IDs to expect across all slots, but it is possible for individual slots to either have more IDs if they contain additional information or be in a different order.
In short: it is definitely preferred to add metadata with the cell_ID key column and by_column = TRUE. But omitting that key column or appending a vector or factor instead of a data.frame is also safer now.
Describe the Error
For some reason, with my CosMx data the addCellMetadata command triggers a sort of the cell rows in the giotto object (it doesn't do this with the starmap mini, visium mini, or cosmx mini datasets). It looks like under the hood it is joined via a data.table merge, which has the default setting sort=T, but usually follows the ordering in by.x cell_ID if all.x=TRUE. The reordering appears to be okay, as the giotto object also shifts the ordering of the column names in the expression matrix to match, but if someone is adding successive cell metadata without a cell_ID column as in method 1, the row ordering will now be different.
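To show the data.table behavior in isolation (a toy sketch with made-up tables; it demonstrates merge() on data.tables only and is not a claim about what addCellMetadata does internally), the default sort = TRUE reorders rows by the key column, while sort = FALSE keeps the row order of x:

```r
library(data.table)

# Made-up existing metadata (x) and new annotations (y).
x <- data.table(cell_ID = c("fov011-cell_2", "fov011-cell_10", "fov011-cell_1"))
y <- data.table(
  cell_ID = c("fov011-cell_1", "fov011-cell_2", "fov011-cell_10"),
  cluster = c("a", "b", "c")
)

# Default: result rows are sorted by the `by` column.
merged_sorted <- merge(x, y, by = "cell_ID", all.x = TRUE)

# sort = FALSE: result rows follow the original order of x.
merged_keep <- merge(x, y, by = "cell_ID", all.x = TRUE, sort = FALSE)

merged_sorted$cell_ID
merged_keep$cell_ID
```

In both cases the cluster values stay attached to the right cell_ID; only the row order differs, which matters when later metadata is appended by position.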
I can share the giotto object link to the compressed folder, but it is over 100MB with just two fovs.
...
To Reproduce
Expected behavior
Rows in the same order as with the cosmx mini dataset
System Information