drighelli / SpatialExperiment

55 stars 20 forks source link

Duplicated columns from spatialData in colData when adding new columns with scater #66

Closed lmweber closed 3 years ago

lmweber commented 3 years ago

We are currently getting duplicated columns in colData when adding new columns in colData using external tools such as scater::addPerCellQC.

This is because colData includes all the spatialData columns at the end by default. So these are then added a second time when new columns have been added with scater::addPerCellQC or other tools.

My hack to remove them right now is to run the following afterwards:

colData(spe) <- colData(spe)[, !(duplicated(colnames(colData(spe))) | duplicated(colnames(colData(spe)), fromLast = TRUE))]

which removes all the spatialData columns from colData, and then they are automatically re-added again one final time at the end.

I will think about how best to address this, and we can discuss next week.

One idea could be to have an argument colData(spe, spatialData = FALSE) by default, so the spatialData columns are only returned inside colData if the user wants them. If they are off by default, then I think they won't be duplicated.

We could also have two arguments colData(spe, spatialData = FALSE, spatialCoords = FALSE) for completeness.

lmweber commented 3 years ago

I did some more tests and seems this happens whenever a new column is added to colData. So when a new column is added to colData, an additional duplicate copy of the spatialData columns are automatically added at the end.

I think the colData(spe, spatialData = FALSE) idea would most likely solve it.

From a user perspective, you could then still use colData(spe, spatialData = TRUE, spatialCoords = TRUE) to pipe to ggplot for plotting.

drighelli commented 3 years ago

Thanks Lukas,

we surely need to talk about this!

lmweber commented 3 years ago

Ok as discussed today, I tried to add a colData show method with these additional arguments. However this isn't working yet - I think due to NAMESPACE issues and the way it interacts with the SummarizedExperiment colData show method.

@drighelli could you please have a look at what I have done here, and let me know if you can see what I have done wrong? I have also tried removing @importFrom SummarizedExperiment colData in other functions for the NAMESPACE.

The code is in this commit: https://github.com/drighelli/SpatialExperiment/commit/64dd16573f5c47a88108b569bbf5d10c95b1bbce

Here is also a minimal example showing the duplicated columns issue:

library(SpatialExperiment)
example(read10xVisium)
colData(spe)
colData(spe)$testing <- rep(1, ncol(spe))
colData(spe)  ## shows duplicated spatialData columns
drighelli commented 3 years ago

Thanks Lukas,

I don't know why, but I'm getting a weird error when creating an SpE with your code, but I found the problem in our master branch.

This is the error: it comes up when I try to create a SpE also if I start from the master branch and add your code again...

(spe <- read10xVisium(samples, sample_ids, type = "sparse", data = "raw", images = "lowres", load = FALSE)) Error: C stack usage 7969600 is too close to the limit

The problem is due to this assignment, so I think that we need to patch this.

We store the spatialData and cbind them to the previously stored colData...

drighelli commented 3 years ago

ok the error can be solved by using the x@colData instead of colData(x), if you call colData(x) inside the colData method, it becomes a recursive functions and it goes in stack overflow.

This is due to the fact that the colData method is invoked during the construction of the object.

The problem now is to avoid problems, I'm getting errors when trying to retrieve the spatialData because it doesn't store them into the colData

drisso commented 3 years ago

I wonder if it should be callNextMethod() rather than x@colData...

To elaborate a bit: callNextMethod() inside the colData() method definition should call the colData method for SingleCellExperiment, which obviously does not return the spatialData.

drighelli commented 3 years ago

the problem is that the spatialData are stored inside the colData, so the callNextMethod will return thex@colData that contains the spatialData

Anyway, I tried and it doesn't work :)

drisso commented 3 years ago

But what about calling callNextMethod instead of colData here: https://github.com/drighelli/SpatialExperiment/blob/64dd16573f5c47a88108b569bbf5d10c95b1bbce/R/SpatialExperiment-colData.R#L115

I'm just guessing, but since you mentioned that recursion is the problem...

drighelli commented 3 years ago

Thanks Davide, I made it locally, it solves the recursion problem, but still doesn't solve the colData duplication...

lmweber commented 3 years ago

Great, thanks for finding the recursion error!

I will think some more about whether this solution can be used to fix the duplicated columns issue (e.g. by adding spatialData = TRUE or FALSE wherever colData() is called in other functions), or whether we need some other solution.

drighelli commented 3 years ago

should be fixed here

lmweber commented 3 years ago

Hmm I just tested this, and it doesn't seem to work. When I do the following, I still get duplicated columns from spatialData:

example(read10xVisium)
colData(spe)
colData(spe)$testing <- 1
colData(spe)

Output before adding new column:

> colData(spe)
DataFrame with 99 rows and 4 columns
                     sample_id in_tissue array_row array_col
                   <character> <logical> <integer> <integer>
AAACAACGAATAGTTC-1    section1     FALSE         0        16
AAACAAGTATCTCCCA-1    section1      TRUE        50       102
AAACAATCTACTAGCA-1    section1      TRUE         3        43
AAACACCAATAACTGC-1    section1      TRUE        59        19
AAACAGAGCGACTCCT-1    section1      TRUE        14        94
...                        ...       ...       ...       ...

Output after adding new column:

> colData(spe)
DataFrame with 99 rows and 8 columns
                     sample_id in_tissue array_row array_col   testing in_tissue array_row array_col
                   <character> <logical> <integer> <integer> <numeric> <logical> <integer> <integer>
AAACAACGAATAGTTC-1    section1     FALSE         0        16         1     FALSE         0        16
AAACAAGTATCTCCCA-1    section1      TRUE        50       102         1      TRUE        50       102
AAACAATCTACTAGCA-1    section1      TRUE         3        43         1      TRUE         3        43
AAACACCAATAACTGC-1    section1      TRUE        59        19         1      TRUE        59        19
AAACAGAGCGACTCCT-1    section1      TRUE        14        94         1      TRUE        14        94
...                        ...       ...       ...       ...       ...       ...       ...       ...
lmweber commented 3 years ago

However when I change the new block of code to check spatialDataNames instead of spatialCoordsNames I think it works.

Will test it some more and push some additional commits into this branch.

lmweber commented 3 years ago

I think this commit fixes it: 62014b389dc03a23168de33bed5f7a30d1c0606f

lmweber commented 3 years ago

I added a unit test too: 80436b886ad56dda15a6695016b00bed043a24e3

drighelli commented 3 years ago

fixed in https://github.com/drighelli/SpatialExperiment/pull/68