Function that updates metadata

david-priest commented 1 year ago

I know this has been addressed before, but it would be great to have a CATALYST function that updates the metadata in an sce based on an imported spreadsheet that may contain additional sample_id rows that are no longer present in the sce (because samples are often filtered out of an sce). I have written code that can iterate through a list of desired metadata columns and update them at the base level in the sce (the colData?), but this does not update the associated 'metafiles' (what you get from experiment_info(sce) or sample_ids(sce)). Since these metafiles are so important for other functions like filterSCE() and diffcyt(), it would be great to have a solution which handles all of the updating at once (based on the sample ids actually in the sce rather than those in the imported spreadsheet).

I guess I've kind of answered my own question, in that I can filter the dataframe used to update the metadata by the sample IDs in the sce, and then attach that to the sce, but I guess I'm still a bit worried this will cause errors in diffcyt() etc.
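The filtering step described above can be sketched in base R as follows. This is only an illustration under stated assumptions: a plain data.frame `cd` stands in for `colData(sce)`, and the toy tables, the names `md`, `md_sub` and `cd_new`, and the columns `age` and `sex` are hypothetical, not part of CATALYST.

```r
# Stand-in for colData(sce): one row per cell (hypothetical toy data)
cd <- data.frame(
  sample_id  = factor(c("s1", "s1", "s3", "s3", "s3")),
  cluster_id = c(1, 2, 1, 1, 2)
)

# Imported spreadsheet; "s2" was filtered out of the sce earlier
md <- data.frame(
  sample_id = c("s1", "s2", "s3"),
  age       = c(34, 51, 47),
  sex       = c("F", "M", "F")
)

# Sample IDs that are actually present in the object
present <- unique(as.character(cd$sample_id))

# Keep only the spreadsheet rows for samples that are still present
md_sub <- md[md$sample_id %in% present, , drop = FALSE]

# Fail early if a remaining sample has no row in the spreadsheet
stopifnot(length(setdiff(present, md_sub$sample_id)) == 0)

# Expand the per-sample rows to per-cell rows and attach the new columns
i <- match(as.character(cd$sample_id), md_sub$sample_id)
j <- setdiff(names(md_sub), "sample_id")
cd_new <- cbind(cd, md_sub[i, j, drop = FALSE])
```

With a real object, the last step would assign back to `colData(sce)` instead of creating `cd_new`; whether the experiment info then stays in sync is exactly the open question here.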

Kind Regards, David Priest

HelenaLC commented 1 year ago

Yes, I think having a function that updates experiment_info and any colData stemming from it makes sense, and I can certainly implement this.
However, I dislike the idea of doing this via an additional spreadsheet. Such a sheet is already used when the SCE is initialised with prepData. I can see that it would make sense to be able to update the object at any point in time (e.g., after filtering), but adding metadata can easily be done in base R, which is i) less error-prone and ii) spares CATALYST from having to maintain such a function and cover all possible cases. Maybe I misunderstood what you meant by spreadsheet, but I imagine something like this would be entirely sufficient and flexible enough for whatever users want to add to their object:
```
# some table of new metadata
df <- ... 
# match sample IDs
i <- match(sce$sample_id, df$sample_id) 
# specify which columns to keep
j <- setdiff(names(df), "sample_id") 
# add new to existing cell metadata
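# (drop = FALSE keeps a single selected column as a data frame, so its name is retained)
colData(sce) <- cbind(colData(sce), df[i, j, drop = FALSE])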
# update 'experiment_info' internally (to be implemented)
sce <- updateSCE(sce)
```

david-priest commented 1 year ago

Thanks. For large clinical studies, the metadata can have >100 columns and it is often updated/corrected. Here, I prefer to initialise the sce with a minimal metadata spreadsheet. For the updating spreadsheet, as long as it has a row for all the original sample_ids, it works well.

I guess it's also possible to side-step this issue by avoiding the stored experiment info altogether. For example, in filterSCE() I changed the following line to use a custom version of ei() (called ei2()) that generates the experiment info from the updated colData, using match(): `if (nrow(cdf) != ncol(x) && !is.null(ei <- ei2(x)))`. This also seems to get around the issue where samples are sometimes dropped when there is a blank field in the metadata.
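For illustration, one way such a helper could look is sketched below. This is not the ei2() actually used above, whose code is not shown in this thread. It assumes a plain data.frame `cd` standing in for `colData(sce)`, a `sample_id` column, and an `n_cells` count column in the result; the `by` argument and the toy data are hypothetical.

```r
# Hypothetical sketch: rebuild a per-sample table from per-cell metadata
ei2 <- function(cd, by = "sample_id") {
  ids <- as.character(cd[[by]])
  # A column is sample-level if it takes a single value within every sample;
  # blank (NA) fields count as a value, so they do not exclude a sample
  is_const <- vapply(names(cd), function(nm) {
    all(tapply(as.character(cd[[nm]]), ids,
               function(v) length(unique(v)) == 1))
  }, logical(1))
  # match() returns the first cell of each sample
  first <- match(unique(ids), ids)
  ei <- cd[first, is_const, drop = FALSE]
  # Number of cells per sample, in the same order as the rows of 'ei'
  ei$n_cells <- as.integer(table(ids)[unique(ids)])
  rownames(ei) <- NULL
  ei
}

# Toy per-cell table; 'condition' is blank (NA) for sample "s3"
cd <- data.frame(
  sample_id  = factor(c("s1", "s1", "s3", "s3", "s3")),
  condition  = c("A", "A", NA, NA, NA),
  cluster_id = c(1, 2, 1, 1, 2)
)
ei <- ei2(cd)
```

In this toy case `ei` has one row per sample, keeps `sample_id` and `condition`, drops the cell-level `cluster_id`, and retains sample "s3" despite its blank field.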

HelenaLC / CATALYST

Function that updates metadata #303