Keep celltype annotation columns when merging SCEs

sjspielman commented 10 months ago

Closes #258

This PR updates SCE merging to ensure that any cell type annotation columns are retained in the colData.

In our current implementation, we pass in a vector retain_coldata_cols to keep any colData columns we want kept. But, we also create these columns if they are missing and populate them with NAs. Therefore, one does not simply add the cell type columns into this vector. If there are no cell type columns, we do not want to end up creating them!

So, I created a new argument in merge_sce_list celltype_coldata_cols with a default value of all the possible cell type columns that one might encounter in the colData slot. This argument gets used in the the subfunction prepare_merge_sce, which has a default of NULL for this argument. Then:

If it's NULL, just ignore it
Otherwise, _after filling in the missing retain_coldata_cols with NA_, tack on the celltype columns which actually exist into retain_coldata_cols. Use this full vector to subset colData.

I updated some docs along the way, and I added some tests ✅

A handful of recreational revisions in the tests as a 🍬 too - stop using sort() to compare vectors and start using test_setequal.

sjspielman commented 10 months ago

I'll also note - the other option here, rather than adding this argument, is to change how we handle retain_coldata_cols. Rather than creating & filling in missing columns with NA, we could just ignore them. If we update with* that approach, we can then make this parallel change in scpca-nf as described here: https://github.com/AlexsLemonade/scpcaTools/issues/258#issuecomment-1895922396

edit: Or yet another approach! We could be less prescriptive with the name for celltype_coldata_cols but keep the argument as implemented. And then, pass in cell type column names in scpca-nf, again as linked ^

allyhawkins commented 10 months ago

In our current implementation, we pass in a vector retain_coldata_cols to keep any colData columns we want kept. But, we also create these columns if they are missing and populate them with NAs. Therefore, one does not simply add the cell type columns into this vector. If there are no cell type columns, we do not want to end up creating them!

I still don't think we need to change anything in the function. I would change the vector that actually gets passed in to the function in scpca-nf to only include the cell type columns that are present. Otherwise you don't include them and they won't get created. That makes it so you don't have anything to do here.

sjspielman commented 10 months ago

I would change the vector that actually gets passed in to the function in scpca-nf to only include the cell type columns that are present. Otherwise you don't include them and they won't get created. That makes it so you don't have anything to do here.

I still don't think this is true. For example, we might have the situation where some but not all of the libraries in a given project have cell types. If that were to happen, we'd end up with NA cell types. This has most potential to happen I think for submitters.

jashapiro commented 10 months ago

In our current implementation, we pass in a vector retain_coldata_cols to keep any colData columns we want kept. But, we also create these columns if they are missing and populate them with NAs. Therefore, one does not simply add the cell type columns into this vector. If there are no cell type columns, we do not want to end up creating them!

I still don't think we need to change anything in the function. I would change the vector that actually gets passed in to the function in scpca-nf to only include the cell type columns that are present. Otherwise you don't include them and they won't get created. That makes it so you don't have anything to do here.

I agree with this. In general we should make changes to that add arguments (and complexity) with great caution.

allyhawkins commented 10 months ago

I still don't think this is true. For example, we might have the situation where some but not all of the libraries in a given project have cell types. If that were to happen, we'd end up with NA cell types. This has most potential to happen I think for submitters.

I think this is okay though. If they exist for any libraries, we want that column to be present and then NA for other libraries.

jashapiro commented 10 months ago

I would change the vector that actually gets passed in to the function in scpca-nf to only include the cell type columns that are present. Otherwise you don't include them and they won't get created. That makes it so you don't have anything to do here.

I still don't think this is true. For example, we might have the situation where some but not all of the libraries in a given project have cell types. If that were to happen, we'd end up with NA cell types. This has most potential to happen I think for submitters.

If some libraries have cell types and others do not, don't we want NA? This would only affect the merged object, right?

sjspielman commented 10 months ago

If they exist for any libraries, we want that column to be present and then NA for other libraries.

I don't actually know... might we want them to be "Submitter excluded"?

This would only affect the merged object, right?

But yes, it would only affect the merged object!

Let's regroup over meet shortly!

allyhawkins commented 10 months ago

I don't actually know... might we want them to be "Submitter excluded"?

If that's the case, then we should still make that change in scpca-nf.

sjspielman commented 10 months ago

Well, it was fun (for me!) while it lasted! Closing this as not happening in this repo.

AlexsLemonade / scpcaTools

Keep celltype annotation columns when merging SCEs #259