Adding subsets used to SCE

LTLA / scuttle

Clone of the Bioconductor repository for the scuttle package.

https://bioconductor.org/packages/devel/bioc/html/scuttle.html

9 stars 7 forks source link

Adding subsets used to SCE #30

Closed llrs closed 3 months ago

llrs commented 3 months ago

This is a draft implementation of the feature request on #29 . I am not sure if the name matches the package conventions or you'd like to have a different approach. Let me know if it makes sense and I'll document the method. Also in my local machine there was an error in tests:

_R_CHECK_LIMITCORES' environment variable detected, BiocParallel workers must be <= 2 was (3)

I don't think it is relevant to the changes I've done. But I was surprised to see that tests use more than 2 cores without skip_on_bioc()...

While Checking the package in my system I noted a couple of issues.

I think R now detects c++ version needed: Specified C++11: please drop specification unless essential
Roxygen documentation hasn't been updated in some time and roxygen2 7.3.2 needs the Encoding field: RoxygenNote: 7.3.2 Encoding: UTF-8

Do you want me to address them here or in a separate PR?

LTLA commented 3 months ago

Mostly looks good. Comments in no particular order:

4 space indents.
Document the return value of featureSelected.
Don't pass ... from addPerCellQCMetrics to featureSelected, there is no guarantee that the arguments are the same between featureSelected and perCellQCMetrics. So, just make subsets= an explicit argument to addPerCellQCMetrics and pass them to both of the internal function calls.
In fact, I'm not sure there's a good reason for featureSelected to be a generic, given that it only really needs the rownames; it could just be a simple function accepting the rownames and subsets.
Do you really need simplify2array? DataFrame constructor works fine with a list.
Consider creating an empty DF with make_zero_col_DFrame in the no-subset case of featureSelected, for parity.
Slap some rownames on the output DFs.
Add some tests.

Don't worry about the other things for now.

llrs commented 3 months ago

Addressed most of the questions (I am not sure if I missed some indentation):

[ ] 4 space indents.
[x] Document the return value of featureSelected.
[x] Don't pass ... from addPerCellQCMetrics to featureSelected, there is no guarantee that the arguments are the same between featureSelected and perCellQCMetrics. So, just make subsets= an explicit argument to addPerCellQCMetrics and pass them to both of the internal function calls.
[x] In fact, I'm not sure there's a good reason for featureSelected to be a generic, given that it only really needs the rownames; it could just be a simple function accepting the rownames and subsets.
[x] Do you really need simplify2array? DataFrame constructor works fine with a list.
[x] Consider creating an empty DF with make_zero_col_DFrame in the no-subset case of featureSelected, for parity.
[x] Slap some rownames on the output DFs.
[x] Add some tests.

LTLA commented 3 months ago

I cleaned it up a bit, once I realized that we have a .subset2index function that handles coercions from different types. Also added a prefix to distinguish it from other things that might already be present in the rowData (and to optionally turn off this addition if it is not desired). See if it still makes sense for you.

llrs commented 3 months ago

I agree it makes sense to have a prefix too. But I would prefer if the default would be subsets instead of just subset as then it would match the default prefix added to colData.

Many thanks for the improvements I learned a couple of tricks there.