Bioconductor / Contributions

Contribute Packages to Bioconductor
135 stars 33 forks source link

tidySCE - brings SingleCellExperiment to tidyverse with no side effects #1615

Closed stemangiola closed 4 years ago

stemangiola commented 4 years ago

Update the following URL to point to the GitHub repository of the package you wish to submit to Bioconductor

Confirm the following by editing each check box to '[x]'

I am familiar with the essential aspects of Bioconductor software management, including:

For help with submitting your package, please subscribe and post questions to the bioc-devel mailing list.

bioc-issue-bot commented 4 years ago

Hi @stemangiola

Thanks for submitting your package. We are taking a quick look at it and you will hear back from us soon.

The DESCRIPTION file for this package is:

Type: Package
Package: tidySCE
Title: 'Brings SingleCellExperiment to the Tidyverse 
Version: 0.1.0
Authors@R: c(person("Stefano", "Mangiola", email = "mangiolastefano@gmail.com",
        role = c("aut", "cre")) )
Description: It creates an invisible layer that allow to see the 'SingleCellExperiment' object 
    as tibble and interact seamlessly with the tidyverse.
License: GPL-3
Depends:
    R (>= 4.0.0),
    SingleCellExperiment
Imports:
    tibble,
    dplyr,
    magrittr,
    tidyr,
    ggplot2,
    rlang,
    purrr,
    lifecycle,
    methods,
    plotly,
    ellipsis,
    tidyselect,
    utils,
    S4Vectors
Suggests:
    BiocStyle,
    testthat,
    knitr,
    GGally,
    markdown,
    SingleCellSignalR,
    scater,
    scran,
    tidyHeatmap,
    igraph,
    uwot
VignetteBuilder: 
    knitr
RdMacros:
    lifecycle
Biarch: true
biocViews: AssayDomain, Infrastructure, RNASeq, DifferentialExpression, GeneExpression, Normalization, Clustering, QualityControl, Sequencing, Transcription, Transcriptomics
Encoding: UTF-8
LazyData: true
RoxygenNote: 7.1.1
stemangiola commented 4 years ago

@mblue9

mtmorgan commented 4 years ago

Please update your package to version 0.99.0. Please make sure that BiocCheck::BiocCheck() passes without error or warning. Please post a comment here when your package meets these minimum requirements.

mtmorgan commented 4 years ago

@stemangiola will you update your package version so the review process can start?

stemangiola commented 4 years ago

Hello Martin,

almost there. Today will be completed.

Maria Doyle (@mblue9) is helping with the submission. Could we add her to the participants list?

mtmorgan commented 4 years ago

participants are just those who participate (post a comment) on the issue...

stemangiola commented 4 years ago

Hello Martin,

done! You can start review.

stemangiola commented 4 years ago

@mtmorgan I tag you just to make sure the notification arrives.

bioc-issue-bot commented 4 years ago

A reviewer has been assigned to your package. Learn what to expect during the review process.

IMPORTANT: Please read this documentation for setting up remotes to push to git.bioconductor.org. It is required to push a version bump to git.bioconductor.org to trigger a new build.

Bioconductor utilized your github ssh-keys for git.bioconductor.org access. To manage keys and future access you may want to active your Bioconductor Git Credentials Account

bioc-issue-bot commented 4 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the build report for more details. This link will be active for 21 days.

Remember: if you submitted your package after July 7th, 2020, when making changes to your repository push to git@git.bioconductor.org:packages/tidySCE to trigger a new build. A quick tutorial for setting up remotes and pushing to upstream can be found here.

mtmorgan commented 4 years ago

Sorry for my slow response

DESCRIPTION

tests

R (comments are on specfic lines of code, but apply throughout)

man

lawremi commented 4 years ago

How does the user reconcile the object displaying itself as a tibble without it actually being a tibble? Will that lead to confusion? How do you handle conflicting method behavior, which may arise between a data.frame (tbl_df) and an SCE?

stemangiola commented 4 years ago

(@mtmorgan I will publish the review compliance this weekend both for tidySCE and tidySE).

Hello @lawremi could you please make a concrete example?

(The user of tidySCE will be able to apply any function that operates on SingleCell experiment no problem (e.g., calculate PCA), however a user does not have to necessarily interact with it directly. This issue resolves much more intuitively that one might think, e.g. examples in the README)

The only conflicts I can think of are ncol nrow (not central to analysis in my opinion as tibble nicely prints dimensions), and we can, with time, find a good solution (e.g. message of returning back-end dimensions for compatibility)

stemangiola commented 4 years ago

Hello @mtmorgan , here is the pull request for the compliance https://github.com/stemangiola/tidySCE/pull/9

Below the tasks resolved (or commented on) - originally worked in this issue https://github.com/stemangiola/tidySCE/issues/8

DESCRIPTION

We are well in-line as I don't like acronyms either. As note, sce is the common name for SingleCellExperiment object e.g. in this official source, is really commonly accepted by the community.

In order to respect both Bioconductor tradition and tidyverse standards (afterall this is a equal-rights marriage) the name tidySingleCellExperiment can be a good solution. (I understand the S4 point of view, but this package is dedicated to end-users, so the elegance in design and vocabulary has them as our priority)

Having said that, at the BioC-Asia2020 we will gather the opinion of the community with a pool at our workshop.

Complied.

Complied.

Complied.

TESTS

Complied.

Complied.

Complied.

I opted for full tydyverse style without the need to create temporary variables

R (comments are on specfic lines of code, but apply throughout)

Complied.

Complied.

I worked on it, and I agree. Respecting this standard is on out to-do list for tidySCE, tidySE and tidybulk. However I was not able to make things work under this standard yet this branch (either the export from roxigen disappears, or the method.class does not get exported). I ask to give us a release cycle to fix this. This although is a better approach, the current does not affect negatively when using any package.

I used cbind to cbind two SingleCellExperiment objects

Complied.

tts[[1]] is a SingleCellExperiment; colData(tts[[1]]) is a DataFrame that cannot be converted to tibble directly; DataFrame is used to update colData from SCE

Complied.

message("tidySCE says: A data frame is returned for independent data analysis.") results in quasi-arbitrary formatting for output, where the white space introduced for programmatic convenience is echoed literally to the user

tidySCE says: A data frame is returned for independent data analysis. Rely on message()'s internal use of paste0() or, if precise formatting is important, use strwrap()

message( "tidySCE says: A data frame is returned for ", "independent data analysis." ) or (presummable in a helper function for re-use across your code)

txt <- paste( "tidySCE says: A data frame is returned for ", "independent data analysis." ) message(paste(strwrap(txt, exdent = 4), collapse = "\n"))

Complied.

tst <- intersect( cols %>% names(), get_special_columns(.data) %>% c(get_needed_columns())) %>% length() %>% gt(0) ) if (tst) {

this is a concatenation "a" %>% c("b")

Complied.

columns <- get_special_columns(.data) %>% c(get_needed_columns()) %>% paste(collapse=", ") stop( "tidySCE says: you are trying to rename a column that is view only ", columns, " ", "(it is not present in the colData). If you want to mutate a view-only", "column, make a copy and mutate that one.", )

Complied.

For the moment we have adopted the stop paste0 function suggested above. In the future this will be an area of improvement.

the code is similar but not identical

This issue is related with the definition of generics (please see above)

We agree, we wait to create this file once the package name has been settled (first point in the review list).

Complied.

MAN

It's used for the figures in the README, as here: https://r-pkgs.org/whole-game.html

mtmorgan commented 4 years ago

Please push your changes to the Bioconductor git repository to trigger a build report.

See usethis::use_pipe() for the basic paradigm for adding methods to existing generics

diff --git a/NAMESPACE b/NAMESPACE
index 9aca9e5..5d732d8 100755
--- a/NAMESPACE
+++ b/NAMESPACE
@@ -59,7 +59,6 @@ S3method(summarise,tidySCE)
 S3method(tidy,SingleCellExperiment)
 S3method(unite,default)
 S3method(unite,tidySCE)
-S3method(unnest,default)
 S3method(unnest,tidySCE_nested)
 export("%>%")
 export(arrange)
diff --git a/R/tidyr_methods.R b/R/tidyr_methods.R
index c925663..84aba1c 100755
--- a/R/tidyr_methods.R
+++ b/R/tidyr_methods.R
@@ -43,23 +43,10 @@
 #'     nest(data=-groups) %>%
 #'     unnest(data)
 #' @rdname tidyr-methods
+#' @name unnest
 #'
 #' @export
-unnest <- function(.data, cols, ..., keep_empty=FALSE, ptype=NULL,
-    names_sep=NULL, names_repair="check_unique") {
-    UseMethod("unnest")
-}
-
-#' @export
-#' @rdname tidyr-methods
-unnest.default <- function(.data, cols, ..., keep_empty=FALSE, ptype=NULL,
-    names_sep=NULL, names_repair="check_unique") {
-    cols <- enquo(cols)
-    tidyr::unnest(.data, !!cols, ...,
-        keep_empty=keep_empty, ptype=ptype,
-        names_sep=names_sep, names_repair=names_repair
-    )
-}
+NULL

 #' @importFrom rlang quo_name
 #' @importFrom purrr imap

Can you clarify when DataFrame does not coerce to tibble directly?

> as_tibble(DataFrame(x=1:2, y = CharacterList(letters, LETTERS), z = GRanges(c("chr1:1-2", "chr2:2-3"))))
# A tibble: 2 x 7
      x y          z.seqnames z.start z.end z.width z.strand
  <int> <I<list>>  <fct>        <int> <int>   <int> <fct>
1     1 <chr [26]> chr1             1     2       2 *
2     2 <chr [26]> chr2             2     3       2 *
lawremi commented 4 years ago

An example is that se[1] takes the first row, while df[1] takes the first column.

stemangiola commented 4 years ago

An example is that se[1] takes the first row, while df[1] takes the first column.

I see. First of al thanks to point this out. The point, that we can make clearer is that tidyseurat behaves like a tibble when operate with tidyverse tools. This allow to not break any backward compatibility.

The operator "[" is not needed when operating with tidyverse, and tidyverse users don't use naturally (I haven't for couple of years now).

In my humble opinion, "[" has quite counter intuitive behavior by itself

And, as novel user of SummarizedExperiment object I find a little bit counter-intuitive the use of "[" operator (which acts by principle on n dimensional rectangular data structures) for hierarchical objects. Seems a bit arbitrary (to an inexperienced user).

The bottom line is that,

lawremi commented 4 years ago

Yes, issuing a warning or something may help from the user's perspective. There is also an issue with interoperability: tidyverse users often pass tibbles to package code that expects a data.frame, with all of its features. One way around this would be to simply provide a tidy/fluent/dplyr-style API on top of SE, without actually giving the the user the impression that it is a tibble, because it is not.

stemangiola commented 4 years ago

One way around this would be to simply provide a tidy/fluent/dplyr-style API on top of SE

This is basically already happening, it's just matter of allowing the classes. However in this case my worry is that all tidy methods would operate on a "black box" as they would not see the data frame (integration of metadata + all cell-wise information) they are operating with.

It is an interesting balance to think about, thanks for the suggestions. I am confident that with all the we will find a nice elegant balance.

Just to reassure :) , this minimal product already makes analyses much more proficient, for who likes tidyverse. Although the abstraction in my opinion is largely valid in it's current form, the details around this marriage will make a big difference in user comfort.

P.S. The possible larger plan will be to base tidybulk on tidySE and tidysinglecell, yet to come, on tidySCE and tidyseurat. So these abstraction can be used directly but they will be the database engine for tidy workflows.

stemangiola commented 4 years ago

tidyverse users often pass tibbles to package code that expects a data.frame

Yes, an understanding of this is needed. (as_tibble transform the tidySE into tibble hard copy, documented).

An (probably the most) elegant solution that I could not implement yet, is rebuilding/hijacking the tibble 'show/print' routine (quite complex), to display "tibble abstraction" or something instead of "tibble". So it is clear that is not a 100% tibble.

That's for sure in the to-do list, an help on this would be incredibly appreciated :)

lawremi commented 4 years ago

To broaden the discussion a bit, one thing that @sa-lee has been working on is a more abstract application of dplyr principles to the manipulation of SummarizedExperiment. Specifically, inventing new verbs, like mutate_col_data(), that chain together fluently, in a way that is familiar to the tidyverse, while still preserving the semantics of SummarizedExperiment and its interoperability with Bioconductor. Of course, this requires a shift in mental model by someone accustomed to the tidyverse. It may however be a worthwhile shift, since the SummarizedExperiment was designed specifically for the analysis of summarized genomic data, and the pattern has been reused, for example, by MultiAssayExperiment.

stemangiola commented 4 years ago

Interesting. I guess the difference is the direction of adapting one to the other :)

stemangiola commented 4 years ago

Hello @mtmorgan

Please push your changes to the Bioconductor git repository to trigger a build report.

Done

See usethis::use_pipe() for the basic paradigm for adding methods to existing generics

Thanks for this example! I think I was missing @name. I will tackle this tomorrow. If it works for a method, I will apply to all methods very quickly

Can you clarify when DataFrame does not coerce to tibble directly?

For example in bind_cols we have bind_cold(<tidysce>, <tibble>)

In the backend we have

bind_cold(colData(sce), <tibble>) with equates to bind_cold(<DFrame>, <tibble>)

Which causes error


bind_cols(DataFrame(x=1:2, y = CharacterList(letters, LETTERS)) , tibble(z=1:2))
Error in get(nm, envir = fn, mode = "function") : 
  object 'expect_equal' of mode 'function' was not found

If I convert to data.frame it does not

bind_cols(as.data.frame(DataFrame(x=1:2, y = CharacterList(letters, LETTERS))) , tibble(z=1:2)) %>% DataFrame()
DataFrame with 2 rows and 3 columns
          x         y         z
  <integer>    <list> <integer>
1         1 a,b,c,...         1
2         2 A,B,C,...         2
mtmorgan commented 4 years ago

I guess my point about as.data.frame() was that as_tibble() works and does not 'introduce' base R notions, so

DataFrame(x=1:2, y = CharacterList(letters, LETTERS)) %>%
    as_tibble() %>%
    bind_cols(tibble(z = 1:2))
mtmorgan commented 4 years ago

builds are triggered with a version bump, so please increment the Bioconductor repository version to 0.99.1

stemangiola commented 4 years ago

I guess my point about as.data.frame() was that as_tibble() works and does not 'introduce' base R notions, so

I see. I will implement.

bioc-issue-bot commented 4 years ago

Received a valid push on git.bioconductor.org; starting a build for commit id: 062119388a3af5a560651d6dec32cb1bfa159b5f

bioc-issue-bot commented 4 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the build report for more details. This link will be active for 21 days.

Remember: if you submitted your package after July 7th, 2020, when making changes to your repository push to git@git.bioconductor.org:packages/tidySCE to trigger a new build. A quick tutorial for setting up remotes and pushing to upstream can be found here.

mtmorgan commented 4 years ago

@stemangiola do we agree to rename the packages tidySingleCellExperiment and tidySummarizedExperiment? We'll need to update our internal database (@lshep) and repository (@nturaga)

stemangiola commented 4 years ago

@mtmorgan I would say so. I just wanted to wait until Saturday to try to involve the community, with a poll at our workshop at BioCAsia2020.

But if you say we are in a rush for some reason I might just start with conversion.

mtmorgan commented 4 years ago

we can wait; post here when you are ready.

stemangiola commented 4 years ago

Dear @mtmorgan,

I don't want to use too much of your time, but your experience here could be very helpful. I have implemented the drop of generics in the branch

https://github.com/stemangiola/tidySCE/tree/drop-generics-2nd-try

It generally works except for two functions (in dplyr they have shared roxigen documentation): bind_rows, bind_cols. Basically they are not correctly exported

(present in the file https://github.com/stemangiola/tidySCE/blob/drop-generics-2nd-try/R/dplyr_methods.R)

The test that fail is this

library(magrittr)
tt <- pbmc_small %>% tidy()
tt_bind <- tt %>% select(groups)
tt %>% tidySCE::bind_cols(tt_bind) 

Error: Input must be a vector, not a `tidySCE` object.
Run `rlang::last_error()` to see where the error occurred.

It keeps calling the dplyr method. There must be something trivial I am missing. An advice would be much appreciated.

mtmorgan commented 4 years ago

dplyr::bind_cols() is not a generic but a plain old function, so you haven't defined a method on it but rather a plain old function that happens to have a dot in its name bind_cols.tidySCE().

It seems like the right thing to do is to lobby the dplyr folks to make bind_cols() a generic.

From the user perspective It won't really work to define a generic in your package -- bind_cols() (without specifying namespace) will 'work' if you load dplyr first and then tidySCE, but not the other way around (I think). So the next-best solution is to implement a new verb.

The help page ?bind_cols is not well-structured. The title of the page is 'drplyr-methods' but one doesn't know what a 'drplyr' is. The paragraph describing bind_cols() is repeated twice but does not mention bind_cols() so it is not clear what the paragraph refers to. Part way through the help page is something that looks like a markdown code chunk with triple ticks surrounding "r, child="man/rmd/overview.Rmd".

stemangiola commented 4 years ago

Thanks, that was illuminating.

So, I have a strategy. So far, for how imperfect was it, I successfully (e.g., tidybulk) applied the "generic + .default" combination for all methods. Now I can only apply to bind_* and in the same time try to push for a proper solution in tidyverse.

Tidyverse at least does a good job saying which method it is overriding when loaded. The R computational universe is already full of function names; the last thing we want to do is to produce other custom-names functions.

About the package name, people tends to like the achronym better. However the poll was not setup by the organisation and we did not have many answers. I will change it to tidySingleCellExperiment and tidySummarizedExperiment

I am pushing changes to Bioconductor

stemangiola commented 4 years ago

Hello @mtmorgan another point. While you suggested to add key packages into dependencies

Depends: includes the non-default packages:
    'SingleCellExperiment', 'dplyr', 'tibble', 'tidyr', 'ggplot2',
    'plotly'
  Adding so many packages to the search path is excessive and importing
  selectively is preferable.

I might move them to Imports. Please let me know if you have concerns. Keeping SingleCellExperiment into dependencies.

mtmorgan commented 4 years ago

If your package accepts or returns objects that need the package in the Depends: to work with the object, then the package should be in the Depends: field. SingleCellExperiment and dplyr definitely belong ini the Depends: field. If you are expecting substantial plotting / you return ggplot2 objects, then ggplot2 belongs there too (but this seems like 'feature creep' for the core functionality of your package, e.g., SummarizedExperiment does not include plotting functions). tidyr could well be relevant for your users, but maybe you only expect them to rarely need functions from this pacakge? In my experience, tibble::tibble() is likely to be used from the tibble package, but it is already available via dplyr.

stemangiola commented 4 years ago

Hello @mtmorgan,

here are the two renamed repositories

https://github.com/stemangiola/tidySingleCellExperiment

https://github.com/stemangiola/tidySummarizedExperiment

nturaga commented 4 years ago

@mtmorgan I've handled the renaming in our system.

lshep commented 4 years ago

the SPB database has also been updated so you should be able to still receive build reports on version bumps in this issue tracker for submission

mtmorgan commented 4 years ago

@stemangiola please update the bioconductor repository to packages/tidySIngleCellExperiment and push changes there

stemangiola commented 4 years ago

Thanks @nturaga and @lshep . Unfortunately

rstudio-2 266 % git remote add upstream git@git.bioconductor.org:packages/tidySingleCellExperiment.git
rstudio-2 267 % git fetch --all
Fetching origin
Fetching upstream
fatal: protocol error: expected sha/ref, got 'shallow 0fd1dc7c0ec3dceb301e7d074a3217ded8fca3b6'
error: Could not fetch upstream
rstudio-2 268 % fatal: the remote end hung up unexpectedly
git fetch --unshallow
fatal: --unshallow on a complete repository does not make sense

Am I missing something?

nturaga commented 4 years ago

Thanks @nturaga and @lshep . Unfortunately

rstudio-2 266 % git remote add upstream git@git.bioconductor.org:packages/tidySingleCellExperiment.git
rstudio-2 267 % git fetch --all
Fetching origin
Fetching upstream
fatal: protocol error: expected sha/ref, got 'shallow 0fd1dc7c0ec3dceb301e7d074a3217ded8fca3b6'
error: Could not fetch upstream
rstudio-2 268 % fatal: the remote end hung up unexpectedly
git fetch --unshallow
fatal: --unshallow on a complete repository does not make sense

Am I missing something?

please try again. It should be fixed

stemangiola commented 4 years ago

Done for both repositories. Thanks!

Please let me know whether any action on my side is required for the publication on Bioconductor.

bioc-issue-bot commented 4 years ago

Your package has been accepted. It will be added to the Bioconductor nightly builds.

Thank you for contributing to Bioconductor!

mtmorgan commented 4 years ago

The master branch of your GitHub repository has been added to Bioconductor's git repository.

To use the git.bioconductor.org repository, we need an 'ssh' key to associate with your github user name. If your GitHub account already has ssh public keys (https://github.com/stemangiola.keys is not empty), then no further steps are required. Otherwise, do the following:

  1. Add an SSH key to your github account
  2. Submit your SSH key to Bioconductor

See further instructions at

https://bioconductor.org/developers/how-to/git/

for working with this repository. See especially

https://bioconductor.org/developers/how-to/git/new-package-workflow/ https://bioconductor.org/developers/how-to/git/sync-existing-repositories/

to keep your GitHub and Bioconductor repositories in sync.

Your package will be included in the next nigthly 'devel' build (check-out from git at about 6 pm Eastern; build completion around 2pm Eastern the next day) at

https://bioconductor.org/checkResults/

(Builds sometimes fail, so ensure that the date stamps on the main landing page are consistent with the addition of your package). Once the package builds successfully, you package will be available for download in the 'Devel' version of Bioconductor using BiocManager::install("tidySingleCellExperiment"). The package 'landing page' will be created at

https://bioconductor.org/packages/tidySingleCellExperiment

If you have any questions, please contact the bioc-devel mailing list (https://stat.ethz.ch/mailman/listinfo/bioc-devel); this issue will not be monitored further.