Note on branches

I previously made all my changes to the forked repo bschilder/EWCE on the DelayedArray branch. However, something got screwed up with the git history in that fork, and the only way to push to the main repo was to clone the main repo (NathanSkene/EWCE), create a new branch called bschilder_dev, copy and paste all the files with my DelayedArray branch's edits into this repo, and then make a Pull Request from there. This means you won't have the incremental commits I made as I upgraded EWCE, but at least everything will be harmonized moving forward. I'll delete my old forked repo and make any future edits on this NathanSkene/EWCE@bschilder_dev branch.

Upgrades & new features

All functions can now use lists and CellTypeDatasets (CTD) from any species and convert them to a common species (human by default) via orthogene.
Automated CTD standardisation via standardise_ctd.
Can handle (sparse) matrices.
Can create CTD from very large datasets using DelayedArray object class.
All functions automatically create appropriate gene backgrounds given species.
More modular, simplified vignettes.
Additional gene pre-filtering options (DESeq2, MAST, variance quantiles).
New/improved plotting functions (e.g. plot_ctd).
Added example bootstrapping enrichment results as extdata to speed up examples (documented in data.R). Accessed via EWCE::example_bootstrap_results().
Replaced Travis and R-CMD GHA workflows with check-bioc GHA workflow to automatically: create a local Bioconductor Docker container, run R-CMD checks, run BiocCheck, and rebuild/deploy the pkgdown site. Everything is passing all checks on all 3 platforms!

To do

Automatically build and push a Docker container for EWCE to Docker Hub during the GHA checks. This should be easier now since the new GHA workflow creates exactly that during checks. We just have to push it to the neurogenomicslab DockerHub at the end.

@NathanSkene some of these checks might fail, actually, bc we need to add some variables to GitHub Secrets for this repo. It's really quick and easy, so I can walk you through it via Slack on Monday

Update DESCRIPTION Version to 1.3.1 to reflect latest release

.github/workflows/check-bioc-docker.yml:

run_docker: 'false' - Is this parameter in use? It is pushing to Docker with a commit now, correct? Also, can you test to ensure a push to Docker won't occur if a check fails (either in EWCE or orthogene)?

This feature is not being used at the moment. Ideally, it would be most efficient to build the Docker container once, test it, and push to DockerHub all in one workflow, but I couldn't get this working smoothly and opted to just use the separate dockerhub.yml workflow based on scFlow.

Checklist

[x] Version 1.3.1 not 2.0.0

Alan and I discussed this. While naming this update 2.0 might be useful for remembering which version had major changes, this isn't allowed by Bioconductor (they dictate the version changes according to their devel/release schedule). Instead, I'll document these updates in the NEWS, README, and vignettes.

R/assign_cores.r:

[x] Mention this and any other functions that can run in parallel in the NEWS file

R/bin_columns_into_quantiles.r

[x] "@param defaultBin Which bin to assign when there's only one non-zero quantile." - I'm not sure I understand what this parameter is doing from the description, can you elaborate a little? If you request 40 bins but there's only 1 unique value in the vector, how do you assign it? 1? 40?
defaultBin sets this value as the median bin number.
[x] Your comment here explains it better: "In situations where there's only one non-zero quantile, cut() throws an error." Maybe add this or a shorter version to the parameter?
[x] Do you have an example of when this happens? I'm struggling to understand the use case.
[x] Add a unit test for this case too please
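
To make the use case concrete, here is a minimal base-R sketch of the situation described above. It is not EWCE's implementation: `bin_into_quantiles_sketch`, `n_bins`, and `default_bin` are hypothetical names, and the fallback to the middle bin is only an illustration of the "median bin number" behaviour mentioned in the reply.

```r
# When every non-zero value is identical, all quantile breaks coincide
# and cut() refuses them.
x_const <- c(3, 3, 3)
cut_result <- tryCatch(
  cut(x_const, breaks = quantile(x_const, probs = seq(0, 1, 0.25))),
  error = function(e) conditionMessage(e)
)
# cut_result now holds the error message instead of a factor.

# Hypothetical helper (not EWCE code): bin non-zero values into quantiles,
# falling back to a default bin when only one distinct break exists.
bin_into_quantiles_sketch <- function(x, n_bins = 40,
                                      default_bin = n_bins %/% 2) {
  out <- rep(0L, length(x))   # zeros stay in bin 0
  nz <- x > 0
  breaks <- unique(
    stats::quantile(x[nz], probs = seq(0, 1, length.out = n_bins + 1))
  )
  if (length(breaks) > 1) {
    out[nz] <- as.integer(cut(x[nz], breaks = breaks, include.lowest = TRUE))
  } else {
    out[nz] <- as.integer(default_bin)
  }
  out
}

one_value <- bin_into_quantiles_sketch(c(0, 0, 3, 3, 3), n_bins = 40)
varied <- bin_into_quantiles_sketch(c(0, 1, 2, 3, 4), n_bins = 2)
```

A unit test for this case could assert that a column with a single non-zero value returns the default bin rather than raising an error.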

R/bootstrap_enrichment_test.R:

[x] Not sure about the phrasing, is the second last full stop intentional?:

#' @param bg List of gene symbols containing the background gene list
#' (including hit genes). If \code{bg=NULL},
#'  an appropriate gene background will be created automatically.
#' if \code{geneSizeControl=TRUE}.

[x] add function call users can do to view all species to parameter description (and following to parameters):
```
@param genelistSpecies Species that \code{hits} genes came from
#' (no longer limited to just "mouse" and "human").
```
Now points users to new function list_species
[x] This parameter description should be updated now that EWCE can automatically convert hit list species to human right?:
```
#' @param geneSizeControl Whether you want to control for
#' GC content and transcript length. Recommended if the gene list originates
#' from genetic studies (\emph{Default: FALSE}).
#'  If set to \code{TRUE}, then \code{hits} must be from humans.
#' should be used rather than mouse.
```
Can still only use human since gene length will differ between species, and we only have a human reference for this info. Would be nice to extend to other species eventually tho.

R/calculate_meanexp_for_level.R

[x] Add unit tests for the issues you describe here: "#### Guards against issues with DelayedArray"

R/calculate_specificity_for_level.R

[x] Again make sure the unit test above covers this, obv same one should: "#### Guards against issues with DelayedArray"

R/controlled_geneset_enrichment.r

[x] default reps is 10,000 here but 100 in bootstrap_enrichment test, change one to be consistent

R/create_list_network.R

~~[ ] add description of function and parameters even though internal, it's good to have for us~~
~~[ ] Not sure what numBOOT does but if it is number of bootstrap, let's make the default consistent with what I said above too.~~

I just tried to break up the code by putting it into smaller functions. You'll have to ask @NathanSkene regarding what this does exactly.

R/create_quadrants.R

~~[ ] add description of function and parameters even though internal, it's good to have for us~~

Same as above. @NathanSkene

R/ctd_to_sce.R

[x] Add parameters (I know this is a copied function but worth having locally)

R/delayedarray_normalize.R

[x] add description of function and parameters even though internal, it's good to have for us

R/drop_nonexpressed_cells.R

[x] add description of function and parameters even though internal, it's good to have for us

R/drop_nonexpressed_genes.R

[x] add description of function and parameters even though internal, it's good to have for us

R/drop_uninformative_genes.r

[x] description doesn't make it clear that there are other options now to remove genes that don't vary. Expand on this please! **Description already included, you just have to scroll to the bottom of the docs. Will move up to make more apparent.**
[x] #' @param DGE_method: Which method to use for the Differential Gene Expression (DGE) step. - list out the options for the user.
[x] #' @param input_species Which species the gene names in \code{exp} come from. #' @param output_species Which species' genes names to convert \code{exp} to - again give the users a function to list all available species.
[ ] adj_pval_thresh = 0.00001, - this seems odd to me that's it's so low, why not just 0.05? I get you may have some FP's but surely EWCE will deal with these? Has an analysis been done comparing the different DEG approaches against a benchmark dataset? I think this is important to do since this is a key part of your expansion on EWCE. I imagine you will publish this version of EWCE right? This would form a fundamental part of that paper.

Not sure why this parameter was originally chosen. @NathanSkene ?

[x] no_cores = 1, - again parallel capacity should be in vignette/NEWS file

R/dt_to_df.R

[x] add description of function and parameters even though internal, it's good to have for us

R/ewce_expression_data.r

~~[ ] list other DEG methods available too~~

#' @param tt Differential expression table.
#' Can be output of \link[limma]{topTable} function.
#' Minimum requirement is that one column stores a metric of ...

Only limma is currently supported for this, as the output columns are tool-specific.

R/ewce_plot.r

[x] I would also state that the multiple testing is then calculated across all results - I don't think this idea is the clearest to all people

\link[EWCE]{bootstrap_enrichment_test} or
#' \link[EWCE]{ewce_expression_data} functions.
#' Multiple results tables can be
#' merged into one results table, as long as the 'list' column is set to
#' distinguish them.
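
A small base-R sketch of what "multiple testing is calculated across all results" means in practice. The column names (`CellType`, `p`, `list`, `q`), the p-values, and the choice of Benjamini-Hochberg are assumptions made for illustration only; they are not taken from the EWCE source.

```r
# Two toy results tables with made-up p-values; the 'list' column
# distinguishes which gene list each row came from.
res_a <- data.frame(CellType = c("astrocytes", "microglia"),
                    p = c(0.01, 0.04), list = "study_A")
res_b <- data.frame(CellType = c("astrocytes", "microglia"),
                    p = c(0.03, 0.50), list = "study_B")

# Merge into one results table.
merged <- rbind(res_a, res_b)

# Correction across ALL rows of the merged table (4 tests).
merged$q <- p.adjust(merged$p, method = "BH")

# For comparison: correction within each list separately (2 tests each).
per_table_q <- ave(merged$p, merged$list,
                   FUN = function(p) p.adjust(p, method = "BH"))
```

The adjusted values differ between the two approaches because the number of tests being corrected for changes, which is why it is worth stating explicitly in the docs which one the plot reflects.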

R/filter_variance_quantiles.r

[x] add description of function and parameters

R/get_ctd_levels.R

[x] add description of function and parameters

R/get_exp_data_for_bootstrapped_genes.R

~~[ ] add description of function and parameters.~~

From original EWCE. Ask @NathanSkene

R/get_sig_results.R

[x] add description of function and parameters

R/is_celltypedataset.R

[x] add description of function and parameters

R/is_delayed_array.R

[x] add description of function and parameters

R/is_matrix.R

[x] add description of function and parameters

R/is_sparse_matrix.R

[x] add description of function and parameters

R/max_ctd_depth.R

[x] add description of function and parameters

R/merge_sce_list.R

[x] add description of function and parameters

R/message_parallel.R / R/messager.R / R/myScalesComma.R

[x] add description of function and parameters

R/package.R

[x] What does this do? Haven't seen it before, is it just to give an explanation of the package? Can a user call it in R?

The package.R file is special: it gives users a description of the package when they run ?EWCE.

R/plot_bootstrap_plots.R

~~[ ] add description of function and parameters.~~

From the original EWCE. Ask @NathanSkene

R/plot_log_bootstrap_distributions.R / R/plot_with_bootstrap_distributions.R

[x] add description of function and parameters

R/rNorm.R

~~[ ] add description of function and parameters~~

From the original EWCE. Ask @NathanSkene

R/read_ctd.R / R/sce_merge_comparable_levels.R / R/zeisel2018_functions.r

[x] Why are these scripts kept and not just removed? They are all commented out

They're works in progress.

R/run_deseq2.R

[x] LRT is the correct default as it has been proven to be better than Wald but I think you should give the user the option to choose Wald too.
Added new arg dge_test.
[ ] More generally, I think you should add in edgeR as an DEG option, in our analysis of scRNA-Seq methods, edgeR performs better than DESeq2. See this benchmark paper for the same results (fig 1(c)). Also you should offer the two test types of LRT and QLF but again LRT should be the default

Would be a great add-on but I'd ask that you implement any additional DGE methods you'd like to use.

R/sce_lists_apply.R / R/to_dataframe.R / R/to_delayed_array.R / R/to_sparse_matrix.R

[x] add description of function and parameters

README.Rmd

[x] Code coverage isn't showing which I guess is because this isn't the master branch right? Can you make a note to check this when the push is made to master?

Correct, this will be updated when this branch is merged

~~[x] Installation instructions is for the dev branch can you specify this and add instructions for the current release.~~

Not sure what you mean. Feel free to edit the README further after merging. These are the current instructions.

if (!require("BiocManager")){install.packages("BiocManager")}

BiocManager::install("EWCE")

Getting started vignette:

[x] Overall, I think this is a great idea with the vignettes! Can you add a quick description of the differences in the ctd levels, what they mean and how to use them? No need for code but just explain the differences biologically

Extended vignette:

[x] Can you add more of a description of the differences in the ctd levels, what they mean biologically? You have code for them but I think the explanation could be expanded.
[x] Plot results from multiple sets of enrichment results section - can you add an explanation that multiple testing is applied across all studies in the list.
[x] Can you explain the different species EWCE can now be run with i.e. if your gene list isn't human can still use it etc
[x] Detail that certain parts can be run in parallel and name them (People ask this of packages frequently).

Create CTD vignette:

[x] In Calculate specificity matrices section - describe the different DEG methods for drop_uninformative_genes
[x] For the cortex_mrna$exp_scT_normed <- EWCE::sct_normalize(cortex_mrna$exp) can you block the output of this R chunk? It takes up too much of the vignette

Docker vignette:

[x] This isn't in the README so won't be accessible in the website, can you add it?

All vignettes are accessible via the "Articles" table of the docs website. Will add link in README as well.

Vignettes in general:

[x] Can you make a note to check all the links to the vignettes on the website work when you push to master?
[x] Did you run devtools::build_site() before pushing? In your new GHA and website push approach is this needed? I think it is still necessary as the website is built from the files in docs/articles/

The new GHA workflow automatically rebuilds the whole docs website with pkgdown and pushes to the gh-pages branch. So manually rebuilding is no longer necessary.

[ ] devtools check gives note which you will need to sort (remove git history of png's .RData objects should do it):
```
> checking installed package size ... NOTE
installed size is  6.8Mb
sub-directories of 1Mb or more:
  doc   6.3Mb
```
I went through and tried to prune the git history multiple times using all the strategies you suggested, as well as reduced the vignette file size (i.e. what populates doc/ during building). I'll give it another pass, but if not maybe you could try to get it down further @Al-Murphy ?
[x] Code coverage check with devtools::test_coverage() gave ~70%, Brian I think you need to add more tests to cover the new functionality. We should aim for 80% coverage but more importantly make sure all new functionality is thoroughly tested.

Code coverage is now >84%. All CRAN/Bioc checks are passing locally, but it does take a while. We may have to do some further optimization with tests/examples to keep it below the 15m limit.

In the DESCRIPTION, you have now changed it to 1.3.2 since you made additional changes. However, since this first batch of changes wasn't pushed to Bioconductor it should still be 1.3.1. Can you revert?


Oh I see, didn't realize that Bioc doesn't let you skip versions. Np, changing it back.

Notes:

news.md file lists changes in 1.3.1 and 1.3.2 separately but we reverted to 1.3.1 numbering so just merge these.
Add documentation (description and parameters) to calculate_specificity_for_level, bootstrap_plot, cell_list_dist, check_annotLevels, check_args_for_bootstrap_plot_generation, check_bootstrap_args, check_controlled_args, check_ewce_expression_data_args, check_full_results, check_generate_controlled_bootstrap_geneset, check_species, check_nas, check_group_name, check_numeric, get_exp_data_for_bootstrapped_genes, get_graph_theme
For delayedarray_normalize.R - Is there a unit test to check this gives the same result as original (not delayed matrix) approach? If not, can you add one?
Similar to the above, consider if all tests are in place for new features but also where new ways of implementing the same functions in EWCE have been used. These should give the exact same results as before and should be tested for that. Other than this, I get the coverage at 83% which is pretty good so I'm happy enough with that.
For sce_merge_comparable_levels.R, this is all commented out, can it be removed?
README:
- "EWCE requires R>=4.1 and Bioconductor>=1.4" - Should be Bioconductor>=3.14
- Vignettes still aren't available at README link, make a note to check this for when it is merged

Two CRAN check notes that still need to be sorted:


── R CMD check results ────────────────────────────────────────────────────────────────────────────────────── EWCE 1.3.1 ────
Duration: 16m 13.7s

> checking installed package size ... NOTE
    installed size is 6.9Mb
    sub-directories of 1Mb or more:
      doc 6.3Mb

> checking top-level files ... NOTE
    Non-standard file/directory found at top level:
      ‘doc’

[x] news.md file lists changes in 1.3.1 and 1.3.2 separately but we reverted to 1.3.1 numbering so just merge these.

Add documentation (description and parameters) to:

[x] calculate_specificity_for_level
[x] bootstrap_plot
[x] cell_list_dist
[x] check_annotLevels
[x] check_args_for_bootstrap_plot_generation
[x] check_bootstrap_args
[x] check_controlled_args
[x] check_ewce_expression_data_args
[x] check_full_results
[x] check_generate_controlled_bootstrap_geneset
[x] check_species
[x] check_nas
[x] check_group_name
[x] check_numeric
[x] get_exp_data_for_bootstrapped_genes
[x] get_graph_theme

I'll try to do some of these, but as I mentioned before many of them have parameters that were named by @NathanSkene and I'm unsure what they are. I think this is why we wanted to have the meeting first.

In the meantime, the best I can do for some parameters is simply repeating what the argument name is. I am marking all of these with (#fix) so we can know where they are.

@param hit.exp hit.exp (#fix)

[x] For delayedarray_normalize.R - Is there a unit test to check this gives the same result as original (not delayed matrix) approach? If not, can you add one?

Which original normalization function are you referring to? sct_normalize? I wouldn't expect delayedarray_normalize to be exactly the same since it's a different procedure to what SCT does. However, I can at least confirm that DelayedArrays do indeed work as input to sct_normalize.
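
For reference, the kind of equivalence check being asked about could look roughly like the sketch below: the same normalisation applied to a dense matrix and to a non-dense representation of the same data should agree. Everything here is an assumption for illustration. `normalize_cols_sketch` is a hypothetical stand-in and is not delayedarray_normalize or sct_normalize, the counts are toy values, and a Matrix sparse matrix is used in place of a DelayedArray so the sketch only needs packages that ship with R.

```r
library(Matrix)

# Hypothetical stand-in for a simple library-size normalisation
# (not EWCE's delayedarray_normalize): divide each column by its total.
normalize_cols_sketch <- function(exp) {
  t(t(exp) / colSums(exp))
}

# Toy count matrix: 3 genes (rows) x 4 cells (columns).
dense <- matrix(c(0, 2, 2,
                  1, 0, 3,
                  4, 4, 0,
                  5, 0, 5), nrow = 3)
sparse <- Matrix(dense, sparse = TRUE)

norm_dense <- normalize_cols_sketch(dense)
norm_sparse <- as.matrix(normalize_cols_sketch(sparse))

# Largest absolute difference between the two results.
max_diff <- max(abs(norm_dense - unname(norm_sparse)))
```

In a testthat file the final comparison would become an expectation on `max_diff` (or `expect_equal` on the two matrices) rather than a free-standing value.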

[x] Similar to the above, consider if all tests are in place for new features but also where new ways of implementing the same functions in EWCE have been used. These should give the exact same results as before and should be tested for that. Other than this, I get the coverage at 83% which is pretty good so I'm happy enough with that.

I would not expect the results to be exactly the same, since EWCE now uses gene backgrounds generated by orthogene. The goal of this was to improve the accuracy of results EWCE produces (due to more comprehensive ortholog data), in addition to expanding EWCE's applicability to other species.

However, the tests I do have ensure that at least some of the top cell-types are still enriched (see test-bootstrap_enrichment_test.R)

[x] For sce_merge_comparable_levels.R, this is all commented out, can it be removed?

README:
[x] "EWCE requires R>=4.1 and Bioconductor>=1.4" - Should be Bioconductor>=3.14
[x] Vignettes still aren't available at README link, make a note to check this for when it is merged

Two CRAN check notes that still need to be sorted:

I first tried preventing all code chunks from running in the extended vignette. But this barely reduced the total package size (6.2 --> 6 Mb). What made a huge difference was merging the vignettes. I guess you were right @Al-Murphy , there really is a ton of overhead per vignette!

By merging all 5 vignettes into just 2, I've now managed to get the whole package under 5Mb.

> checking installed package size ... NOTE
    installed size is  6.9Mb
    sub-directories of 1Mb or more:
      doc   6.3Mb

[x]

> checking top-level files ... NOTE
Non-standard file/directory found at top level:
‘doc’

Avoided this by adding ^doc$ to the .Rbuildignore file.
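
Entries in .Rbuildignore are Perl-style regular expressions matched case-insensitively against file paths relative to the package root, so the anchors in `^doc$` matter. A quick base-R check of what that pattern does and doesn't match (the example paths are made up for illustration):

```r
# Example top-level paths; only an exact "doc" should be ignored.
paths <- c("doc", "docs", "inst/doc", "documentation.md")

# Mirror how the pattern is applied: Perl regex, case-insensitive.
ignored <- grepl("^doc$", paths, perl = TRUE, ignore.case = TRUE)
names(ignored) <- paths
print(ignored)
```

Without the anchors, a bare `doc` pattern would also catch docs/ (the pkgdown site) and inst/doc, which is presumably not what we want here.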

Great work, very happy to see this, getting EWCE working with big datasets was basically the first thing I wanted done back when the lab started, thanks for getting it here!

Merry Christmas!

From: Brian M. Schilder @.> Sent: 23 December 2021 11:27 To: NathanSkene/EWCE @.> Cc: Skene, Nathan G @.>; Mention @.> Subject: Re: [NathanSkene/EWCE] EWCE 2.0 (PR #47)


Yay!!! 🍾


NathanSkene / EWCE

EWCE 2.0 #47
