DillonHammill / CytoExploreR

Interactive Cytometry Data Analysis

Resolution problem on plotting functions #101

Closed SgtVil closed 2 years ago

SgtVil commented 3 years ago

Hello Dillon, I've noticed a resolution problem that happens quite often: I get big "pixels" where the density is high. Usually restarting my R session and RStudio does the trick, but this time it doesn't. First, which package does the plotting depend on? Second, do you get this problem too? It might also come from my installation, but that seems less plausible.

image

Anyway, thanks for your nice package and all the energy you put into it. Cheers, Rémy

sessionInfo() R version 4.0.3 (2020-10-10) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 18.04.5 LTS

Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale: [1] LC_CTYPE=fr_FR.UTF-8 LC_NUMERIC=C LC_TIME=fr_FR.UTF-8 LC_COLLATE=fr_FR.UTF-8
[5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=fr_FR.UTF-8 LC_PAPER=fr_FR.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] shiny_1.5.0 magrittr_2.0.1 forcats_0.5.0 dplyr_1.0.2 purrr_0.3.4
[6] readr_1.4.0 tidyr_1.1.2 tibble_3.0.4 tidyverse_1.3.0 stringr_1.4.0
[11] ggcyto_1.16.0 ncdfFlow_2.36.0 BH_1.72.0-3 RcppArmadillo_0.10.1.2.0 ggplot2_3.3.3
[16] flowViz_1.54.0 lattice_0.20-41 flowStats_4.2.0 CytoExploreR_1.0.8 openCyto_2.2.0
[21] flowWorkspace_4.2.0 flowCore_2.2.0 phyloseq_1.32.0

loaded via a namespace (and not attached): [1] readxl_1.3.1 backports_1.2.1 changepoint_2.2.2 plyr_1.8.6 igraph_1.2.6 splines_4.0.3
[7] RPMG_2.2-3 fda_5.1.9 digest_0.6.27 foreach_1.5.1 htmltools_0.5.0 fansi_0.4.1
[13] Rwave_2.4-8 cluster_2.1.0 ks_1.11.7 hdrcde_3.3 aws.signature_0.6.0 Biostrings_2.56.0
[19] modelr_0.1.8 RcppParallel_5.0.2 matrixStats_0.57.0 R.utils_2.10.1 askpass_1.1 fds_1.8
[25] cytolib_2.2.0 prettyunits_1.1.1 jpeg_0.1-8.1 colorspace_2.0-0 rvest_0.3.5 rrcov_1.5-5
[31] haven_2.3.1 xfun_0.19 crayon_1.3.4 RCurl_1.98-1.2 jsonlite_1.7.2 hexbin_1.28.1
[37] graph_1.68.0 survival_3.2-7 zoo_1.8-8 iterators_1.0.13 ape_5.4-1 glue_1.4.2
[43] flowClust_3.28.0 gtable_0.3.0 zlibbioc_1.36.0 XVector_0.28.0 IDPmisc_1.1.20 Rgraphviz_2.34.0
[49] Rhdf5lib_1.12.0 BiocGenerics_0.36.0 DEoptimR_1.0-8 scales_1.1.1 mvtnorm_1.1-1 DBI_1.1.0
[55] Rcpp_1.0.5 xtable_1.8-4 progress_1.2.2 tmvnsim_1.0-2 clue_0.3-58 reticulate_1.18
[61] rsvd_1.0.3 mclust_5.4.7 RSEIS_3.9-3 stats4_4.0.3 umap_0.2.7.0 htmlwidgets_1.5.3
[67] httr_1.4.2 RColorBrewer_1.1-2 ellipsis_0.3.1 rainbow_3.6 pkgconfig_2.0.3 XML_3.99-0.5
[73] R.methodsS3_1.8.1 dbplyr_2.0.0 tidyselect_1.1.0 rlang_0.4.10 reshape2_1.4.4 later_1.1.0.1
[79] cellranger_1.1.0 visNetwork_2.0.9 munsell_0.5.0 tools_4.0.3 cli_2.2.0 generics_0.1.0
[85] ade4_1.7-16 broom_0.7.3 aws.s3_0.3.21 evaluate_0.14 biomformat_1.16.0 fastmap_1.0.1
[91] yaml_2.2.1 fs_1.5.0 knitr_1.30 EmbedSOM_2.1.1 robustbase_0.93-7 RBGL_1.66.0
[97] nlme_3.1-151 mime_0.9 rhandsontable_0.3.7 R.oo_1.24.0 xml2_1.3.2 compiler_4.0.3
[103] shinythemes_1.1.2 rstudioapi_0.13 curl_4.3 png_0.1-7 reprex_0.3.0 pcaPP_1.9-73
[109] stringi_1.5.3 RSpectra_0.16-0 Matrix_1.3-0 vegan_2.5-7 permute_0.9-5 multtest_2.44.0
[115] vctrs_0.3.6 pillar_1.4.7 lifecycle_0.2.0 flowAI_1.20.1 BiocManager_1.30.10 data.table_1.13.6
[121] bitops_1.0-6 corpcor_1.6.9 httpuv_1.5.4 R6_2.5.0 latticeExtra_0.6-29 promises_1.1.1
[127] gridExtra_2.3 KernSmooth_2.23-18 RProtoBufLib_2.2.0 IRanges_2.22.2 codetools_0.2-18 assertthat_0.2.1
[133] MASS_7.3-53 gtools_3.8.2 rhdf5_2.32.4 openssl_1.4.3 withr_2.3.0 mnormt_2.0.2
[139] S4Vectors_0.28.1 mgcv_1.8-33 parallel_4.0.3 hms_0.5.3 grid_4.0.3 rmarkdown_2.6
[145] Rtsne_0.15 lubridate_1.7.9.2 Biobase_2.50.0 base64enc_0.1-3 ellipse_0.4.2

SgtVil commented 3 years ago

One more little thing: this happens when I set the limits manually: lim = c(10, 10e5); cyto_plot_gating_scheme(gs_iso, stat = "freq", display = 1, popup = F, header = cyto_details(gs_iso)[,1], xlim = lim, ylim = lim)

DillonHammill commented 3 years ago

The issue in the top left panel is likely due to there being events with negative values on one of your axes - for some reason this causes problems with the density computation. Originally, I thought it would only affect the display of linear FSC/SSC data, so I put a patch in for this case, but it looks like it occurs in some transformed channels as well.

What transformation have you used for these channels?

My suggestion would be to always pre-clean the data either by setting clean = TRUE in cyto_setup() or by using a boundary/threshold gate as your first gate to remove any negative events.
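As a rough illustration of what removing negative events amounts to, here is a minimal base R sketch on a plain numeric matrix. It does not use the CytoExploreR gating API; the channel names `FL1` and `FL2` and the simulated values are made up for the example.

```r
# Hypothetical event data: a plain numeric matrix standing in for two channels.
set.seed(1)
events <- cbind(
  FL1 = rnorm(1000, mean = 200, sd = 150),
  FL2 = rnorm(1000, mean = 300, sd = 200)
)

# Keep only events that are non-negative in every channel.
keep <- rowSums(events < 0) == 0
events_clean <- events[keep, , drop = FALSE]

# Fraction of events removed by this threshold.
removed <- 1 - mean(keep)
```

In a real analysis the same effect would come from `clean = TRUE` or a boundary gate as described above, so that the gating tree records the step.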

SgtVil commented 3 years ago

THX for the prompt response once again. And happy new year !

Indeed, there are a lot of negative values in this dataset, which makes it very complicated to work with when transforming the data (it's bacterial flow cytometry, so the events are very small and always in log). Moreover, I reckon that CytoFLEX data (at least mine) has a lot more negative values than BD Canto or Fortessa data. I'm currently trying clean = T, but for now it doesn't seem to do the trick.

About transformations :

(these are on cleaned data, which doesn't seem to change anything here)

As you suspected, removing negative events by hand removes the density problems.

image

So this is a working solution !

I actually have another question that might be better as a separate issue: how can I reset the gating strategy or transformations of a GatingSet? I often need to go back through my pipeline and I can't remove these, so I always have to rebuild the GatingSet with cyto_setup().

Thx a lot !

Cheers, Rémy

SgtVil commented 3 years ago

Forgot to add the graph with the limits set: image

DillonHammill commented 3 years ago

You can save your GatingSet to disk using cyto_save(gs, save_as = "directory") and this will retain all your analyses. When you come back just give cyto_setup("directory") that directory and it will take care of everything for you. In fact I like to use the original cyto_setup() call but just change the directory to that where the GatingSet has now been saved.

I will revisit the density problem next week to see if I can come up with a more concrete solution to this problem. I will keep this issue open as a reminder and report back when I have a solution.

DillonHammill commented 3 years ago

@SgtVil, I finally figured out what was causing the problem with displaying the 2D density. I have already prepared a fix for the coming version of CytoExploreR so I thought I would summarise the changes here.

Basically, the grid-like appearance is due to the default bin number that I chose for the calculation of the 2D kernel density estimate. The current default is 128 bins; this value cannot be modified by the user and was chosen as a compromise between appearance and plotting speed. Since the coming version of CytoExploreR will plot data much faster, I have increased the number of bins to 200. This will provide a smoother appearance at the cost of slightly increasing the plotting time.

The other issue concerns events outside the limits of the plot. If you manually set xlim and ylim values, this acts as a zoom function. Since the data is not trimmed to these limits prior to plotting, the 2D kernel density estimate is computed over the full data range (which extends beyond that of the plot). This means that each bin within the plot is actually stretched to accommodate the extra data - resulting in stretched, box-like colours.
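The stretching effect can be seen with a little arithmetic. The sketch below assumes a fixed grid of 128 bins laid over the full data range on one axis; the two ranges are hypothetical values on the plotted scale, chosen only to make the effect obvious.

```r
# Number of bins along one axis (the default quoted in this thread).
nbin <- 128

# Hypothetical ranges: the data spans far more than the zoomed plot window.
data_range <- c(0, 40)
plot_range <- c(0, 5)

# Bin width is determined by the full data range, not by the plot limits.
bin_width <- diff(data_range) / nbin

# Number of bins that actually fall inside the visible window.
visible_bins <- diff(plot_range) / bin_width
visible_bins
```

Here only 16 of the 128 bins land inside the plot, so each one is drawn as a large coloured block.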

The second issue is difficult to resolve. One solution I tried was to gate the data under the hood to ensure that all data passed to the 2D kernel estimator is within the plot limits. This works, but gating is computationally expensive and adds a significant amount of time to the overall plotting time. The benefit of doing things this way is that we don't have to assign colours to or plot points that will not be visible on the plot anyway. In my test case, I have a small proportion of events outside the plot region and so gating is still more expensive. I think that this will be the most common case as it doesn't make sense to exclude large amounts of data from the plotting region using xlim and ylim.

So instead of gating, I have adjusted the number of bins based on the range of the data compared to that of the plotting region. This prevents the bins from being stretched to accommodate data outside the plotting region and is relatively quick to calculate. This seems to work quite well in testing. Ideally we would also trim the data to the plotting region prior to computing the bandwidth, but this is just as costly as gating - so we instead use all the data to compute the bandwidth for speed. Again based on common cases, the bandwidth should not deviate much if there is a low proportion of events outside the plotting region.
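One way to read the bin adjustment described above is as a simple rescaling: multiply the bin count by the ratio of the data range to the plot range, so that roughly the intended number of bins falls inside the plot. The helper `adjust_nbin()` below is hypothetical and is not a CytoExploreR function; the actual implementation may differ.

```r
# Hypothetical helper (not part of CytoExploreR): scale the bin count so that
# roughly `nbin` bins fall inside the plot limits on one axis.
adjust_nbin <- function(nbin, data_range, plot_range) {
  ratio <- diff(data_range) / diff(plot_range)
  # Leave the bin count unchanged when the data fits inside the plot limits.
  ceiling(nbin * max(1, ratio))
}

# Data spanning 8 times the plot window needs 8 times as many bins.
adjust_nbin(200, data_range = c(0, 40), plot_range = c(0, 5))

# Data inside the plot window keeps the requested bin count.
adjust_nbin(200, data_range = c(1, 4), plot_range = c(0, 5))
```

The trade-off is that a very large ratio produces a very large grid, which is presumably why gating extreme events first is still recommended.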

Anyway, there is still some optimisation work to do, and if I can find a faster way to gate the data, that would resolve all of these issues. If you do encounter any such problems whilst plotting, the recommendation is to gate the data to be within the plotting region (i.e. use a boundary gate to exclude extreme events) and re-plot it.

DillonHammill commented 3 years ago

After thinking about this a bit more, I think it is best to NOT trim the data to supplied axes limits. If the data is trimmed too early, the computed statistics will no longer be accurate estimates for the entire data (i.e. deviate from cyto_stats_compute()). So for now, I think the best approach is to fix the bin issue when there is data outside the plot limits and use the default bandwidth estimator (which according to grDevices::densCols() is bw = diff(range(x))/25). I have considered more efficient ways of estimating the bandwidth using the plot limits but this is not a viable solution as plot limits can be altered significantly by users (i.e. difficult to figure out data range from axes limits - can be altered by xlim, ylim and axes_limits_buffer in different ways).

DillonHammill commented 2 years ago

This is now fixed in CytoExploreR version 2.0.0 (coming soon).

I have increased the resolution of the grid to 250 x 250 to remove the grid-like appearance in large plots. It turns out some issues were coming from data outside the plot limits; this has now been fixed as well. I no longer use densCols() as its default bandwidth choice is not ideal - this is all taken care of by the new cyto_stat_bkde2d().

Be sure to check out the new key_ arguments in cyto_plot() as well, since we can now set the same colour scale for each plot (making it easy to compare samples) - key_scale = "fixed" is now the default for cyto_plot(). I have also added a point_col_smooth argument that can be used to toggle between raw binned counts and bkde smoothed counts (the default). Setting point_col_smooth = FALSE can increase plotting speed but produces a grainier appearance in the plots.
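To make the difference between raw binned counts and smoothed counts concrete, here is a minimal sketch on simulated data. It uses `KernSmooth::bkde2D()` as a stand-in for the smoothing step; the internals of `cyto_stat_bkde2d()` are not shown in this thread, so this is an assumption about the general approach, not its implementation. The bandwidth uses the `diff(range(x))/25` rule quoted earlier in the thread.

```r
# Simulated two-channel data (hypothetical, for illustration only).
set.seed(42)
xy <- cbind(x = rnorm(5000), y = rnorm(5000))
grid <- c(250L, 250L)

# Raw binned counts: assign each event to a cell of a 250 x 250 grid.
bx <- cut(xy[, 1], breaks = grid[1], labels = FALSE)
by <- cut(xy[, 2], breaks = grid[2], labels = FALSE)
counts <- table(
  factor(bx, levels = seq_len(grid[1])),
  factor(by, levels = seq_len(grid[2]))
)

# Smoothed counts: binned 2D kernel density estimate on the same grid size.
bw <- apply(xy, 2, function(v) diff(range(v)) / 25)
smooth <- KernSmooth::bkde2D(xy, bandwidth = bw, gridsize = grid)
```

Most cells of `counts` are empty or hold a single event, which is what gives the grainy look, whereas `smooth$fhat` varies gradually across the grid.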