[REQ] data format #85 (HelenaLC/CATALYST)

joe-jhou2 commented 4 years ago

It's my first time using the prepData function. It seems the format of the panel and md data is critical for this wrapper.

Can you give users some flexibility in the column names of the panel and md data? And if we have multiple conditions (time, treatment, etc.) in md, how should they be arranged?

HelenaLC commented 4 years ago

You can pass any metadata column names to argument md_cols. Required are at least

a column matching the FCS file names (default file = "file_name")
unique sample identifiers (default id = "sample_id")
minimum one factor of interest (default factors = c("condition", "patient_id"))

Thus, in your case, I'd recommend something like

md_cols = list(
    file = "file_name", 
    id = "sample_id", 
    factors = c("treatment", "time", ...))

Please feel free to come back with questions if you need additional advice, but do see the function's documentation first, thanks!
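For a design with several experimental variables, the metadata table and the matching md_cols list might look like the sketch below. The file names, sample IDs and the treatment/time values are made up for illustration; only the roles file, id and factors come from the description above. The prepData() call is left as a comment because it needs an actual flowSet and panel.

```r
# Toy metadata: one row per FCS file, with two factors of interest.
# All values here are illustrative placeholders.
md <- data.frame(
    file_name = c("s1.fcs", "s2.fcs", "s3.fcs", "s4.fcs"),
    sample_id = c("ctrl_t0", "ctrl_t1", "drug_t0", "drug_t1"),
    treatment = c("ctrl", "ctrl", "drug", "drug"),
    time      = c("t0", "t1", "t0", "t1"),
    stringsAsFactors = FALSE)

# Map the roles expected by prepData() onto the column names used in 'md'.
md_cols <- list(
    file = "file_name",
    id = "sample_id",
    factors = c("treatment", "time"))

# Sanity checks before calling prepData():
# every referenced column exists, and sample IDs are unique.
missing_cols <- setdiff(unlist(md_cols), colnames(md))
ids_unique <- !anyDuplicated(md[[md_cols$id]])

# With a real flowSet 'fs' and panel table 'panel', the call would then be:
# sce <- prepData(fs, panel, md, md_cols = md_cols)
```

If missing_cols is non-empty or ids_unique is FALSE, the metadata table is worth fixing before going any further.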

joe-jhou2 commented 4 years ago

I did so, but get this error:

sce = prepData(fcs_clean, panel, meta,
    cofactor = 5,
    panel_cols = list(channel = "fcs_colname", antigen = "antigen"),
    md_cols = list(file = "file_name", id = "id",
        factors = c("PID","Visit","DR1","DR3","DR4","DR7","DR11","HLA", "Group","Age","Sex")))

Error in prepData(fcs_clean, panel, meta, cofactor = 5, panel_cols = list(channel = "fcs_colname", : all(ids %in% md[[md_cols$file]]) is not TRUE


HelenaLC commented 4 years ago

Okay, so that should not be an issue with the panel/metadata column names. The following lines lead up to this error:

ids <- c(keyword(fs, "FILENAME"))
if (is.null(unlist(ids))) 
    ids <- c(fsApply(fs, identifier))
stopifnot(all(ids %in% md[[md_cols$file]]))

Thus, I suggest you check that the output of keyword(fs, "FILENAME") and/or fsApply(fs, identifier) matches exactly the file names listed in meta$file_name. Possibly there's a typo or a missing file?
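One way to run that check is to compare the two sets of names in both directions. The sketch below uses plain character vectors as stand-ins: in a real session ids would come from keyword(fs, "FILENAME") or fsApply(fs, identifier) as in the lines above, and the file names shown here are invented.

```r
# Stand-ins for the names stored in the flowSet and in the metadata table.
# In practice: ids <- c(keyword(fs, "FILENAME"))  (or the identifiers).
ids <- c("sampleA.fcs", "sampleB.fcs", "sampleC.fcs")
meta <- data.frame(
    file_name = c("sampleA.fcs", "sampleB.fcs", "sample_C.fcs"),
    stringsAsFactors = FALSE)

# Names present in the flowSet but absent from the metadata;
# any entry here makes all(ids %in% md[[md_cols$file]]) FALSE.
not_in_md <- setdiff(ids, meta$file_name)

# Names listed in the metadata that no FCS file carries;
# useful for spotting a typo on the metadata side.
not_in_fs <- setdiff(meta$file_name, ids)

print(not_in_md)
print(not_in_fs)
```

Seeing the two leftovers side by side usually makes a typo or a path prefix obvious.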

SamGG commented 4 years ago

Hi,

I got a related issue when reading a flowSet. The keyword FILENAME is the full path when reading with a command like fs <- read.flowSet(path = i_dir, pattern = "fcs$", transformation = FALSE, truncate_max_range = FALSE). The identifier corresponds to the basename of each file. But I would rather set file_name to the basename of the FCS file in the md file, especially if I think of moving the FCS files. Maybe you consider that I should set the working directory to the directory where the FCS files are, in which case there is no problem. I think this should be clarified.

Samuel

NB: The keyword cannot be NULL (https://github.com/RGLab/flowCore/blob/2d673febf58af83becdc23f4f6c171d60f5d8694/R/IO.R#L351), so the identifier is never used (unless flowCore changes).

HelenaLC commented 4 years ago

Agree, this wasn't implemented well, and there have been >10 issues related to this. With the next release, filename checking will be implemented as follows:

    # check that filenames or identifiers 
    # match b/w 'flowSet' & metadata
    ids0 <- md[[md_cols$file]]
    ids1 <- basename(keyword(fs, "FILENAME"))
    ids2 <- fsApply(fs, identifier)
    check1 <- all(ids1 %in% ids0)
    check2 <- all(ids2 %in% ids0)
    ids_use <- which(c(check1, check2))[1]
    ids <- list(ids1, ids2)[[ids_use]]
    if (is.null(ids)) {
        stop("Couldn't match 'flowSet'/FCS filenames\n", 
            "with those listed in 'md[[md_cols$file]]'.")
    } else {
        # reorder 'flowSet' frames according to metadata table
        fs <- fs[match(md[[md_cols$file]], ids)]
    }

I.e., we check both the identifier and filenames (basename!), and test if one of them matches. This type of error should thus disappear, unless neither is specified.
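To see the effect of this logic without loading any FCS files, the same steps can be run on toy vectors. In the sketch below, ids1 and ids2 are hand-written stand-ins for what keyword(fs, "FILENAME") and fsApply(fs, identifier) would return, the paths are invented, and frames is only a placeholder for the flowSet. The explicit is.na() test is a choice made for this sketch, not part of the snippet above.

```r
# Metadata lists base names, in a different order than the files were read.
md <- data.frame(
    file_name = c("b.fcs", "a.fcs", "c.fcs"),
    stringsAsFactors = FALSE)

# Stand-in for the FILENAME keyword: full paths, reduced to base names.
ids1 <- basename(c("/data/run1/a.fcs", "/data/run1/b.fcs", "/data/run1/c.fcs"))
# Stand-in for the identifiers: here they do not match the metadata.
ids2 <- c("A", "B", "C")

ids0 <- md$file_name
check1 <- all(ids1 %in% ids0)
check2 <- all(ids2 %in% ids0)

# Index of the first successful check (NA if both fail).
ids_use <- which(c(check1, check2))[1]
if (is.na(ids_use))
    stop("Couldn't match file names with those listed in the metadata.")
ids <- list(ids1, ids2)[[ids_use]]

# Placeholder for the flowSet frames, reordered to follow the metadata rows.
frames <- c("frame_a", "frame_b", "frame_c")
frames <- frames[match(md$file_name, ids)]
print(frames)
```

Because the base names match, the full-path keyword is accepted and the frames end up in metadata order.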

SamGG commented 4 years ago

That's great. Thanks a lot.

joe-jhou2 commented 4 years ago

Hi Helena,

I get the error below. I tried checking the levels of every variable in the meta file, but I have no clue which one has duplicated levels. Can you help me interpret this error? Thanks.

Tetramer_sce = prepData(fcs_clean, panel, meta,
    cofactor = 5,
    panel_cols = list(channel = "fcs_colname", antigen = "Antigen"),
    md_cols = list(file = "FileName", id = "id",
        factors = c("PID","Visit","HLA1","Group",
            "Peptide","Epitope","Alt","Age","Sex","GroupVisit")))

Error in `levels<-`(`*tmp*`, value = as.character(levels)) : factor level [2] is duplicated


markrobinsonuzh commented 4 years ago

@mimisikai just to comment that, in general, "good questions get good answers" .. so the more details you can provide (e.g., an example that one of us can run to reproduce the error) will go a long way to help us and be faster for you to receive help.

Could you give as many details as is reasonable ? For example, the outputs of:

head(panel)

head(meta)

colnames(fcs_clean)

table(meta$id)

sessionInfo()

joe-jhou2 commented 4 years ago

Thanks. FYI.

head(panel)
  fcs_colname Isotype Metal Antigen marker_class
1     Pr141Di     141    Pr    CCR6         type
2     Nd143Di     143    Nd    CD39         type
3     Nd144Di     144    Nd    CD73         type
4     Nd146Di     146    Nd    CD95         type
5     Sm147Di     147    Sm   CXCR5         type
6     Nd148Di     148    Nd    CCR7         type

head(meta)
    PID                                                  FileName    Assay Visit HLA1            Ag      Peptide            Epitope
1 10092          export_053118 ITN10092 tet V1 post_DR4_MPp54.fcs Tetramer    T0  DR4           Flu        MPp54          DR4_MPp54
2 10092       export_053118 ITN10092 tet V1 post_DR4_Phlp1p13.fcs Tetramer    T0  DR4  TimothyGrass        p1p13       DR4_Phlp1p13
3 10092       export_053118 ITN10092 tet V1 post_DR4_Phlp1p29.fcs Tetramer    T0  DR4  TimothyGrass        p1p29       DR4_Phlp1p29
4 10092    export_053118 ITN10092 tet V1 post_DR4_Phlp5bp7p23.fcs Tetramer    T0  DR4  TimothyGrass     p5bp7p23    DR4_Phlp5bp7p23
5 10092        export_053118 ITN10092 tet V1 post_DR4_Poa1p13.fcs Tetramer    T0  DR4 KentuckyGrass        a1p13        DR4_Poa1p13
6 10092 export_053118 ITN10092 tet V1 post_DR4_RVVP2p21VP3p16.fcs Tetramer    T0  DR4            RV VP2p21VP3p16 DR4_RVVP2p21VP3p16
             Alt DRB1 DRB2    HLA2 Group   Age  Sex       id GroupVisit
1          MPp54  DR4  DR7 DR4_DR7  SLIT 54.93 Male 10092_T0    SLIT_T0
2       Phlp1p13  DR4  DR7 DR4_DR7  SLIT 54.93 Male 10092_T0    SLIT_T0
3       Phlp1p29  DR4  DR7 DR4_DR7  SLIT 54.93 Male 10092_T0    SLIT_T0
4    Phlp5bp7p23  DR4  DR7 DR4_DR7  SLIT 54.93 Male 10092_T0    SLIT_T0
5        Poa1p13  DR4  DR7 DR4_DR7  SLIT 54.93 Male 10092_T0    SLIT_T0
6 RVVP2p21VP3p16  DR4  DR7 DR4_DR7  SLIT 54.93 Male 10092_T0    SLIT_T0

colnames(fcs_clean)
 [1] "Dy161Di" "Dy162Di" "Dy163Di" "Dy164Di" "Er166Di" "Er167Di" "Er168Di" "Er170Di" "Eu151Di" "Eu153Di" "Gd155Di" "Gd156Di" "Gd158Di"
[14] "Gd160Di" "Ho165Di" "Lu175Di" "Nd143Di" "Nd144Di" "Nd146Di" "Nd148Di" "Pr141Di" "Sm147Di" "Sm149Di" "Sm152Di" "Sm154Di" "Tb159Di"
[27] "Tm169Di" "Yb171Di" "Yb172Di" "Yb173Di" "Yb174Di" "Yb176Di"

table(meta$id)

  10092_T0 10092_T106 10092_T206 10092_T306   10124_T0 10124_T106 10124_T206 10124_T306   10290_T0 10290_T106 10290_T206 10290_T306
        12         12         12         12          5          5          5          5          6          6          6          6
  10300_T0 10300_T106 10300_T206 10300_T306   10344_T0 10344_T106 10344_T206 10344_T306   10361_T0 10361_T106 10361_T206 10361_T306
         7          7          7          7          5          5          5          5         11         11         11         11
  10443_T0 10443_T106 10443_T206 10443_T306   10493_T0 10493_T106 10493_T206 10493_T306   10592_T0 10592_T106 10592_T206 10592_T306
         4          4          4          4          5          5          5          5          5          5          5          5
  10652_T0 10652_T106 10652_T206 10652_T306   10680_T0 10680_T106 10680_T206 10680_T306   10734_T0 10734_T106 10734_T206 10734_T306
         6          6          6          6          5          5          5          5          7          7          7          7
  10784_T0 10784_T106 10784_T206 10784_T306   10822_T0 10822_T106 10822_T206 10822_T306   10833_T0 10833_T106 10833_T206 10833_T306
        12         12         12         12          4          4          4          4          9          9          9          9
  10971_T0 10971_T106 10971_T206 10971_T306   11071_T0 11071_T106 11071_T206 11071_T306   11082_T0 11082_T106 11082_T206 11082_T306
        12         12         12         12          4          4          4          4         11         11         11         11
  11131_T0 11131_T106 11131_T206 11131_T306   11170_T0 11170_T106 11170_T206 11170_T306   11202_T0 11202_T106 11202_T206 11202_T306
         5          5          5          5          7          7          7          7          6          6          6          6
  11241_T0 11241_T106 11241_T206 11241_T306   11280_T0 11280_T106 11280_T206 11280_T306   11301_T0 11301_T106 11301_T206 11301_T306
         4          4          4          4          4          4          4          4         10         10         10         10
  11384_T0 11384_T106 11384_T206 11384_T306   11450_T0 11450_T106 11450_T206 11450_T306   11532_T0 11532_T106 11532_T206 11532_T306
        11         11         11         11         12         12         12         12          8          8          8          8
  11560_T0 11560_T106 11560_T206 11560_T306   11571_T0 11571_T106 11571_T206 11571_T306   11582_T0 11582_T106 11582_T206 11582_T306
         6          6          6          6         11         11         11         11          4          4          4          4
  11620_T0 11620_T106 11620_T206 11620_T306   11631_T0 11631_T106 11631_T206 11631_T306   11642_T0 11642_T106 11642_T206 11642_T306
         4          4          4          4         12         12         12         12          6          6          6          6
  11653_T0 11653_T106 11653_T206 11653_T306   11801_T0 11801_T106 11801_T206 11801_T306   11823_T0 11823_T106 11823_T206 11823_T306
         5          5          5          5         10         10         10         10          5          5          5          5
  11933_T0 11933_T106 11933_T206 11933_T306   11944_T0 11944_T106 11944_T206 11944_T306   11961_T0 11961_T106 11961_T206 11961_T306
         4          4          4          4          6          6          6          6          4          4          4          4
  11983_T0 11983_T106 11983_T206 11983_T306   11994_T0 11994_T106 11994_T206 11994_T306   12022_T0 12022_T106 12022_T206 12022_T306
        13         13         13         13          4          4          4          4         11         11         11         11
  12110_T0 12110_T106 12110_T206 12110_T306   12270_T0 12270_T106 12270_T206 12270_T306
        12         12         12         12          5          5          5          5

sessionInfo()
R version 3.6.2 (2019-12-12)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Catalina 10.15.4

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] grid stats graphics grDevices utils datasets methods base

other attached packages:
 [1] CATALYST_1.10.2 flowCore_1.52.1 circlize_0.4.8 ComplexHeatmap_2.2.0 pheatmap_1.0.12 ggpubr_0.2.5
 [7] magrittr_1.5 tidyr_1.0.2 tibble_3.0.0 scales_1.1.0 dplyr_0.8.5 reshape2_1.4.3
[13] ggalluvial_0.11.1 ggplot2_3.3.0

loaded via a namespace (and not attached):
  [1] shinydashboard_0.7.1 R.utils_2.9.2 ks_1.11.7 tidyselect_1.0.0
  [5] htmlwidgets_1.5.1 BiocParallel_1.20.1 Rtsne_0.15 munsell_0.5.0
  [9] codetools_0.2-16 DT_0.13 withr_2.1.2 colorspace_1.4-1
 [13] flowViz_1.50.0 Biobase_2.46.0 rstudioapi_0.11 stats4_3.6.2
 [17] SingleCellExperiment_1.8.0 flowClust_3.24.0 robustbase_0.93-6 ggsignif_0.6.0
 [21] openCyto_1.24.0 labeling_0.3 GenomeInfoDbData_1.2.2 mnormt_1.5-6
 [25] farver_2.0.3 flowWorkspace_3.34.1 rprojroot_1.3-2 vctrs_0.2.4
 [29] TH.data_1.0-10 R6_2.4.1 GenomeInfoDb_1.22.1 ggbeeswarm_0.6.0
 [33] clue_0.3-57 rsvd_1.0.3 bitops_1.0-6 DelayedArray_0.12.2
 [37] assertthat_0.2.1 promises_1.1.0 multcomp_1.4-12 beeswarm_0.2.3
 [41] gtable_0.3.0 processx_3.4.2 sandwich_2.5-1 rlang_0.4.5
 [45] GlobalOptions_0.1.1 splines_3.6.2 lazyeval_0.2.2 hexbin_1.28.1
 [49] shinyBS_0.61 BiocManager_1.30.10 yaml_2.2.1 abind_1.4-5
 [53] backports_1.1.5 httpuv_1.5.2 IDPmisc_1.1.20 RBGL_1.62.1
 [57] tools_3.6.2 ellipsis_0.3.0 RColorBrewer_1.1-2 BiocGenerics_0.32.0
 [61] ggridges_0.5.2 Rcpp_1.0.4 plyr_1.8.6 base64enc_0.1-3
 [65] zlibbioc_1.32.0 purrr_0.3.3 RCurl_1.98-1.1 ps_1.3.2
 [69] FlowSOM_1.18.0 prettyunits_1.1.1 viridis_0.5.1 GetoptLong_0.1.8
 [73] cowplot_1.0.0 S4Vectors_0.24.3 zoo_1.8-7 SummarizedExperiment_1.16.1
 [77] haven_2.2.0 ggrepel_0.8.2 cluster_2.1.0 fda_2.4.8.1
 [81] ncdfFlow_2.32.0 data.table_1.12.8 openxlsx_4.1.4 mvtnorm_1.1-0
 [85] matrixStats_0.56.0 shinyjs_1.1 xtable_1.8-4 mime_0.9
 [89] hms_0.5.3 XML_3.99-0.3 rio_0.5.16 jpeg_0.1-8.1
 [93] mclust_5.4.5 readxl_1.3.1 IRanges_2.20.2 gridExtra_2.3
 [97] shape_1.4.4 ggcyto_1.14.1 compiler_3.6.2 scater_1.14.6
[101] ellipse_0.4.1 flowStats_3.44.0 KernSmooth_2.23-16 crayon_1.3.4
[105] R.oo_1.23.0 htmltools_0.4.0 corpcor_1.6.9 pcaPP_1.9-73
[109] later_1.0.0 rrcov_1.5-2 RcppParallel_5.0.0 MASS_7.3-51.5
[113] Matrix_1.2-18 car_3.0-7 cli_2.0.2 R.methodsS3_1.8.0
[117] parallel_3.6.2 igraph_1.2.5 GenomicRanges_1.38.0 forcats_0.5.0
[121] pkgconfig_2.0.3 foreign_0.8-76 plotly_4.9.2 vipor_0.4.5
[125] XVector_0.26.0 drc_3.0-1 stringr_1.4.0 callr_3.4.3
[129] digest_0.6.25 tsne_0.1-3 ConsensusClusterPlus_1.50.0 graph_1.64.0
[133] cellranger_1.1.0 DelayedMatrixStats_1.8.0 curl_4.3 shiny_1.4.0.2
[137] gtools_3.8.2 rjson_0.2.20 lifecycle_0.2.0 jsonlite_1.6.1
[141] carData_3.0-3 BiocNeighbors_1.4.2 viridisLite_0.3.0 limma_3.42.2
[145] fansi_0.4.1 pillar_1.4.3 lattice_0.20-41 fastmap_1.0.1
[149] httr_1.4.1 plotrix_3.7-7 DEoptimR_1.0-8 pkgbuild_1.0.6
[153] survival_3.1-11 glue_1.3.2 remotes_2.1.1 zip_2.0.4
[157] png_0.1-7 Rgraphviz_2.30.0 stringi_1.4.6 nnls_1.4
[161] BiocSingular_1.2.2 CytoML_1.12.1 latticeExtra_0.6-29 irlba_2.3.3


markrobinsonuzh commented 4 years ago

Here is a reproducible example:

library(CATALYST)
data(PBMC_fs, PBMC_panel, PBMC_md)
PBMC_md$sample_id[1:2] <- "BCRXL1"
prepData(PBMC_fs, PBMC_panel, PBMC_md)

> library(CATALYST)
> data(PBMC_fs, PBMC_panel, PBMC_md)
> PBMC_md$sample_id[1:2] <- "BCRXL1"
> prepData(PBMC_fs, PBMC_panel, PBMC_md)
Error in `levels<-`(`*tmp*`, value = as.character(levels)) : 
  factor level [5] is duplicated

Basically, it's because the id column (sample_id in my case) is meant to be a unique identifier of each sample (each row of your meta or my PBMC_md is a different sample), so is not allowed to be duplicated. This is NOT a bug and in the documentation ?prepData, it says:

      md: a table with column describing the experiment. An exemplary
          metadata table could look as follows:
            • ‘file_name’: the FCS file name
            • ‘sample_id’: a unique sample identifier
            • ‘patient_id’: the patient ID

So, one quick fix is to make meta$id unique .. for example:

meta$id <- paste0(meta$id, "_", seq_len(nrow(meta)))
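Before applying a fix, it can help to list which ids are repeated. The sketch below does that on a toy table and then builds the id from several columns instead of a running index. The column names PID, Visit and Epitope follow the metadata posted earlier in this thread, but treating that combination as what identifies a sample is only an assumption about the design.

```r
# Toy metadata with one row per FCS file; values are illustrative.
meta <- data.frame(
    PID     = c("10092", "10092", "10124", "10124"),
    Visit   = c("T0", "T0", "T0", "T0"),
    Epitope = c("DR4_MPp54", "DR4_Phlp1p13", "DR4_MPp54", "DR4_Phlp1p13"),
    stringsAsFactors = FALSE)
meta$id <- paste(meta$PID, meta$Visit, sep = "_")

# Which ids occur more than once? Repeated ids are what trigger the
# "factor level [n] is duplicated" error.
dup_ids <- unique(meta$id[duplicated(meta$id)])
print(dup_ids)

# Assumed fix: add the column that distinguishes the rows.
meta$id <- paste(meta$PID, meta$Visit, meta$Epitope, sep = "_")
stopifnot(!anyDuplicated(meta$id))
```

Compared with appending the row number, an id built from meaningful columns stays readable in downstream plots, at the cost of having to know which columns make a row unique.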

joe-jhou2 commented 4 years ago

Right, sorry to bug you on this. It was my negligence regarding the metadata format. I now realize that this data structure is not the same as the previous one, but I was still using the same metadata style. Thanks.

