CompEpigen / DecompPipeline

Large automated pipeline for running the MeDeCom
GNU General Public License v3.0

Mitigate excessive filtering of sites #4

Closed. PoisonAlien closed this issue 5 years ago

PoisonAlien commented 5 years ago

Hi @lutsik and @schmic05,

I am running DecompPipeline on EPIC arrays, and it seems that most of the probes (~90%) are removed by the intensity filtering. Here is the log:

> md_res <- DecompPipeline::start_decomp_pipeline(rnb.set = tall_rnb_set,
                                Ks = 2:10,
                                lambda.grid = c(0.01,0.001),
                                factorviz.outputs = TRUE,
                                marker.selection = c("houseman2012","var"),
                                n.markers = 30000,
                                min.n.beads = 2,
                                min.int.quant = 0.05,
                                max.int.quant = 0.95,
                                filter.na = TRUE,
                                filter.snp  = TRUE,
                                filter.context = FALSE,
                                filter.somatic = FALSE,
                                normalization = "wm.dasen", cores = 7)

## 2019-03-01 10:48:40    22.8    INFO 76147 sites removed in bead count filtering.
## 2019-03-01 10:49:08    24.0    INFO 738009 sites removed in intensity filtering.
## 2019-03-01 10:49:11    25.2    INFO 0 sites removed in NA filtering
## 2019-03-01 10:49:11    25.2    INFO 7832 sites removed in SNP filtering
## 2019-03-01 10:49:11    25.2    INFO Removing 821988 sites, retaining  44907
## [1] "Did not write the variable dump: should only be executed from an environment with all the variables set"
## [2019-03-01 10:51:04, Main:] checking inputs
## [2019-03-01 10:51:04, Main:] preparing data
## [2019-03-01 10:51:04, Main:] preparing jobs
## [2019-03-01 10:51:04, Main:] 396 factorization runs in total
## [2019-03-03 01:07:35, Main:] finished all jobs. Creating the object

Is this a bug? I lowered the min.n.beads argument to 2, and yet most of the probes were removed. Should I disable any of the arguments?

In case you need to know, the input tall_rnb_set is an RnBeads object created with the following commands.

> RnBeads::rnb.options(
  identifiers.column = "Sample_Name",
  disk.dump.big.matrices = FALSE,
  normalization.method = "bmiq",
  normalization.background.method = "methylumi.noob",
  filtering.cross.reactive = TRUE, filtering.snp = "3",
  inference.reference.methylome.column="Cell_Type",
  import.table.separator = "\t")

> logger.start(fname = NA)
> data_source = c(idat_dir, sample_anno)

> tall_rnb_set <- RnBeads::rnb.run.import(data.source = data_source, 
                                        data.type = "idat.dir", dir.reports = report_dir)

## 2019-03-01 10:27:44     1.5  STATUS STARTED Loading Data
## 2019-03-01 10:27:44     1.5    INFO     Number of cores: 7
## 2019-03-01 10:27:44     1.5    INFO     Loading data of type "idat.dir"
## 2019-03-01 10:27:45     1.5  STATUS     STARTED Loading Data from IDAT Files
## 2019-03-01 10:27:45     1.5    INFO         Added column barcode to the provided sample annotation table
## 2019-03-01 10:27:49     1.5    INFO         Detected platform: MethylationEPIC
## 2019-03-01 10:32:04    10.9  STATUS     COMPLETED Loading Data from IDAT Files
## 2019-03-01 10:43:54    13.0  STATUS     Loaded data from /home/anand/C010-Datasets/Internal/Aurore_TALL/01_raw_data/EPIC_arrays/idat/
## 2019-03-01 10:43:58    19.6  STATUS     Predicted sex for the loaded samples
## 2019-03-01 10:44:01    15.1  STATUS     Added data loading section to the report
## 2019-03-01 10:44:01    15.1  STATUS     Loaded 160 samples and 866895 sites
## 2019-03-01 10:44:01    15.1    INFO     Output object is of type RnBeadRawSet
## 2019-03-01 10:44:01    15.1  STATUS COMPLETED Loading Data

schmic05 commented 5 years ago

Hi @PoisonAlien, thanks for reporting this. It is not a bug in the software: DecompPipeline filters probes very stringently. For each probe, it checks whether any sample falls outside the intensity quantile range, and it removes the probe if even a single sample fails the criterion. Setting min.int.quant=0.05 and max.int.quant=0.95 is therefore pretty stringent. If you want to remove fewer sites, I'd recommend values of 0.01/0.99 or even 0.001/0.999. The min.n.beads option is independent of the intensity filtering.
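
To make the effect of the "any sample outside the range" rule concrete, here is a minimal base R sketch on simulated data. It is not DecompPipeline's code: the names `keep_probes` and `frac_kept`, the simulated intensities, and the choice to compute the quantile bounds separately for each sample are all assumptions made for illustration.

```r
# Toy illustration only; this is NOT DecompPipeline's implementation.
set.seed(1)
n_probes  <- 5000   # hypothetical, far fewer than a real EPIC array
n_samples <- 160    # matches the sample count in the import log

# Simulated total intensities, independent across samples (an assumption;
# real samples are correlated, so real data loses fewer probes than this).
intensity <- matrix(rlnorm(n_probes * n_samples), nrow = n_probes)

# Keep a probe only if it lies strictly between the lo and hi quantile
# bounds in every sample. Bounds are computed per sample (an assumption).
keep_probes <- function(x, lo, hi) {
  q <- apply(x, 2, quantile, probs = c(lo, hi))   # 2 x n_samples matrix
  inside <- sweep(x, 2, q[1, ], ">") & sweep(x, 2, q[2, ], "<")
  rowSums(inside) == ncol(x)
}

# Fraction of probes that survive the filter for a given quantile range
frac_kept <- function(lo, hi) mean(keep_probes(intensity, lo, hi))

kept_05  <- frac_kept(0.05, 0.95)
kept_01  <- frac_kept(0.01, 0.99)
kept_001 <- frac_kept(0.001, 0.999)
print(c(kept_05 = kept_05, kept_01 = kept_01, kept_001 = kept_001))
```

With independent samples, the retained fraction is roughly (hi - lo)^n_samples: about 0.9^160 for 0.05/0.95, which is effectively zero, versus about 0.998^160 ≈ 0.73 for 0.001/0.999. Real arrays are far from independent, so the actual loss is smaller, but the trend is the same. In the original call, only min.int.quant and max.int.quant would need to change.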

PoisonAlien commented 5 years ago

Got it! Thank you.