NCBI-Hackathons / Scan2CNV

MIT License

process idat files with R package gsrc #40

Open ekarlins opened 7 years ago

ekarlins commented 7 years ago

I've started looking into this again since my ultimate goal is to be able to start the pipeline from idat files without needing to use an Illumina GUI. This R package would allow for that. I've figured out how to process an Illumina csv manifest and output two annotation files that are used in the vignette as "dictionary" and "chrPos". These are simply data frames that can be created by reading in the csv files produced by my script "scripts/IllManifest2gsrc.py". I'll add the GSA example files to the repo.

ekarlins commented 7 years ago

Now we should be able to follow the gsrc vignette with the appropriate annotation. Here's how you can load the appropriate data frames into R for the annotation:

dictionary <- read.csv("files/gsrc/GSA_dictionary.csv", colClasses=c("integer", "character"))
chrPos <- read.csv("files/gsrc/GSA_chrPos.csv", colClasses=c("character", "character", "integer"))

Note, it's important to specify "colClasses" for the "chrPos" object. Without "colClasses", the "chromosome" column gets read in as a factor variable and fails the "is.vector" test in the function "read_intensities".
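
For illustration, here's a quick way to see the factor problem (a sketch; it assumes the pre-R-4.0 default of stringsAsFactors = TRUE in read.csv):

## without colClasses the chromosome column comes in as a factor, and a factor
## fails is.vector() because it carries class/levels attributes
chrPos_default <- read.csv("files/gsrc/GSA_chrPos.csv")
is.vector(chrPos_default$chromosome)   # FALSE when read as a factor

chrPos <- read.csv("files/gsrc/GSA_chrPos.csv",
                   colClasses = c("character", "character", "integer"))
is.vector(chrPos$chromosome)           # TRUE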

ekarlins commented 7 years ago

I'm going to continue to work on this issue, but others should feel free to as well. My goal is to create two processes: 1) creating an object from many samples that can be used for normalization. 2) creating a way to process one sample, normalizing with the object from 1) above and outputting appropriate data for just the one sample.
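
A very rough sketch of what those two pieces might look like (the function names here are hypothetical placeholders, not existing gsrc functions):

## 1) build a normalization reference object from many samples
build_norm_reference <- function(idat_files, dictionary, chrPos) {
  ## read all reference samples; a stripped-down summary of this object
  ## is what we'd ultimately want to store
  read_intensities(files = idat_files, dict = dictionary,
                   cnames = basename(idat_files), pos = chrPos)
}

## 2) process one sample, normalizing against the stored reference
process_one_sample <- function(sample_idats, reference, dictionary, chrPos) {
  ## read just this sample, normalize it using the reference, then
  ## compute theta, LRR, and BAF for the one sample
  stop("not implemented yet")
}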

ngiangre commented 7 years ago

@ekarlins can you describe the normalization rationale? I know when I initially coded this we used a test PennCNV input file, and I didn't really know exactly what it represented. I'll try to work on this, but I was wondering: if one loads multiple of those input files, what are you actually normalizing?

ekarlins commented 7 years ago

@ngiangre, the normalization is needed to calculate theta. Theta is needed to calculate LRR and BAF. For the PennCNV input txt files that I created from the gtc files, the normalization to create theta was done by the Illumina software that created the gtc files. So there was already a step that included a lot of samples for normalization (the "a lot of samples" was in the egt file produced by GenomeStudio).

For gsrc, if we want to create theta, we need to give it a bunch of samples. We need theta before we can get LRR and BAF. So this is about figuring out a way to generate theta without using the Illumina software at all.

Does that make sense?
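
For reference, a rough sketch of the standard Illumina-style definitions (these are the usual conventions, not gsrc's exact code, which may differ in details):

## x = intensity for the A allele, y = intensity for the B allele at one SNP
theta_from_xy <- function(x, y) 2 / pi * atan2(y, x)   # ~0 = AA, ~0.5 = AB, ~1 = BB
r_from_xy     <- function(x, y) x + y                   # total intensity R
## LRR = log2(R_observed / R_expected), where R_expected comes from the
## reference clusters at this theta; BAF rescales theta by interpolating
## between the canonical AA, AB, and BB cluster positions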

FYI, I ran through the gsrc vignette last night up to generating BAF and LRR on our data. There were a couple things that didn't function as expected, though I found fixes. This may have been because of the small number of input samples. Not sure. Here's the code I used to get the vignette working:

setwd("Scan2CNV")
library(gsrc)
dictionary <- read.csv("files/gsrc/GSA_dictionary.csv", colClasses=c("integer", "character"))
chrPos <- read.csv("files/gsrc/GSA_chrPos.csv", colClasses=c("character", "character", "integer"))
files <- list.files("files", pattern = "idat", full.names = TRUE)
## use the idat file names (minus the directory path) as column names
column_names <- sapply(strsplit(files, split = "/"), FUN = function(x) x[length(x)])
head(dictionary)
head(chrPos)
raw_data <- read_intensities(files = files, 
                             dict = dictionary, 
                             cnames = column_names, 
                             pos = chrPos, beads = FALSE,
                             sd = FALSE)
str(raw_data)

boxplot(as.vector(raw_data$raw[, seq(1, length(raw_data$samples), 2)]),
        as.vector(raw_data$raw[, seq(2, length(raw_data$samples), 2)]),
        names = c("Green", "Red"))

norm_dat <- intens_theta(raw_data, norm = "both", scaling = "mean", transf = "log")
str(norm_dat)

head(norm_dat$samples)
## There's something weird where it's giving a bunch of NA values for samples, so keep only the first 3.
norm_dat$samples <- head(norm_dat$samples, 3)

norm_dat <- remove_suffix(norm_dat, "_Grn.idat")
head(norm_dat$samples)
hist(norm_dat$intensity, breaks = 1000)
hist(norm_dat$theta, breaks = 1000)

## The function geno_baf_rratio seems to expect a row for each SNP and a column for each sample.
## The matrices for theta and intensity seem to be the opposite, so reshape them.
## (Note: if these really are 3 x 700078 matrices, t() would be the safer reshape,
## since dim<- only relabels dimensions without reordering the values.)
dim(norm_dat$theta) <- c(700078, 3)
dim(norm_dat$intensity) <- c(700078, 3)

norm_dat <- geno_baf_rratio(norm_dat)

str(norm_dat)

hist(norm_dat$baf, breaks = 1000)
tmp <- table(norm_dat$geno, useNA = "ifany")
barplot(tmp, names.arg = c(names(tmp)[1:4], "NA"))
hist(norm_dat$rratio, breaks = 1000)

ngiangre commented 7 years ago

Yeah, it makes sense that we need theta to get BAF and LRR.

During the hackathon we were trying to estimate theta, but we didn't have the dictionary. Now that you've made it, we should be able to go straight from idat files to CNV calling via gsrc.

So in generating theta we need to supply a lot of samples. OK, sounds good. Let me know if you want me to try something.

ekarlins commented 7 years ago

I'm going through the code of the normalization functions in "gsrc" and trying to figure out how they work, to see if there are ways to run these functions (or similar functions) one sample at a time. I know certain steps benefit from having many samples, let's say 1000. So one approach would be to create one object containing 1000 samples, and each time we run one sample, also load that 1000-sample object. If we are running 150k samples and only have to create the 1000-sample object once, that can provide some computational benefits. I'm hoping it's possible to strip down this object made from 1000 samples, though. Currently I estimate it would take about 90 GB of RAM to load into R.
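
A rough back-of-the-envelope check on that estimate (assuming double-precision values; exactly what the object stores is still open):

n_samples <- 1000; n_snps <- 700078; n_channels <- 2
n_samples * n_snps * n_channels * 8 / 1024^3
## ~10.4 GB for one raw-intensity matrix; several such matrices
## (raw, theta, intensity, plus working copies) push toward the ~90 GB estimate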

The first normalization comes from this function:

library(gsrc)
intens_theta <- function(raw, norm = "quantile",  scaling = "mean", transf = "log", pn = 2)

The vignette runs it with:

norm = "both"

You can see the code for this function here: https://github.com/cran/gsrc/blob/master/R/preprocess.R

The normalization steps are in lines 41 to 47 of the link above. They use this function for norm = "both":

normalize <- function(raw) {
  ## quantile-normalize each consecutive pair of columns together
  for (i in seq(2, ncol(raw), 2)) {
    raw[, c(i - 1, i)] <-
      preprocessCore::normalize.quantiles(raw[, c(i - 1, i)])
  }
  ## then scale so the median absolute values match across columns
  limma::normalizeMedianAbsValues(raw)
}

For the object raw the rows are the samples and the columns are the SNPs.

dim(raw)
##[1]      6 700078

So for preprocessCore::normalize.quantiles it's doing this quantile normalization one SNP at a time, using red and green intensities from all of the samples. So it's a normalization across 1000 samples in our example.

The next normalization function is this:

## source of limma::normalizeMedianAbsValues
normalizeMedianAbsValues <- function(x) {
    narrays <- NCOL(x)
    if (narrays == 1)
        return(x)
    cmed <- log(apply(abs(x), 2, median, na.rm = TRUE))
    cmed <- exp(cmed - mean(cmed))
    t(t(x) / cmed)
}

The 2 in the "apply" call indicates that this is running one column, or one SNP, at a time, this time handling red and green separately. So again, it's a normalization across the 1000 samples.
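
A tiny, made-up illustration of what that scaling does (two columns that differ by 10x end up with the same median absolute value):

x <- cbind(a = c(1, 2, 4), b = c(10, 20, 40))
cmed <- log(apply(abs(x), 2, median, na.rm = TRUE))
cmed <- exp(cmed - mean(cmed))   # 0.316 and 3.162: each column's scale factor
t(t(x) / cmed)                   # both columns now have median ~6.32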

It looks like the rest of the calculations in "intens_theta" would be the same whether running just one sample or many samples (maybe someone can confirm this).

So what I'd like to figure out is if it's possible to do the two normalization steps above without storing the full info from the matrix of 1000 samples.

For "normalizeMedianAbsValues" it seems like we could just calculate "cmed" and store it, which has a median value for each probe. Then we could just run that last line:

t(t(x)/cmed)

to give the normalized values. Here "x" is the intensity matrix after quantile normalization.
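
A minimal sketch of that idea for one new sample (everything here is illustrative: the stored per-SNP factors, the toy values, and the assumption that factors computed from the 1000-sample set transfer to new samples):

set.seed(1)
cmed  <- runif(700078, 0.5, 2)                # per-SNP scale factors, in practice
                                              #   computed once from the 1000 samples
x_new <- matrix(runif(700078, 100, 6000), 1)  # one new sample's quantile-normalized
                                              #   intensities, SNPs in the same order
x_norm <- t(t(x_new) / cmed)                  # same operation as the function's last line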

I spoke with a statistician today who I'm hoping will give some more insight on whether it's possible to run quantile normalization using only partial information from the set of 1000 samples. If we just keep the deciles, for example, would it work as well?
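
One possible shape for the deciles idea, mapping a new sample onto stored reference quantiles by rank (purely a sketch of the concept, not gsrc code; whether it's statistically adequate is exactly the question for the statistician):

set.seed(2)
ref  <- rnorm(1000, mean = 3000, sd = 500)      # stand-in for the reference intensities
dec  <- quantile(ref, probs = seq(0, 1, 0.1))   # store only the deciles
newv <- rnorm(1000, mean = 2800, sd = 600)      # one new sample
## replace each new value by the reference value at the same quantile
## position, interpolating between the stored deciles
normv <- approx(x = seq(0, 1, 0.1), y = dec,
                xout = (rank(newv) - 0.5) / length(newv))$y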

ngiangre commented 7 years ago

What you're describing, where we keep only some instead of all of the information from the 1000 samples, sounds like asking whether we can store a sufficient statistic of the distributions (thinking of each SNP as 'emitting' 1000 random variables, giving as many distributions as there are SNPs), which would represent the data without storing all of the extra dimensionality. Great idea to talk to a statistician; they would know the answer best.

We can make a parallelized version of this using 'mclapply' (from the parallel package), which can take advantage of multiple cores. Maybe you can try it out on your cluster? At least it'll be faster, but it won't avoid the memory issue...

We can delete the object after reducing the matrix, so at least it's not in memory for long. I'll have to think more about this.
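
For illustration, a minimal mclapply sketch on toy data (this parallelizes the per-pair loop from gsrc's normalize, but as noted it doesn't reduce peak memory):

library(parallel)
set.seed(3)
raw <- matrix(runif(6 * 1000, 100, 6000), nrow = 6)   # toy stand-in for the raw matrix
pair_starts <- seq(2, ncol(raw), 2)
norm_pairs <- mclapply(pair_starts, function(i) {
  preprocessCore::normalize.quantiles(raw[, c(i - 1, i)])
}, mc.cores = 2)
raw_norm <- do.call(cbind, norm_pairs)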

ekarlins commented 7 years ago

Nick, yes. You get what I'm saying about storing just the relevant info for the 1000 samples.

As far as multithreading in R, my hunch is that it's not worth spending a lot of effort on this. Other programming languages are much better at parallel processing. And, as you mentioned, we'd still have a lot of data stored in memory, which makes it hard for R to run efficiently. Plus, if you are running on a cluster, the R job will likely only know about the cores available on the same node as your job, which doesn't take advantage of the power of having an HPC cluster.

For our purposes, we're running parallel processing on a compute cluster or potentially on a cloud. A much cleaner way for us to parallelize R code is to submit a separate job for each appropriate unit of work. This is what Snakemake is doing for our pipeline, and I could imagine writing a CWL version of the Snakefile for cloud computing.

To that end, a simpler way of shrinking the data for the 1000 samples may just be to split it by SNP or genomic region for steps processed across samples, and to split it by sample for steps processed within samples and across SNPs.
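
A small sketch of that splitting, writing per-SNP-chunk files that each across-sample job could load independently (chunk size and file names are arbitrary choices for illustration):

set.seed(4)
raw <- matrix(runif(6 * 1000, 100, 6000), nrow = 6)   # toy: 3 samples x 2 channels, 1000 SNPs
chunk_size <- 100
chunks <- split(seq_len(ncol(raw)), ceiling(seq_len(ncol(raw)) / chunk_size))
for (k in seq_along(chunks)) {
  saveRDS(raw[, chunks[[k]], drop = FALSE],
          file = sprintf("raw_chunk_%03d.rds", k))    # one file per chunk = one job
}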

If we're thinking about scaling this pipeline to 1 million or more samples, running one job per SNP (700k jobs for GSA) shouldn't be a big deal, and spinning off 700k jobs at once on the cloud should also be a good fit. For HPC clusters, whether it's efficient to run this many jobs at once will depend on the cluster, so maybe we'll need options for how the data gets split.

Splitting jobs by SNP will also allow us to use data from all of the samples as part of the normalization, not just 1000 samples, which may not represent all possible batch effects.

Eric

ngiangre commented 7 years ago

Sounds great! I bet Snakemake has great functionality for spinning off that many jobs. It seems like we should change the functions to accommodate this. But would we output the values to temporary files before bringing them all together again?

ekarlins commented 7 years ago

@ngiangre, yes, for Snakemake we'd have to write to files after each step and bring it all together. So I/O would certainly be a cost of doing it this way. The time to submit a lot of jobs could be another cost. But yes, we can make the intermediate files temp files.

Yep, it will involve rewriting the functions. I'm not sure "gsrc" is the best package to use for this. More on that in a separate comment.

ekarlins commented 7 years ago

So I've been talking to a statistician about quantile normalization and about the low-level operations in "gsrc". I think we uncovered an issue with "gsrc" that makes it not suitable for our purposes. It's possible that this is because the package is really only for "Genome Structure Rearrangement Calling in Genomes with High Synteny", which is not what we're looking for, so maybe it's not flexible enough for general CNV calling.

The quantile normalization step seems to be running one SNP at a time, across all samples, and including both the Red and Green signal. Suppose you have a SNP where all samples have the AA genotype. Your matrix of Red and Green signals, across samples at this SNP may look like this:

raw <- cbind(runif(10, 6000, 6001), runif(10, 100, 101))
raw
          [,1]     [,2]
 [1,] 6000.348 100.0071
 [2,] 6000.868 100.7038
 [3,] 6000.544 100.7438
 [4,] 6000.777 100.8689
 [5,] 6000.987 100.4893
 [6,] 6000.501 100.3422
 [7,] 6000.239 100.2562
 [8,] 6000.615 100.6452
 [9,] 6000.906 100.9429
[10,] 6000.315 100.3039

The first column indicates the signal for "A" allele and the second the signal for "B" allele. So all samples are clearly "AA". Now you run the quantile normalization using the code from "gsrc" and you get this:

> raw2 <- preprocessCore::normalize.quantiles(raw)
> raw2
          [,1]     [,2]
 [1,] 3050.326 3050.123
 [2,] 3050.806 3050.740
 [3,] 3050.517 3050.806
 [4,] 3050.740 3050.888
 [5,] 3050.965 3050.517
 [6,] 3050.422 3050.422
 [7,] 3050.123 3050.286
 [8,] 3050.630 3050.630
 [9,] 3050.888 3050.965
[10,] 3050.286 3050.326

What?! It makes them all look like "AB" genotypes. This doesn't seem right. (Quantile normalization forces the two columns to share the same distribution: each value is replaced by the mean of the two sorted columns at its rank, so a ~6000/~100 pair becomes ~3050/~3050, wiping out the genotype signal at this SNP.)

ngiangre commented 7 years ago

Yeah that's really strange. Do you think we should change the functions or look for another package?

ekarlins commented 7 years ago

I think we should explore other packages and then come back to that question. Crlmm is a good one to look at. Any others for the normalization steps?

ngiangre commented 7 years ago

I’ll look at normalization with crlmm and some other packages. What specifically should I be looking for?

ekarlins commented 7 years ago

I think the main thing right now is to show which function(s) are being used for normalization and the "unit" that each one normalizes. So for gsrc we would say that it runs preprocessCore::normalize.quantiles one SNP at a time, across all samples, including both red and green intensities. Then it runs limma::normalizeMedianAbsValues one SNP at a time, across all samples, separately for red and green.

ngiangre commented 7 years ago

Ok, cool. I'll try to search through some functions.

ekarlins commented 7 years ago

Started issue #41 for crlmm normalization.