HenrikBengtsson / affxparser

🔬 R package: This is the Bioconductor devel version of the affxparser package.
http://bioconductor.org/packages/devel/bioc/html/affxparser.html

CORE DUMP: readCel() on Calvin/v4 CEL file may crash/core dump R #13

Closed HenrikBengtsson closed 9 years ago

HenrikBengtsson commented 9 years ago

This works:

library("affxparser")
pathname <- "GSM1199232_AB227-HuGene-1_0-st-v1-01-1_A6_.CEL"
data <- readCel(pathname)  # which is the same as (see Appendix):
data <- readCel(pathname, readStdvs=FALSE, readPixels=FALSE)

but the following causes R to crash:

data <- readCel(pathname, readStdvs=TRUE, readPixels=FALSE)
data <- readCel(pathname, readStdvs=FALSE, readPixels=TRUE)
data <- readCel(pathname, readStdvs=TRUE, readPixels=TRUE)

The values of readHeader, readXY and readIntensities make no difference.

On Linux, one gets:

 *** caught segfault ***
address 0x2af97eb48000, cause 'memory not mapped'

Traceback:
 1: .Call("R_affx_get_cel_file", filename, readHeader, readIntensities, readXY, readXY, readPixels, readStdvs, readOutliers, readMasked, indices, as.integer(verbose), PACKAGE = "affxparser")
 2: readCel(pathname, readHeader = TRUE, readXY = TRUE, readIntensities = TRUE, readStdvs = TRUE, readPixels = FALSE)

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace

On Windows 64-bit, one gets:

terminate called after throwing an instance of 'std::out_of_range'
  what():  vector::_M_range_check

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.

Appendix

Arguments of readCel():

> args(readCel)
function (filename, indices = NULL, readHeader = TRUE, readXY = FALSE,
    readIntensities = TRUE, readStdvs = FALSE, readPixels = FALSE,
    readOutliers = TRUE, readMasked = TRUE, readMap = NULL, verbose = 0,
    .checkArgs = TRUE)

This CEL file can be downloaded as:

library("R.utils")
path <- "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1199nnn/GSM1199232/suppl/"
url <- file.path(path, "GSM1199232_AB227-HuGene-1_0-st-v1-01-1_A6_.CEL.gz")
pathname <- gunzip(downloadFile(url))

HenrikBengtsson commented 9 years ago

After more troubleshooting, I'm pretty sure 'GSM1199232_AB227-HuGene-1_0-st-v1-01-1_A6_.CEL' is a corrupt CEL file. More specifically, it is truncated: after the "intensities" section has been read, the next data section, "stddevs", is only partly read before the end of the file is reached (194,108 bytes are read out of the 4,410,000 expected). The same thing shows up when reading the file with the following alternative:

> library("affxparser")
> pathname <- "GSM1199232_AB227-HuGene-1_0-st-v1-01-1_A6_.CEL"
> hdr <- readCelHeader(pathname)
> nbrOfCells <- hdr$cols * hdr$rows
> nbrOfBytes <- 4 * nbrOfCells  # stored as float:s
> nbrOfBytes
[1] 4410000
> data <- readCcg(pathname)
Error in dim(raw) <- c(bytesPerRow, nbrOfRows) :
  dims [product 4410000] do not match the length of object [194108]
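
The same comparison can be made before calling readCel() at all, so that a truncated file is flagged instead of crashing R. Below is a minimal sketch in base R. The helpers minCelDataBytes() and looksTruncated() are hypothetical and not part of affxparser. The element sizes (4-byte floats for intensities and stddevs, 2-byte integers for pixels) are assumptions, and the check ignores header bytes, so it only gives a lower bound on the file size.

```r
# Hypothetical helper (not part of affxparser): lower bound on the number of
# data bytes a CEL file must hold for the requested sections.
# Assumed element sizes: intensities and stddevs are 4-byte floats,
# pixels are 2-byte integers. Header bytes are ignored.
minCelDataBytes <- function(nbrOfCells, readStdvs = FALSE, readPixels = FALSE) {
  bytes <- 4 * nbrOfCells                            # intensities
  if (readStdvs)  bytes <- bytes + 4 * nbrOfCells    # stddevs
  if (readPixels) bytes <- bytes + 2 * nbrOfCells    # pixels
  bytes
}

# Hypothetical helper: TRUE if the file on disk is smaller than the lower
# bound above, i.e. it cannot possibly contain the requested sections.
looksTruncated <- function(pathname, nbrOfCells, ...) {
  file.info(pathname)$size < minCelDataBytes(nbrOfCells, ...)
}
```

With affxparser loaded, the cell count would come from the header as above, e.g. looksTruncated(pathname, hdr$cols * hdr$rows, readStdvs=TRUE). A FALSE result does not prove the file is intact; it only rules out the most obvious truncation.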

More evidence: this sample is part of data set GSE49439. Comparing the sizes of the CEL.gz files for a few of the samples in that set, this file stands out at roughly half the size of its neighbors:

  1. http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1199231 [4.6Mb]
  2. http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1199232 [2.3Mb]
  3. http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1199233 [4.7Mb]

So, I'm 99.99999% sure it's a corrupt CEL file.
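
The size comparison above can be turned into a quick screen for a whole set of files. This is only a sketch: flagSmallFiles() is a hypothetical helper, the 0.75 cutoff is an arbitrary assumption, and the sizes are the approximate values listed above. Compressed sizes vary with content, so a flag is a hint to inspect the file, not proof of corruption.

```r
# Hypothetical helper: flag files whose size is well below the median size
# of the set. The default cutoff (75% of the median) is an arbitrary choice.
flagSmallFiles <- function(sizes, threshold = 0.75) {
  sizes < threshold * stats::median(sizes)
}

# Approximate CEL.gz sizes (in Mb) for the three samples listed above.
sizesMb <- c(GSM1199231 = 4.6, GSM1199232 = 2.3, GSM1199233 = 4.7)

flagged <- names(which(flagSmallFiles(sizesMb)))
print(flagged)  # "GSM1199232"
```

For files on disk, the sizes vector could instead be built with file.info(pathnames)$size.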