HenrikBengtsson / illuminaio

🔬 R package: This is the Bioconductor devel version of the illuminaio package.
http://bioconductor.org/packages/devel/bioc/html/illuminaio.html
6 stars 2 forks source link

Bugs in RunInfo Parsing #2

Closed brazilbean closed 9 years ago

brazilbean commented 9 years ago

I'm trying to analyze some IDAT files in python, and your R package has been very helpful. Thanks you! Two things I've noticed in the process:

1) The run info loop [of readIDAT_nonenc()] should iterate from 1:nRunInfoBlocks, not 1:2. Current:

for (ii in 1:2) { 
    for (jj in 1:5) {
        RunInfo[ii,jj] <- readString(con = con)
    }
}

Should be:

for (ii in 1:nRunInfoBlocks) { 
    for (jj in 1:5) {
        RunInfo[ii,jj] <- readString(con = con)
    }
}

2) The readString function needs to implement the commented algorithm or it will fail when trying to read all the runInfo blocks. Note, the second byte described denotes the total number of blocks, not the number of additional blocks. My python implementation (derived from your R implementation, but with the fix) reads:

def readString(f):
    '''Read a string. The length of the string is encoded in the first byte.'''
    n = readByte(f)
    # Check if high-bit is set
    if n & 1<<7 == 0:
        return f.read(n).decode()
    else:
        nblocks = readByte(f)-1
        out = [f.read(n)]
        for ii in range(nblocks):
            out.append(readBlockString(f))
        return b''.join(out).decode()

Super helpful package. Keep up the good work!

HenrikBengtsson commented 9 years ago

Thanks for the report and proposed fix. I'll see who can look into this. If you could point us to an IDAT file for which readIDAT() failes/gives an error, that would be extremely helpful for our troubleshooting/bug fixing. Even better if it's a public file, because then we can add it to our redundancy tests.

For the record:

  1. The for loop is internal of readIDAT_nonenc()#L61-L65
  2. The readString() function is a local function of readIDAT_nonenc()#L18-L32. It's interesting to note we/HB already made some type of comments back in 2011 about also handling the case where the hi-bit is set. This seems to match what @brazilbean proposes. See source code comment.
HenrikBengtsson commented 9 years ago

I've finally got around to fix this. I've pushed illuminaio v0.11.2 to BioC devel. Should be available to end users in a day or two. Can be installed already now as source("http://callr.org/install#HenrikBengtsson/illuminaio").

Unfortunately, we don't have any IDAT files for verifying that IDAT strings of length 128 or longer can be read. @brazilbean, do you have such IDAT files that you test with this new version?

@kasperdanielhansen, IlluminaDataTestFiles should be updates so that the hack in the tests/readIDAT_370k.R test script can be avoided; the "truth" from IlluminaDataTestFiles is actually not correct.