HenrikBengtsson / illuminaio

🔬 R package: This is the Bioconductor devel version of the illuminaio package.
http://bioconductor.org/packages/devel/bioc/html/illuminaio.html
6 stars 2 forks source link

Avoid warning on null terminated string when reading IDAT file #21

Closed HenrikBengtsson closed 2 years ago

HenrikBengtsson commented 2 years ago

Moving discussion in https://github.com/HenrikBengtsson/illuminaio/commit/3d0b5619199d5ff428bf7c5aa25731ecf30f4669 to a proper issue, in order not to lose the background and the rationale for the workaround/solution.

@kasperdanielhansen committed (commit 3d0b561):

Added suppressWarnings() on readChar to ignore a null terminated string. Otherwise, illuminaio emits a warning when reading an Illumina IDAT file (at least some of the newer methylation arrays).

by adding suppressWarnings() in:

suppressWarnings(readChar(con, nchars=n))

@HenrikBengtsson replied:

Is this safe? Could it be that the warning on reading nulls in strings an indication that something else needs to be fixed? By suppressing such warnings, we'll no longer be aware of other cases and problem.

@kasperdanielhansen replied:

I would say it's pretty safe, and it is impossible to avoid, see ?readChar.

But just because. its a good point, I have identified the specific field in the IDAT file which gives this warning and I have moved the suppressWarnings from readString to the specific field. Updated to version 0.39.0-9002 and pushed.

by moving suppressWarnings() to:

   res <- lapply(res, function(xx) {
        where <- fields[xx, "byteOffset"]
        seek(con, where = where, origin = "start")
        readField(con = con, field = xx)
        if(xx == "Unknown.6")
            suppressWarnings(readField(con = con, field = xx))
        else
            readField(con = con, field = xx)
    })

@HenrikBengtsson replied:

Maybe there's a simple explanation to having nulls in strings, e.g. maybe it's always the n:th character that is null, and one should read the string as:

s <- readChar(con, nchars = n)
nul <- readBin(con, what = "character", n = 1L)

If the null is in the middle of the expected string, it could be worth understanding why.

Also, suppressWarning() is an expensive function, but I forgot what the exact layout of IDAT files is, so I'm not sure how many times it ends up being called in the end of the day.

@kasperdanielhansen replied:

It gets called one time per file.

HenrikBengtsson commented 2 years ago

@kasperdanielhansen, I'd like to add a source-code comment with details for why this suppressWarning() is introduced. Exactly what is the symptom. You mentioned "warning" in your commit, but I get an error if I do something like:

con <- c(charToRaw("abc"), as.raw(0L))

readChar(con, nchars=3L)
#> [1] "abc"

readChar(con, nchars=4L)
#> Error in readChar(con, nchars = 4L) : embedded nul in string: 'abc\0'
kasperdanielhansen commented 2 years ago

Standard EPIC IDATs have the warning, here from a package: tmp = readIDAT(list.files(system.file("extdata/200144450019", package = "minfiDataEPIC"), full = TRUE)[1])

So the output "should" be stored in tmp$Unknowns$Unknown.6 but this is not even returned to the user (which is clearly an oversight I think, and I also think we should return Unknown.7). The value of these two fields are both "" (you need to debug to get to it, because it is not returned).

On Thu, Sep 15, 2022 at 3:04 PM Henrik Bengtsson @.***> wrote:

@kasperdanielhansen https://github.com/kasperdanielhansen, I'd like to add a source-code comment with details for why this suppressWarning() is introduced. Exactly what is the symptoms. You mentioned "warning" in your commit, but I get an error if I do something like:

con <- c(charToRaw("abc"), as.raw(0L))

readChar(con, nchars=3L)#> [1] "abc"

readChar(con, nchars=4L)#> Error in readChar(con, nchars = 4L) : embedded nul in string: 'abc\0'

— Reply to this email directly, view it on GitHub https://github.com/HenrikBengtsson/illuminaio/issues/21#issuecomment-1248492000, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABF2DH4RON7MSTNQ5T4XNGTV6NXLVANCNFSM6AAAAAAQNVSPHU . You are receiving this because you were mentioned.Message ID: @.***>

-- Best, Kasper

HenrikBengtsson commented 2 years ago

... and what's the warning message exactly?

kasperdanielhansen commented 2 years ago

Warning message: In readChar(con, nchars = n) : truncating string with embedded nuls

On Thu, Sep 15, 2022 at 5:50 PM Henrik Bengtsson @.***> wrote:

... and what's the warning message exactly?

— Reply to this email directly, view it on GitHub https://github.com/HenrikBengtsson/illuminaio/issues/21#issuecomment-1248668581, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABF2DHY7DQTQND5SPNQN6CLV6OKZPANCNFSM6AAAAAAQNVSPHU . You are receiving this because you were mentioned.Message ID: @.***>

-- Best, Kasper

HenrikBengtsson commented 2 years ago

UPDATE:

FWIW, I found https://github.com/bioinformed/glu-genetics/blob/master/glu/lib/illumina.py#L44-L61; it seems that they've identified/decided on what some of the "unknown" fields are.

HenrikBengtsson commented 2 years ago

illuminaio 0.39.1 has now been pushed to Bioc devel.

HenrikBengtsson commented 2 years ago

Just for the record, one of the IDAT files in IlluminaDataTestFiles has produced this warning all along, e.g.

> library(illuminaio)
TRACKER:  loadedNamespaces() changed:  4 packages loaded ('askpass', 'base64', 'illuminaio', 'openssl')
> file <- system.file("extdata", "idat", "5723646052_R02C02_Grn.idat", package = "IlluminaDataTestFiles")
> idat <- illuminaio::readIDAT(file)
Warning message:
In readChar(con, nchars = n) : truncating string with embedded nuls
Calls: <Anonymous> ... lapply -> FUN -> readField -> readString -> readChar

In illuminaio (>= 0.39.1) it no longer will:

> idat <- illuminaio::readIDAT(file)
> str(idat$Unknowns)
List of 9
 $ MostlyNull: chr ""
 $ MostlyA   : chr "R02C02"
 $ Unknown.1 : chr ""
 $ Unknown.6 : int [1:2] 1 0
 $ Unknown.2 : chr ""
 $ Unknown.3 : chr ""
 $ Unknown.4 : chr ""
 $ Unknown.5 : chr ""
 $ Unknown.7 : chr ""