Open alexg9010 opened 3 years ago
This is a crazy bug :)! Do you know any alternatives, like DelayedMatrix, to move away from tabix? Last time I checked, they didn't have the overlap functionality we use with tabix.
On Tue, Jan 26, 2021 at 4:22 PM Alexander Blume (Gosdschan) wrote:
It seems as if the caching of uncompressed files introduced with version 1.13.2 causes some problems if the user works with tabix files for more than one CpG context and wants to convert those into memory objects. See here: https://groups.google.com/g/methylkit_discussion/c/UruFjvX89B4/m/_aMsqBC-DwAJ
The problem is caused by the fread.gzipped() function and arises whenever two tabix files with the same basename are used in the same session. The first time a tabix file is read with fread.gzipped(), it is uncompressed and stored in a (session-specific) temporary location; subsequent calls to fread.gzipped() reuse the cached uncompressed file. If another tabix file with the same basename is then supposed to be uncompressed, the previously cached file is read instead. Unfortunately this happens silently: missing rows are filled with NAs and might cause unnoticed issues downstream.
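A minimal sketch of the collision in plain R (the helper names `cache_path`/`read_cached` and the assumption that the cache key is just the file's basename are mine, for illustration; fread.gzipped() itself does more):

```r
# Two different inputs that share a basename, e.g. one per methylation context
dir1 <- file.path(tempdir(), "CpG"); dir.create(dir1, showWarnings = FALSE)
dir2 <- file.path(tempdir(), "CHH"); dir.create(dir2, showWarnings = FALSE)
f1 <- file.path(dir1, "sample1.txt.bgz"); writeLines("CpG data", f1)
f2 <- file.path(dir2, "sample1.txt.bgz"); writeLines("CHH data", f2)

# Cache keyed on basename only, mimicking the buggy behaviour
dir.create(file.path(tempdir(), "cache"), showWarnings = FALSE)
cache_path <- function(f) file.path(tempdir(), "cache", basename(f))

read_cached <- function(f) {
  cp <- cache_path(f)
  if (!file.exists(cp)) file.copy(f, cp)  # "uncompress" only the first time
  readLines(cp)                           # later calls reuse the cached copy
}

read_cached(f1)  # caches the CpG content under "sample1.txt.bgz"
read_cached(f2)  # silently returns the CpG content instead of CHH
```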
One (hopefully) simple idea to mitigate this would be to calculate a hash of the compressed file and make it part of the name of the cached file. However, we need to make sure to ignore the tabix file's header, if present.
- calculate a hash of the compressed file
- ignore the tabix file's header if present
- make the hash part of the cached file's name
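The hash-based naming could look roughly like this, using tools::md5sum from base R (`make_cache_path` is a hypothetical helper; note that hashing the raw compressed bytes still includes the tabix header, which is why the header-skipping item matters):

```r
# Sketch: derive the cache file name from the compressed file's content,
# so equal basenames no longer collide
make_cache_path <- function(f) {
  hash <- unname(tools::md5sum(f))  # hash of the compressed file
  file.path(tempdir(), paste0(hash, "_", basename(f)))
}

d1 <- file.path(tempdir(), "CpG"); dir.create(d1, showWarnings = FALSE)
d2 <- file.path(tempdir(), "CHH"); dir.create(d2, showWarnings = FALSE)
f1 <- file.path(d1, "sample1.txt.bgz"); writeLines("CpG data", f1)
f2 <- file.path(d2, "sample1.txt.bgz"); writeLines("CHH data", f2)

make_cache_path(f1) != make_cache_path(f2)  # TRUE: same basename, no collision
```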
The easiest fix for this bug would be to disable the caching of the uncompressed files for now and just overwrite the uncompressed file every time. But you are right, maybe it is time to switch to better-supported, externally developed backends.
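The no-cache workaround amounts to dropping the "does the cached file already exist?" check and always re-extracting (`uncompress_always` is a hypothetical helper for illustration; the real fread.gzipped() handles more formats):

```r
# Workaround sketch: never reuse a cached copy, always overwrite
uncompress_always <- function(f, dest = file.path(tempdir(), basename(f))) {
  writeLines(readLines(gzfile(f)), dest)  # rewrites dest unconditionally
  dest
}

d <- file.path(tempdir(), "in"); dir.create(d, showWarnings = FALSE)
gz <- file.path(d, "sample1.txt.gz")
con <- gzfile(gz, "w"); writeLines("CpG data", con); close(con)

readLines(uncompress_always(gz))  # always reflects the current input file
```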
Collecting some ideas:
The matter package is designed with several goals in mind. Like the bigmemory and ff packages, it seeks to make statistical methods scalable to larger-than-memory datasets by utilizing data-on-disk. Unlike those packages, it seeks to make domain-specific file formats (such as Analyze 7.5 and imzML for MS imaging experiments) accessible from disk directly without additional file conversion. It seeks to have a minimal memory footprint, and require minimal developer effort to use, while maintaining computational efficiency wherever possible.
(from the matter vignette: https://bioconductor.org/packages/3.12/bioc/vignettes/matter/inst/doc/matter.pdf)