Open alexg9010 opened 3 years ago
This is a crazy bug :)! Do you know any alternatives, like DelayedMatrix, to move away from tabix? Last time I checked, they didn't have the overlap functionality we use with tabix.
On Tue, Jan 26, 2021 at 4:22 PM Alexander Blume (Gosdschan) wrote:
It seems as if the caching of uncompressed files introduced with version 1.13.2 causes some problems if the user works with tabix files for more than one CpG context and wants to convert those into memory objects. See here: https://groups.google.com/g/methylkit_discussion/c/UruFjvX89B4/m/_aMsqBC-DwAJ
The problem is caused by the fread.gzipped() function and arises whenever two tabix files with the same basename are used in the same session. The first time a tabix file is read with fread.gzipped(), it is uncompressed and stored in a (session-specific) temporary location; subsequent calls to fread.gzipped() reuse the cached uncompressed file. If another tabix file with the same basename is then supposed to be uncompressed, the previously cached file is read instead. Unfortunately this happens silently: missing rows are filled with NAs and might cause unnoticed issues downstream.
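A minimal sketch of the collision in plain R (the helper names `cache_path`/`read_cached` and the assumption that the cache key is just the file's basename are mine, for illustration; fread.gzipped() itself does more):

```r
# Two different inputs that share a basename, e.g. one per methylation context
dir1 <- file.path(tempdir(), "CpG"); dir.create(dir1, showWarnings = FALSE)
dir2 <- file.path(tempdir(), "CHH"); dir.create(dir2, showWarnings = FALSE)
f1 <- file.path(dir1, "sample1.txt.bgz"); writeLines("CpG data", f1)
f2 <- file.path(dir2, "sample1.txt.bgz"); writeLines("CHH data", f2)

# Cache keyed on basename only, mimicking the buggy behaviour
dir.create(file.path(tempdir(), "cache"), showWarnings = FALSE)
cache_path <- function(f) file.path(tempdir(), "cache", basename(f))

read_cached <- function(f) {
  cp <- cache_path(f)
  if (!file.exists(cp)) file.copy(f, cp)  # "uncompress" only the first time
  readLines(cp)                           # later calls reuse the cached copy
}

read_cached(f1)  # caches the CpG content under "sample1.txt.bgz"
read_cached(f2)  # silently returns the CpG content instead of CHH
```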
One (hopefully) simple idea to mitigate this would be to calculate a hash of the compressed file and make it part of the name of the cached file. However, we need to make sure to ignore the tabix file's header, if present.
- calculate a hash of the compressed file
- ignore the tabix file's header if present
- make the hash part of the cached file's name
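The hash-based naming could look roughly like this, using tools::md5sum from base R (`make_cache_path` is a hypothetical helper; note that hashing the raw compressed bytes still includes the tabix header, which is why the header-skipping item matters):

```r
# Sketch: derive the cache file name from the compressed file's content,
# so equal basenames no longer collide
make_cache_path <- function(f) {
  hash <- unname(tools::md5sum(f))  # hash of the compressed file
  file.path(tempdir(), paste0(hash, "_", basename(f)))
}

d1 <- file.path(tempdir(), "CpG"); dir.create(d1, showWarnings = FALSE)
d2 <- file.path(tempdir(), "CHH"); dir.create(d2, showWarnings = FALSE)
f1 <- file.path(d1, "sample1.txt.bgz"); writeLines("CpG data", f1)
f2 <- file.path(d2, "sample1.txt.bgz"); writeLines("CHH data", f2)

make_cache_path(f1) != make_cache_path(f2)  # TRUE: same basename, no collision
```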
The easiest fix for this bug would be to disable the caching of the uncompressed files for now and just overwrite the uncompressed file every time. But you are right, maybe it is time to switch to better-supported, externally developed backends.
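The no-cache workaround amounts to dropping the "does the cached file already exist?" check and always re-extracting (`uncompress_always` is a hypothetical helper for illustration; the real fread.gzipped() handles more formats):

```r
# Workaround sketch: never reuse a cached copy, always overwrite
uncompress_always <- function(f, dest = file.path(tempdir(), basename(f))) {
  writeLines(readLines(gzfile(f)), dest)  # rewrites dest unconditionally
  dest
}

d <- file.path(tempdir(), "in"); dir.create(d, showWarnings = FALSE)
gz <- file.path(d, "sample1.txt.gz")
con <- gzfile(gz, "w"); writeLines("CpG data", con); close(con)

readLines(uncompress_always(gz))  # always reflects the current input file
```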
Collecting some ideas:
The matter package is designed with several goals in mind. Like the bigmemory and ff packages, it seeks to make statistical methods scalable to larger-than-memory datasets by utilizing data-on-disk. Unlike those packages, it seeks to make domain-specific file formats (such as Analyze 7.5 and imzML for MS imaging experiments) accessible from disk directly without additional file conversion. It seeks to have a minimal memory footprint, and require minimal developer effort to use, while maintaining computational efficiency wherever possible.
(from the matter vignette: https://bioconductor.org/packages/3.12/bioc/vignettes/matter/inst/doc/matter.pdf)