RfastOfficial / Rfast

A collection of Rfast functions for data analysis. Note 1: The vast majority of the functions accept matrices only, not data.frames. Note 2: Do not pass matrices or vectors that have missing data (i.e. NAs). We do not check for them, and C++ internally transforms them into zeros (0), so you may get wrong results. Note 3: In general, make sure you give the correct input in order to get the correct output. We do no checks, and this is one of the many reasons we are fast.

Function Rfast::chi2Test_univariate crashing with large input data #51

Open peterbourke opened 2 years ago

peterbourke commented 2 years ago

Hi,

I am looking for a fast way to run many χ² or G² tests of independence for a particular application in linkage mapping, namely clustering markers into chromosomal linkage groups. I came across your package Rfast, and it works perfectly on small (simulated) examples. However, I wanted to scale up to a larger (and more realistic) example, so I made a simulated dataset with the same dimensions as a real dataset I am working with.

I've tried a couple of times, but each time the function Rfast::chi2Test_univariate crashes and R itself aborts. I usually use RStudio (on Windows 10) but also tried RGui, with the same result. This is a "fatal error" that causes R to abort and quit immediately without warning.

If the dataset I describe below is simply too large for your package, perhaps you could warn users about such limits in a future release. I would like to test 29,000 markers for independence across a population of 975 individuals. That requires "29000 choose 2" tests, with each pair compared across 975 paired observations. I realise this is a huge number of tests, about 420 million in total. Of course, most of these tests will not be significant. I am not worried about multiple testing issues; that is not the point here. Each marker has 5 possible classes (levels), with values ranging from 0 to 4.
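For reference, the number of pairwise tests can be checked directly in R. The memory figure below is only a back-of-the-envelope estimate that assumes one double-precision value (8 bytes) is stored per pair; what Rfast actually allocates internally is not known here.

```r
n_markers <- 29000

# Number of unordered marker pairs, i.e. "29000 choose 2"
n_pairs <- choose(n_markers, 2)
n_pairs                              # 420485500

# Assumption: one 8-byte double stored per pair (e.g. one test statistic)
bytes_per_value <- 8
gib_per_value <- n_pairs * bytes_per_value / 1024^3
round(gib_per_value, 2)              # 3.13 (GiB)
```

Storing both a statistic and a p-value for every pair would double that figure, before counting any working memory.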

To Reproduce: I would like to attach the dataset (saved as an .RDS file) and a small .R script that replicates the issue, but these are unsupported file types. I can email them separately if you like, which seems the easiest option.
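In the absence of the attachment, a stand-in dataset with the dimensions described above could be generated roughly as follows. This is a hypothetical sketch, not the original data: the seed, the object name `geno`, and the uniform, independent sampling of classes are all assumptions, and real linked markers would not be independent.

```r
set.seed(1)            # arbitrary seed, for reproducibility only
n_ind     <- 975       # individuals (rows)
n_markers <- 200       # reduced for a quick check; use 29000 for the full-size case

# Each marker takes an integer class from 0 to 4, drawn uniformly at random
geno <- matrix(sample(0:4, n_ind * n_markers, replace = TRUE),
               nrow = n_ind, ncol = n_markers)

dim(geno)
range(geno)
```

At full size the integer matrix holds 975 x 29,000 values (about 28 million), so the input itself is modest compared with the number of pairwise results.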

Expected behavior: The function should execute the command without crashing. Alternatively, it could check the input data and call stop() with a meaningful message that the dataset is too large, rather than killing R.
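The kind of guard meant here could look like the sketch below. The function name `check_pair_budget` and the `max_pairs` threshold are hypothetical and not part of Rfast; the limit of 1e7 pairs is an arbitrary placeholder, not a measured capacity.

```r
# Hypothetical pre-flight check, run before any pairwise testing starts
check_pair_budget <- function(x, max_pairs = 1e7) {
  if (!is.matrix(x)) stop("x must be a matrix")
  if (anyNA(x)) stop("x contains NA values")
  n_pairs <- choose(ncol(x), 2)      # number of column pairs to be tested
  if (n_pairs > max_pairs) {
    stop(sprintf("%.0f column pairs requested, limit is %.0f; split the input into blocks",
                 n_pairs, max_pairs))
  }
  invisible(n_pairs)
}

# Usage: a small input passes, an oversized one stops with a readable message
ok  <- check_pair_budget(matrix(0L, nrow = 2, ncol = 3))
msg <- tryCatch(check_pair_budget(matrix(0L, nrow = 2, ncol = 100), max_pairs = 10),
                error = function(e) conditionMessage(e))
```

A check of this kind costs a few microseconds, so it would not noticeably affect speed for valid inputs.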


Additional context: Perhaps you could split the problem up and use multi-threading, which might help solve it. All the tests are independent, so they could be run in parallel. I use doParallel in some of my R packages; it's relatively simple and cross-platform.
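As a rough illustration of the splitting idea, the sketch below computes Pearson chi-square statistics for all column pairs in chunks across worker processes. It uses base R's `parallel` package rather than doParallel to stay self-contained, and plain `table()` arithmetic rather than any Rfast routine. The helper name `block_chisq`, the toy data, and the choice of 2 workers are assumptions for illustration only.

```r
library(parallel)

# Statistics for all pairs (i, j) with j > i, for the first indices in `is`
block_chisq <- function(is, x) {
  # Pearson chi-square statistic from the observed contingency table
  pair_chisq <- function(a, b) {
    tab <- table(a, b)
    expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
    sum((tab - expected)^2 / expected)
  }
  p <- ncol(x)
  do.call(rbind, lapply(is, function(i) {
    js <- (i + 1):p
    stat <- vapply(js, function(j) pair_chisq(x[, i], x[, j]), numeric(1))
    data.frame(i = i, j = js, stat = stat)
  }))
}

set.seed(1)                                     # toy data: 200 individuals, 12 markers
x <- matrix(sample(0:4, 200 * 12, replace = TRUE), nrow = 200)

cl <- makeCluster(2)                            # 2 worker processes (works on Windows too)
chunks <- splitIndices(ncol(x) - 1, 2)          # split first indices 1..(p - 1) into 2 chunks
res <- do.call(rbind, parLapply(cl, chunks, block_chisq, x = x))
stopCluster(cl)

nrow(res) == choose(ncol(x), 2)                 # one row per pair
```

Because each chunk returns only its own block of results, a caller could write every block to disk or keep only the pairs above a threshold instead of holding all results in memory at once. Note that early first indices have more partners than late ones, so equal-sized chunks are not equally expensive.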