caitiecollins / treeWAS

treeWAS: A Phylogenetic Tree-Based Tool for Genome-Wide Association Studies in Microbes

long vectors not supported yet #72

Open daisy238 opened 5 months ago

daisy238 commented 5 months ago

Hi Caitlin,

Thanks for developing TreeWAS!

I'm trying to use unitigs with TreeWAS and have been running into the error below:

Error in unlist(snps[!is.na(snps)]) : 
  long vectors not supported yet: ../../src/include/Rinlinedfuns.h:537
Calls: treeWAS -> unique -> as.vector -> unlist
Execution halted

I've already successfully run TreeWAS on a smaller gene presence/absence dataset, so this looks to me like a memory-based issue. I've therefore added the mem.lim parameter, but I still receive the same error. I'm running the job on a cluster node with 925 GB of memory. The unitigs file is around 27 GB, with 2806 genomes and 5682556 unitigs/columns.

My TreeWAS commands are below:

unitigs <- treeWAS(snps = unitig_matrix,
                   phen = phenotypes,
                   tree = data_tree,
                   mem.lim = 900,
                   seed = 1)

I've also tried using mem.lim = TRUE, but this gives the same error.

If I reduce the number of columns in the unitig matrix down to 1000, TreeWAS then works.

Do you have any advice please, for dealing with a large unitig matrix?

caitiecollins commented 5 months ago

Hi Daisy,

I just pushed a change that should resolve the current issue you're facing. So if you re-download and install the treeWAS package (with dependencies=TRUE) from GitHub, it should work without hitting that error.

The line causing the error runs before the code adjusted by the memory-limit setting (which subdivides your snps data into more manageable chunks), which is why it wasn't affected by the mem.lim parameter, unfortunately.
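As an illustration only (this is not the package's actual code, and `chunk_cols` is a hypothetical helper), the column-wise chunking that the memory limit triggers can be sketched roughly like this:

```r
## Hypothetical sketch of column-wise chunking: split the column
## indices of a snps matrix into groups of at most chunk_size,
## so each group can be processed within the memory limit.
chunk_cols <- function(n_col, chunk_size) {
  split(seq_len(n_col), ceiling(seq_len(n_col) / chunk_size))
}

idx <- chunk_cols(n_col = 25, chunk_size = 10)
length(idx)  # 3 chunks: columns 1-10, 11-20, 21-25
```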

That's a mighty large dataset you're working with (gotta love unitigs), so you may run into other issues. If you do, please let me know and I'll try to get back to you quicker with a fix. I'm keen to make the package more scalable.

Best, Caitlin.

daisy238 commented 5 months ago

Hi Caitlin,

Thanks for looking into this and making the change. This has resolved the previous long-vector issue, but we are now encountering another one:

Error in `dplyr::n_distinct()`:
! Can't recycle `..1` (size 1605613362) to size 1605613362.
Backtrace:
    ▆
 1. ├─treeWAS::treeWAS(...)
 2. │ └─dplyr::n_distinct(snps[!is.na(snps)])
 3. │   └─vctrs::vec_recycle_common(!!!args, .size = size)
 4. └─vctrs:::stop_recycle_incompatible_size(...)
 5.   └─vctrs:::stop_vctrs(...)
 6.     └─rlang::abort(message, class = c(class, "vctrs_error"), ..., call = vctrs_error_call(call))
Execution halted

On another note: to get around the long-vectors issue, I had tried removing the following lines from the previous treeWAS.R code. With 900 GB of memory this split the unitig matrix into 83 chunks; however, each chunk took around 24 hours to process. Is this to be expected?

portion of treeWAS.R code removed:

 ## CHECK IF BINARY:
 if(length(unique(as.vector(unlist(snps[!is.na(snps)])))) != 2){
    stop("snps must be a binary matrix")
 }
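Rather than deleting the check, a long-vector-safe variant could scan the matrix column by column. This is a sketch only, and `is_binary_matrix` is a hypothetical helper name, not a function in treeWAS:

```r
## Hypothetical long-vector-safe binary check: instead of unlist()ing
## the whole matrix (which creates one enormous vector), accumulate the
## distinct non-NA values column by column, stopping early as soon as
## more than two states have been seen.
is_binary_matrix <- function(snps) {
  vals <- NULL
  for (j in seq_len(ncol(snps))) {
    col <- snps[, j]
    vals <- union(vals, unique(col[!is.na(col)]))
    if (length(vals) > 2) return(FALSE)  # early exit: not binary
  }
  length(vals) == 2
}
```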
caitiecollins commented 4 months ago

Hi Daisy,

That sounds like far longer per chunk than I would expect. You may still be bumping up against memory constraints, which could be slowing things down.

I would suggest running a larger number of smaller chunks. Typically this takes no longer than running fewer, larger chunks, and it may help if each chunk is still approaching a hidden memory limit.

Try setting chunk.size=10000. I just ran a toy example with 2806 rows and 10000 columns on my laptop and it finished in 8.5 minutes (and only ~8.9 minutes whether I subdivided it into 2, 5, or 10 chunks). At that rate, with chunk.size=10000, your 5682556 columns could finish in under 4 days.
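For reference, the arithmetic behind that estimate, assuming roughly 8.5 minutes per 10000-column chunk as in the toy example:

```r
## Back-of-envelope runtime estimate for chunk.size = 10000,
## assuming ~8.5 minutes per chunk (from the toy example above).
n_unitigs  <- 5682556
chunk_size <- 10000
n_chunks   <- ceiling(n_unitigs / chunk_size)  # 569 chunks
total_days <- n_chunks * 8.5 / 60 / 24         # ~3.4 days
```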