The bone marrow dataset in our paper has ~900k cells and 200k cCREs and the code worked fine (maybe different computers/systems have different limits?). Could you try subsetting your dataset to only keep the data of, say, 1000 cells? If there's still an error then it's probably not a file size issue. Also, could you check whether the file format is correct (i.e., a tab-delimited fragments file)? Finally, if you can show me exactly which line of code threw the error, it might be easier for us to troubleshoot.
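[Editor's note] A minimal sketch of what subsetting to ~1000 cells could look like, assuming a standard 5-column fragments file and a two-column barcode-to-pseudobulk table; the file names and column names below are illustrative assumptions, not part of PRINT's interface.

library(data.table)

# Barcode-to-pseudobulk assignments (assumed two-column, tab-delimited file)
barcodeGroups <- fread("barcodeGroups.tsv", header = FALSE,
                       col.names = c("barcode", "group"))

# Pick 1000 random cells to keep
set.seed(1)
keepBarcodes <- sample(unique(barcodeGroups$barcode), 1000)

# Reading a .gz/.bgz fragments file directly requires the R.utils package
frags <- fread("fragments.tsv.gz", sep = "\t", header = FALSE,
               col.names = c("chr", "start", "end", "barcode", "count"))
fragsSub <- frags[barcode %in% keepBarcodes]

fwrite(fragsSub, "fragments_1000cells.tsv", sep = "\t", col.names = FALSE)
fwrite(barcodeGroups[barcode %in% keepBarcodes], "barcodeGroups_1000cells.tsv",
       sep = "\t", col.names = FALSE)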
Thanks for the quick response. I agree it's probably not a data size problem. I'm working on an HPC with 1 TB of memory and 60 CPUs, so I shouldn't be limited by these resources. The exact line of code that gave an error was this one (lines 22-23 of getCounts.R):
frags <- data.table::fread(fragFile, sep = "\t", showProgress = TRUE, nrows=nrows) %>% data.frame()
I will try subsetting the data first (which means I need to redo the pseudobulking for the groupInfo file). The fragment file is correctly formatted (although it is compressed with bgzip); I put an example of the first 5 lines below (columns are "\t"-delimited). I'm pretty sure fread can handle gzipped files natively. Thank you for your help on this.
chr1	65	247	AATACGCAGAGACTCG-pool11	1
chr1	69	148	ATTCGTTGTTAGGCTT-pool7	1
chr1	72	1316	TGGCCTTGTCCGTCGA-pool11	1
chr1	72	1324	GAAGAGCCAGATTGTC-pool16	1
chr1	78	1337	ACTAACGCAACGGGTA-pool11	1
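[Editor's note] A quick way to sanity-check the layout shown above without decompressing the whole file (a sketch; the file name is an assumption). bgzip output is standard gzip-compatible, so a plain gzfile connection can read it.

# Read just the first 5 records from the bgzipped file
first5 <- readLines(gzfile("fragments.tsv.gz"), n = 5)

# Parse them with fread and check the column types
library(data.table)
head5 <- fread(text = first5, sep = "\t", header = FALSE,
               col.names = c("chr", "start", "end", "barcode", "count"))
str(head5)  # start/end should be integer, barcode a character column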
Yeah that sounds good! Keep me posted : )
I was able to load a subset of 1 million fragments without issue. My fragment file is 101 GB uncompressed and contains over 2.2 billion unique (cellular) fragments. I tried loading the uncompressed tsv and still get the same error. I suspect the error is coming from the piped call to data.frame(), for what it's worth. I will try loading without the pipe to see what happens. If needed, I can filter the fragments file for my barcode list and max fragment size ahead of time. Stay tuned.
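[Editor's note] A rough sketch of that pre-filtering idea, streaming the file in chunks so the full 2.2 billion rows never sit in a single R object; the chunk size, file names, barcode-list format, and 2000 bp length cap are all illustrative assumptions.

library(data.table)

keepBarcodes <- fread("barcodeList.txt", header = FALSE)$V1
chunkSize <- 1e8
outFile <- "fragments_filtered.tsv"
if (file.exists(outFile)) file.remove(outFile)

skip <- 0
repeat {
  # fread re-scans from the start of the file for each skip value,
  # so this trades speed for memory safety
  chunk <- tryCatch(
    fread("fragments.tsv", sep = "\t", header = FALSE,
          skip = skip, nrows = chunkSize,
          col.names = c("chr", "start", "end", "barcode", "count")),
    error = function(e) data.table()   # skip ran past the end of the file
  )
  nRead <- nrow(chunk)
  if (nRead == 0) break
  # Keep only wanted barcodes and fragments up to the (assumed) length cap
  kept <- chunk[barcode %in% keepBarcodes & (end - start) <= 2000]
  fwrite(kept, outFile, sep = "\t", col.names = FALSE, append = TRUE)
  if (nRead < chunkSize) break
  skip <- skip + chunkSize
}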
OK, so if you can load 1 million fragments then your data format should be compatible with the code; you're probably hitting a size issue here. 101 GB is definitely pretty big (our data is 12 GB when compressed). We've also run into similar issues when running on very big datasets, and if I remember correctly this is a general issue with how R handles large data. One thing that we tried was chunking the data into smaller subsets (let's say you have 100 pseudobulks; then maybe chunk the fragments file into 10 subfiles, each corresponding to 10 pseudobulks). A more elegant solution will probably come later when the Python version is finished.
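[Editor's note] To make the chunking suggestion concrete, a sketch that splits the pseudobulks into 10 batches and writes one fragments subfile per batch. It assumes the (already barcode-filtered) fragments file can be loaded once, and the file and column names are illustrative, not PRINT's API.

library(data.table)

barcodeGroups <- fread("barcodeGroups.tsv", header = FALSE,
                       col.names = c("barcode", "group"))

# Split the pseudobulk groups into 10 batches
groups  <- unique(barcodeGroups$group)
batches <- split(groups, cut(seq_along(groups), breaks = 10, labels = FALSE))

frags <- fread("fragments_filtered.tsv", sep = "\t", header = FALSE,
               col.names = c("chr", "start", "end", "barcode", "count"))

# Write one fragments subfile and one barcodeGroups subfile per batch
for (i in seq_along(batches)) {
  batchBarcodes <- barcodeGroups[group %in% batches[[i]], barcode]
  fwrite(frags[barcode %in% batchBarcodes],
         sprintf("fragments_batch%02d.tsv", i), sep = "\t", col.names = FALSE)
  fwrite(barcodeGroups[group %in% batches[[i]]],
         sprintf("barcodeGroups_batch%02d.tsv", i), sep = "\t", col.names = FALSE)
}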
I was able to get the fragments file down to 14 GB compressed, but I am wondering if I can trim it further by selecting fragments under a certain length. Is there a hard cut-off on the size of fragments that PRINT uses for footprinting?
That's a great question! PRINT uses the two ends of each fragment (which are essentially the cut sites) for footprinting, so we think fragments of all lengths could be informative. We currently don't set a cutoff on fragment size.
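[Editor's note] Purely to illustrate the point about fragment ends, a toy sketch (not PRINT's actual code) of turning fragments into the two cut-site positions that footprinting uses; the example records mirror the fragment lines shown earlier.

library(data.table)

frags <- data.table(chr     = c("chr1", "chr1"),
                    start   = c(65L, 72L),
                    end     = c(247L, 1316L),
                    barcode = c("AATACGCAGAGACTCG-pool11", "TGGCCTTGTCCGTCGA-pool11"))

# Each fragment contributes both of its ends, regardless of its length
cutSites <- rbind(frags[, .(chr, pos = start, barcode)],
                  frags[, .(chr, pos = end,   barcode)])
cutSites[order(chr, pos)]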
Good news: I was able to load my barcode-filtered fragments file. I ended up with a total of 1,346,488,605 unique fragments. Thanks again for your help and for making this tool!
No problem at all! : )
I am getting an error when loading the fragment file, right after the progress bar completes:
Error in dim.data.table(x) :
  long vectors not supported yet: ../../src/include/Rinlinedfuns.h:522
In addition: Warning message:
In setattr(ans, "row.names", .set_row_names(nr)) :
  NAs introduced by coercion to integer range
I have a large number of cells (~700K) and many cCRE regions (filtered down to ~82K). The "long vector" error makes me think the dataset is too large to load into memory. Are there ways around this?
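[Editor's note] One quick diagnostic, assuming a Unix shell with zcat available (a sketch, not from the thread): count the fragment records and compare against R's integer maximum, which is what the "long vectors" error and the row-names warning point at.

# Count fragment records without loading them into R
nFrags <- as.numeric(system("zcat fragments.tsv.gz | wc -l", intern = TRUE))
nFrags > .Machine$integer.max  # TRUE would explain the long-vector error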