Join results in more than 2^31 rows for format_sumstats

Snigireva commented 1 year ago

1. Bug description

Hi! I run this code to standardize the summary statistics:

data = fread('C:/Folder/trait_qc.sumstats.csv.gz')
reformatted <- MungeSumstats::format_sumstats(path=data,  ref_genome="GRCh37", compute_z = TRUE, return_data = TRUE)

Any idea of what to do with that?

Console output


Formatted summary statistics will be saved to ==>  C:\Users\P70~1\AppData\Local\Temp\RtmpQNDQJX\file371020c95a84.tsv.gz
Standardising column headers.
First line of summary statistics file: 
SNP CHR BP  PVAL    A1  A2  N   Z   BETA    SE  NSTUDY  
Summary statistics report:
   - 45,984,943 rows
   - 23,134,502 unique variants
   - 114,938 genome-wide significant variants (P<5e-8)
   - 22 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Checking SNP RSIDs.
1,391,650 SNP IDs are not correctly formatted. These will be corrected from the reference genome.
Found  Indels. These won't be checked against the reference genome as it does not contain Indels.
WARNING If your sumstat doesn't contain Indels, set the indel param to FALSE & rerun MungeSumstats::format_sumstats()
Loading SNPlocs data.
Checking for merged allele column.
Checking A1 is uppercase
Checking A2 is uppercase
Ensuring all SNPs are on the reference genome.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 24,304,912 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 240 seconds.
Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__,  : 
  Join results in more than 2^31 rows (internal vecseq reached physical limit). Very likely misspecified join. Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.

Data (for the first 50 rows)

df = structure(list(SNP = c("rs367896724", "rs145", "rs534229142", "rs537182", "rs376342519", "rs5586", "rs575272151", "rs544419", "rs5611", "rs54", "rs62635286", "rs62", "rs53173", "rs538791886", "rs558318514", "rs55476", "rs574697788", "rs554", "rs546169444", "rs7", "rs54194", "rs6682385", "rs199856693", "rs3982632", "rs576", "rs2758118", "rs2758118", "rs53363", "rs564", "rs374", "rs2691317", "rs2691315", "rs5575142", "rs541172944", "rs548165136", "rs755466349", "rs539235482", "rs199745162", "rs578", "rs564", "rs533", "rs8", "rs545414834", "rs54", "rs532819925", "rs1", "rs5677884", "rs553572247", "rs539322794", "rs542415"), CHR = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), BP = c(10177L, 10352L, 10511L, 10539L, 10616L, 10642L, 11008L, 11012L, 11063L, 13110L, 13116L, 13118L, 13273L, 13289L, 13445L, 13483L, 13494L, 13550L, 14464L, 14599L, 14604L, 14930L, 14933L, 15211L, 15245L, 15274L, 15274L, 15585L, 15644L, 15774L, 15777L, 15820L, 15903L, 16071L, 16142L, 16226L, 16542L, 16949L, 17641L, 18643L, 18849L, 30923L, 46285L, 47159L, 47267L, 49298L, 49315L, 49343L, 49554L, 50891L ), PVAL = c(0.942, 0.682, 0.891, 0.393, 0.383, 0.297, 0.474, 0.474, 0.848, 0.729, 0.545, 0.545, 0.778, 0.0499, 0.109, 0.00465, 0.591, 0.0709, 0.643, 0.328, 0.328, 0.333, 0.901, 0.141, 0.116, 0.201, 0.259, 0.289, 0.689, 0.836, 0.35, 0.0248, 0.333, 0.565, 0.46, 0.497, 0.206, 0.595, 0.773, 0.197, 0.205, 0.684, 0.155, 0.69, 0.821, 0.311, 0.806, 0.745, 0.972, 0.394), A1 = c("AC", "TA", "A", "A", "CCGCCGTTGCAAAGGCGCGCCG", "A", "G", "G", "G", "A", "G", "G", "C", "C", "G", "C", "G", "A", "T", "A", "G", "A", "A", "T", "T", "A", "G", "A", "A", "A", "G", "T", "GC", "A", "A", "A", "A", "C", "A", "A", "C", "G", "A", "C", "G", "T", "A", "C", "G", "C"), A2 = c("A", "T", "G", "C", "C", "G", "C", "C", "T", "G", "T", "A", "G", "CCT", "C", "G", "A", "G", "A", "T", "A", "G", "G", "G", "C", "T", "A", "G", "G", "G", "A", "G", "G", "G", "G", "AG", "C", "A", "G", "G", "G", "T", "ATAT", "T", "T", "C", "T", "T", "A", "T"), N = c(8160L, 8160L, 361237L, 16026L, 372627L, 361266L, 8160L, 8160L, 357928L, 363969L, 8160L, 8160L, 3701L, 378761L, 357928L, 357928L, 358181L, 367239L, 6832L, 8160L, 8160L, 8160L, 358725L, 8160L, 362555L, 3701L, 3701L, 369481L, 362738L, 364049L, 362923L, 2373L, 8160L, 375575L, 367282L, 26547L, 357680L, 364788L, 357928L, 361989L, 368762L, 3701L, 359800L, 364512L, 361256L, 10040L, 362387L, 362834L, 6832L, 367281L), Z = c(0.0727563581760374, -0.409735480321281, 0.137038959961148, -0.854189500094597, 0.872382030909752, 1.04288836267464, -0.715985989610205, -0.715985989610205, 0.19167090224842, 0.346456061065837, -0.605269414941509, -0.605269414941509, 0.281926329587061, -1.96082020683793, 1.60270409055176, -2.83033010490082, 0.537387465090095, 1.80611742223106, -0.463508393356937, 0.978150286262472, 0.978150286262472, -0.968088845878538, -0.124398198069055, 1.47207731715937, 1.57178681650986, 1.27870772031991, 1.1287578451833, 1.06031789670761, 0.400212511707879, -0.207012623385187, -0.93458929107348, -2.24450387316539, 0.968088845878538, -0.575430768607773, -0.738846849185214, 0.679217595655219, 1.26464113566108, 0.531604424103706, 0.288453003564521, -1.29014591650869, -1.26743441691691, 0.407010876264466, -1.42209043212232, 0.398855065642337, -0.226258980439831, 1.01312595979589, 0.245589523422081, -0.325239256402395, 0.0351000017727088, 0.852385797957575), BETA = c(0.00198916, -0.0109805, 0.00765789, -0.149708, 0.0225852, 0.148159, -0.0281357, -0.028136, 0.103634, 0.00314893, -0.0212581, -0.0212581, 0.0161786, -0.0745136, 0.139501, -0.0774387, 0.0209628, 0.0577324, -0.0191033, 0.0330887, 0.0330901, -0.025562, -0.00126148, 0.0439155, 0.0906229, 0.0540921, 0.0478291, 0.0255675, 0.0135413, -0.00585945, -0.0164868, -0.119141, 0.0259418, -0.183099, -0.0257248, 0.0400081, 0.182568, 0.00773019, 0.0147548, -0.0327346, -0.0154651, 0.0315515, -0.0640722, 0.0034205, -0.0238865, 0.0309572, 0.0157055, -0.0169812, 0.00182556, 0.0274896), SE = c(0.0274895, 0.0268163, 0.0558682, 0.175335, 0.0258707, 0.141956, 0.0392787, 0.0392787, 0.542386, 0.00908721, 0.0351191, 0.0351191, 0.0574542, 0.0380054, 0.0869389, 0.0273598, 0.0389586, 0.0319694, 0.0412681, 0.0338204, 0.0338204, 0.0264114, 0.0100911, 0.0298549, 0.0576995, 0.0423158, 0.0423857, 0.0241328, 0.033891, 0.0282659, 0.0176259, 0.0530988, 0.0268215, 0.317943, 0.0348059, 0.0589221, 0.144412, 0.0145595, 0.0512095, 0.0253839, 0.0122108, 0.0776434, 0.0450702, 0.00857457, 0.105857, 0.0305461, 0.0639575, 0.0521867, 0.0527002, 0.0322444), NSTUDY = c(5L, 5L, 2L, 5L, 7L, 2L, 5L, 5L, 2L, 5L, 5L, 5L, 4L, 8L, 2L, 2L, 3L, 4L, 4L, 5L, 5L, 5L, 4L, 5L, 3L, 4L, 4L, 4L, 3L, 4L, 4L, 3L, 5L, 2L, 4L, 7L, 2L, 6L, 2L, 4L, 6L, 4L, 3L, 6L, 2L, 7L, 3L, 3L, 4L, 4L)), row.names = c(NA, -50L), class = c("data.table", "data.frame"))

3. Session info

R version 4.2.2 (2022-10-31 ucrt) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale: [1] LC_COLLATE=English_United States.utf8 [2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8 [4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8

attached base packages: [1] stats4 stats graphics grDevices utils datasets methods
[8] base

other attached packages: [1] GenomeInfoDb_1.34.9 IRanges_2.32.0 S4Vectors_0.36.2
[4] BiocGenerics_0.44.0 data.table_1.14.8

Al-Murphy commented 1 year ago

This looks like an issue with the sheer number of SNPs so I'm not sure working on a bigger RAM machine would help (but it might be worth a try) - to test options I would need the full summary statistics however. Can you share them?

Snigireva commented 1 year ago

I found a way to overcome this by dividing the sumstats into smaller tables and then formatting them separately (and then join back into one), but I just hoped that there is a more beautiful way to handle this

Al-Murphy commented 1 year ago

Yeah I guess you could inspect SNPs in chunks rather than checking all at once - this would be a good feature enhancement. You would need to work out the cut-off for the number of SNPs and the chunk size as there would be a time trade-off. If you would like to make a PR with code for this it would be much appreciated, I don't have time to actively work on a solution for this at the minute.

Al-Murphy / MungeSumstats