Open Snigireva opened 1 year ago
This looks like an issue with the sheer number of SNPs, so I'm not sure a machine with more RAM would help (though it might be worth a try). To test options I would need the full summary statistics, however. Can you share them?
I found a way around this by splitting the sumstats into smaller tables, formatting them separately, and then joining them back into one, but I was hoping there is a more elegant way to handle this.
Yeah, I guess you could inspect SNPs in chunks rather than checking them all at once; this would be a good feature enhancement. You would need to work out the cut-off for the number of SNPs and the chunk size, as there would be a time trade-off. If you would like to make a PR with code for this, it would be much appreciated; I don't have time to actively work on a solution for this at the minute.
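For anyone landing here with the same problem, the chunked approach discussed above could look roughly like the sketch below. It is only an illustration under assumptions: `format_chunk`, `format_in_chunks`, `chunk_size` and `cutoff` are hypothetical names, and `format_chunk` is a trivial placeholder for whatever formatting call is actually being run. Only `data.table` is used.

```r
library(data.table)

# Hypothetical stand-in for the real formatting step; replace the body
# with the actual call you run on a single table.
format_chunk <- function(dt) {
  dt <- copy(dt)            # leave the caller's table untouched
  dt[, A1 := toupper(A1)]   # placeholder "formatting"
  dt
}

# Format `dt` in consecutive chunks of at most `chunk_size` rows.
# Tables with `cutoff` rows or fewer are formatted in one go, since
# chunking small inputs only adds overhead.
format_in_chunks <- function(dt, chunk_size, cutoff = chunk_size,
                             fmt = format_chunk) {
  stopifnot(chunk_size >= 1)
  if (nrow(dt) <= cutoff) return(fmt(dt))
  chunk_id <- ceiling(seq_len(nrow(dt)) / chunk_size)
  row_idx <- split(seq_len(nrow(dt)), chunk_id)   # row numbers per chunk
  pieces <- lapply(row_idx, function(idx) fmt(dt[idx]))
  rbindlist(pieces, use.names = TRUE)             # join back into one table
}

# Toy data, not taken from the real summary statistics
toy <- data.table(SNP = paste0("rs", 1:5),
                  A1 = c("a", "c", "g", "t", "a"))
out <- format_in_chunks(toy, chunk_size = 2, cutoff = 3)
```

One caveat with this design: any check that compares rows with each other (for example, duplicated SNP IDs) only sees one chunk at a time, so such checks may need a separate pass over the joined table. The right `cutoff` and `chunk_size` would have to be found by timing runs on real data.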
1. Bug description
Hi! I ran this code to standardize the summary statistics:
Any idea what to do about that?
Console output
Data (for the first 50 rows)
df = structure(list(SNP = c("rs367896724", "rs145", "rs534229142", "rs537182", "rs376342519", "rs5586", "rs575272151", "rs544419", "rs5611", "rs54", "rs62635286", "rs62", "rs53173", "rs538791886", "rs558318514", "rs55476", "rs574697788", "rs554", "rs546169444", "rs7", "rs54194", "rs6682385", "rs199856693", "rs3982632", "rs576", "rs2758118", "rs2758118", "rs53363", "rs564", "rs374", "rs2691317", "rs2691315", "rs5575142", "rs541172944", "rs548165136", "rs755466349", "rs539235482", "rs199745162", "rs578", "rs564", "rs533", "rs8", "rs545414834", "rs54", "rs532819925", "rs1", "rs5677884", "rs553572247", "rs539322794", "rs542415"), CHR = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), BP = c(10177L, 10352L, 10511L, 10539L, 10616L, 10642L, 11008L, 11012L, 11063L, 13110L, 13116L, 13118L, 13273L, 13289L, 13445L, 13483L, 13494L, 13550L, 14464L, 14599L, 14604L, 14930L, 14933L, 15211L, 15245L, 15274L, 15274L, 15585L, 15644L, 15774L, 15777L, 15820L, 15903L, 16071L, 16142L, 16226L, 16542L, 16949L, 17641L, 18643L, 18849L, 30923L, 46285L, 47159L, 47267L, 49298L, 49315L, 49343L, 49554L, 50891L ), PVAL = c(0.942, 0.682, 0.891, 0.393, 0.383, 0.297, 0.474, 0.474, 0.848, 0.729, 0.545, 0.545, 0.778, 0.0499, 0.109, 0.00465, 0.591, 0.0709, 0.643, 0.328, 0.328, 0.333, 0.901, 0.141, 0.116, 0.201, 0.259, 0.289, 0.689, 0.836, 0.35, 0.0248, 0.333, 0.565, 0.46, 0.497, 0.206, 0.595, 0.773, 0.197, 0.205, 0.684, 0.155, 0.69, 0.821, 0.311, 0.806, 0.745, 0.972, 0.394), A1 = c("AC", "TA", "A", "A", "CCGCCGTTGCAAAGGCGCGCCG", "A", "G", "G", "G", "A", "G", "G", "C", "C", "G", "C", "G", "A", "T", "A", "G", "A", "A", "T", "T", "A", "G", "A", "A", "A", "G", "T", "GC", "A", "A", "A", "A", "C", "A", "A", "C", "G", "A", "C", "G", "T", "A", "C", "G", "C"), A2 = c("A", "T", "G", "C", "C", "G", "C", "C", "T", "G", "T", "A", "G", "CCT", "C", "G", "A", "G", "A", 
"T", "A", "G", "G", "G", "C", "T", "A", "G", "G", "G", "A", "G", "G", "G", "G", "AG", "C", "A", "G", "G", "G", "T", "ATAT", "T", "T", "C", "T", "T", "A", "T"), N = c(8160L, 8160L, 361237L, 16026L, 372627L, 361266L, 8160L, 8160L, 357928L, 363969L, 8160L, 8160L, 3701L, 378761L, 357928L, 357928L, 358181L, 367239L, 6832L, 8160L, 8160L, 8160L, 358725L, 8160L, 362555L, 3701L, 3701L, 369481L, 362738L, 364049L, 362923L, 2373L, 8160L, 375575L, 367282L, 26547L, 357680L, 364788L, 357928L, 361989L, 368762L, 3701L, 359800L, 364512L, 361256L, 10040L, 362387L, 362834L, 6832L, 367281L), Z = c(0.0727563581760374, -0.409735480321281, 0.137038959961148, -0.854189500094597, 0.872382030909752, 1.04288836267464, -0.715985989610205, -0.715985989610205, 0.19167090224842, 0.346456061065837, -0.605269414941509, -0.605269414941509, 0.281926329587061, -1.96082020683793, 1.60270409055176, -2.83033010490082, 0.537387465090095, 1.80611742223106, -0.463508393356937, 0.978150286262472, 0.978150286262472, -0.968088845878538, -0.124398198069055, 1.47207731715937, 1.57178681650986, 1.27870772031991, 1.1287578451833, 1.06031789670761, 0.400212511707879, -0.207012623385187, -0.93458929107348, -2.24450387316539, 0.968088845878538, -0.575430768607773, -0.738846849185214, 0.679217595655219, 1.26464113566108, 0.531604424103706, 0.288453003564521, -1.29014591650869, -1.26743441691691, 0.407010876264466, -1.42209043212232, 0.398855065642337, -0.226258980439831, 1.01312595979589, 0.245589523422081, -0.325239256402395, 0.0351000017727088, 0.852385797957575), BETA = c(0.00198916, -0.0109805, 0.00765789, -0.149708, 0.0225852, 0.148159, -0.0281357, -0.028136, 0.103634, 0.00314893, -0.0212581, -0.0212581, 0.0161786, -0.0745136, 0.139501, -0.0774387, 0.0209628, 0.0577324, -0.0191033, 0.0330887, 0.0330901, -0.025562, -0.00126148, 0.0439155, 0.0906229, 0.0540921, 0.0478291, 0.0255675, 0.0135413, -0.00585945, -0.0164868, -0.119141, 0.0259418, -0.183099, -0.0257248, 0.0400081, 0.182568, 0.00773019, 0.0147548, 
-0.0327346, -0.0154651, 0.0315515, -0.0640722, 0.0034205, -0.0238865, 0.0309572, 0.0157055, -0.0169812, 0.00182556, 0.0274896), SE = c(0.0274895, 0.0268163, 0.0558682, 0.175335, 0.0258707, 0.141956, 0.0392787, 0.0392787, 0.542386, 0.00908721, 0.0351191, 0.0351191, 0.0574542, 0.0380054, 0.0869389, 0.0273598, 0.0389586, 0.0319694, 0.0412681, 0.0338204, 0.0338204, 0.0264114, 0.0100911, 0.0298549, 0.0576995, 0.0423158, 0.0423857, 0.0241328, 0.033891, 0.0282659, 0.0176259, 0.0530988, 0.0268215, 0.317943, 0.0348059, 0.0589221, 0.144412, 0.0145595, 0.0512095, 0.0253839, 0.0122108, 0.0776434, 0.0450702, 0.00857457, 0.105857, 0.0305461, 0.0639575, 0.0521867, 0.0527002, 0.0322444), NSTUDY = c(5L, 5L, 2L, 5L, 7L, 2L, 5L, 5L, 2L, 5L, 5L, 5L, 4L, 8L, 2L, 2L, 3L, 4L, 4L, 5L, 5L, 5L, 4L, 5L, 3L, 4L, 4L, 4L, 3L, 4L, 4L, 3L, 5L, 2L, 4L, 7L, 2L, 6L, 2L, 4L, 6L, 4L, 3L, 6L, 2L, 7L, 3L, 3L, 4L, 4L)), row.names = c(NA, -50L), class = c("data.table", "data.frame"))
3. Session info
R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] GenomeInfoDb_1.34.9 IRanges_2.32.0      S4Vectors_0.36.2
[4] BiocGenerics_0.44.0 data.table_1.14.8