bnosac / doc2vec

Distributed Representations of Sentences and Documents

Possible Memory Leak in top2vec #19

Closed mlinegar closed 2 years ago

mlinegar commented 3 years ago

I have been running doc2vec and top2vec on a Unix server. However, as I increased the data size, I ran into the following error:

caught segfault address 0x7ef3c55842c8, cause 'memory not mapped'

This is happening when calling the following top2vec code:

t2v <- top2vec(d2v, control.dbscan = list(minPts = 25), control.umap = list(n_neighbors = 100L, n_components = 2, metric = "cosine"), umap = tumap, trace = FALSE)

The d2v model passed to top2vec was built with the following paragraph2vec call (note that I have also tried doc2vec models with fewer dimensions (50) and fewer iterations (25)):

d2v <- paragraph2vec(x = sample_text, type = "PV-DBOW", dim = 100, iter = 50, min_count = 10, lr = 0.05, threads = 6)

This only happens when running top2vec on my "full" data, which has 137649 rows and 4 columns (doc_id, date, origin, text) and takes up around 200 MB. When running on a subset of the data (a 20% sample), I do not run into this error. With the full data, doc2vec runs correctly; the issue is only with top2vec.

This seems to happen regardless of the options I specify for t2v (I've tried with different combinations of minPts, n_neighbors, and n_components). I've also tried increasing the amount of RAM. With this same dataset, I've tried using as much as 600GB at a time, with the same error.

I am happy to provide any other information that may be useful, and can email the data itself if that would be helpful.

Here is the traceback:

Traceback:
1: mrd(xdist, core_dist)
2: (function (x, minPts, gen_hdbscan_tree = FALSE, gen_simplified_tree = FALSE) { if (.matrixlike(x) && !inherits(x, "dist")) { x <- as.matrix(x) if (!is.numeric(x)) stop("hdbscan expects numerical data") xdist <- dist(x, method = "euclidean") } else if (inherits(x, "dist")) { xdist <- x } else { stop("hdbscan expects a matrix-coercible object of numerical data, and xdist to be a 'dist' object (or not supplied).") } core_dist <- kNNdist(x, k = minPts - 1) n <- attr(xdist, "Size") mrd <- mrd(xdist, core_dist) mst <- prims(mrd, n) hc <- hclustMergeOrder(mst, order(mst[, 3])) hc$call <- match.call() res <- computeStability(hc, minPts, compute_glosh = TRUE) res <- extractUnsupervised(res) cl <- attr(res, "cluster") sl <- attr(res, "salient_clusters") prob <- rep(0, length(cl)) for (cid in sl) { ccl <- res[[as.character(cid)]] max_f <- max(core_dist[which(cl == cid)]) pr <- (max_f - core_dist[which(cl == cid)])/max_f prob[cl == cid] <- pr } if (any(cl == 0)) { cluster <- match(cl, c(0, sl)) - 1 } else { cluster <- match(cl, sl) } cl_map <- structure(sl, names = unique(cluster[hc$order][cluster[hc$order] != 0])) cluster_scores <- sapply(sl, function(sl_cid) res[[as.character(sl_cid)]]$stability) names(cluster_scores) <- names(cl_map) attr(res, "cl_map") <- cl_map out <- structure(list(cluster = cluster, minPts = minPts, cluster_scores = cluster_scores, membership_prob = prob, outlier_scores = attr(res, "glosh"), hc = hc), class = "hdbscan", hdbscan = res) if (gen_hdbscan_tree) { out$hdbscan_tree = buildDendrogram(hc) } if (gen_simplified_tree) { out$simplified_tree = simplifiedTree(res) } return(out)})(minPts = 25, x = c(-0.0329070008023855, -0.0510561382993338, 0.31927777168907, 2.53887701866783, -0.197800866387713, -0.0769658005460379, -0.141690722726214, 0.215857990957868, -0.170577040933001, 0.418069847799909, -0.367248765252459, -0.183592072747576, -3.19788619043671, -0.377654544137346, 1.30107212898887, 0.265027531363142,
-0.136775246880877, 0.636537083365095, 0.761293419577253, -0.307090512536394, 0.400113590933454, 0.612990864493025, -3.08819507720314, 1.46403838036217, 0.912720688559187, -0.17281960608803, -1.50698255660378, -0.165271273873675, 0.103059776999128, 3.35260630486168, -0.259427062295305, -0.0712256348355886, 0.125617989279402, -0.27269434096657, -0.398004523537981, 0.534049042441023, -3.35195884766422, -0.436888686440813, 0.530528553702009, 0.447638043142927, -0.156622878335344, 0.605052956320417, 0.509713658072126, -4.14617001178108, -0.241759053490984, -0.243766061089861, 0.715384491659773, 1.73730564949669, -0.216015807412493, -1.56626986625038, -0.520728579781878, -3.89345022203766, -2.38994168402992, -0.457976571343767, -4.31358020665489, -2.2884445107206, -0.439960948251116, -4.19196390273415, -2.454303852342, -0.342407695077288, -4.55869232776009, -2.29146622779213, -0.0608339226468679, -0.498709193490374, -0.208178273461687, 0.0719955050722483, -3.76645163359486, -2.34971570136391, -2.45159172179543, -0.011860362313616, -0.380645505212176, -4.75415836932503, -0.243525973580706, 0.257605084158552, 0.0955443465486887, 2.65294791100181, 4.32672406075157, -0.166711322091448, 4.23832465050377, -1.14851283194863, 2.85808659432091, 2.91992665169395, 4.11696768639244, -2.65539663197838, -0.684868565820086, 0.0440962397829416, 4.12063742516197, 3.36015892861046, 2.63998938439049, 2.25346947548546, 3.19387198326744, 2.07401228783287, -0.316493502877581, -0.262502185128557, 1.12873745796837, 0.292615183569563, -0.332694760583269, 0.148568876959455, 0.171530255056989, -0.321022263787615, 0.587905415274275, 0.266159542776716, 2.31162405846275, 2.15795899269737, 0.0531840407625559, 1.00326491234459, -0.444835892938006, -0.517042628548968, 0.0712685668245676, -0.228896609567034, -0.433341017983782, -4.83604668738686, -4.84771465422951, -4.82384740474068, -4.73267149092995, -4.78702329757057, -0.184857121728289, 0.137701758124006, -0.323441497109759, -0.196935406945574, 
-0.514315358422625, -0.169835797570574, -2.27167653205239, -0.291279307626116, 1.14935017464317, -0.281418076775896, -2.83764975430809, -0.0430350220426199, 3.23981524346031, -0.343746892236101, 0.222118862845075, -0.239912024758684, 0.616745957113874, -0.307937852166521, -0.130656472466814, 0.129127272345197, -2.13983034255348, -0.236182442925799, 0.202927597738874, -0.236794940255511, -0.0531713879331228, -0.311888924859392, -0.391226521752703, 1.88042403099693, 0.366949566580427, 0.286372193075788, -0.11731742980324, -0.166264048837053, 0.63383150932945, -0.0246810829862234, -0.0281255161985037, 0.106927164770735, -0.27637242438637, -0.155985823891985, -1.22118734481178, -0.0537018692716238, -0.235035172723162, 0.78052044747032, -2.10444699885689, -0.229012242577898, 1.9815955245272, -0.157938710473406, -0.0347750103696463, 1.24469519493736, 0.551852711416853, 1.55943013069786, -1.31541370513283, 2.28270531532921, -3.14784016373001, 1.65187264321007, -0.344153395913469, 0.242526539541853, 1.43331766960777, -0.167527667306292, 3.05519915459312, 2.07082844612755, -0.391653291009295, 1.16398812172569, 0.522867211081159, -0.118953696511614, -0.857514849923479, 0.707190998770368, 0.0620434367433909, 0.929068573691023, 1.31006146309532, 0.465960034109724, 1.88661862251915, -1.43217562796913, 2.24886370537437, 1.55984593269981, -0.350220910333025, 0.0823924624697092, 3.17227221367515, 2.22453976509727, -0.298243752740252, -3.44783330621563, 1.25892639992393, 0.612939842917097, -0.932937852166521, 0.751771458365095, -0.0796437180265066, 0.0695803248659494, 0.619783886648786, 0.607982643820417, -0.322912446282732, -0.217329493783343, -0.500463238976824, -5.04296158912026, -4.81553637149178, 0.248554238058698, -4.42120861174904, -4.82453464629494, -0.383761397622454, -0.0846614754422781, -1.07703899504982, -0.210983029626238, -0.412444821618426, -1.29818033339821, -0.170177928231585, -1.98086117865883, 0.324098356939924, 1.63329077599205, 1.98234892723717, 
-0.623602858804094, 0.145401962973249, -0.286433926843035, -0.774450055383074, -0.21771501662575, -0.133796206735003, 0.919029244162214, -0.278431168816912, -0.977221480630266, -0.125824919961321, 0.0422182166353586, 0.0604941928163889, -0.613816491387713, -0.411679021142351, -0.661084643624651, -0.0911018765195486, -0.190765134118426, -0.212877026818621, -1.18815993430458, -0.25946568610512, 0.170583256460798, 0.283740528799665, -6.09404563071572, 2.97684813378013, 2.27240801689781, 2.55227471230186, -1.82800804736458, 2.43285418389, 2.78031254646934, 2.74097586510338, 3.01722336647667, 2.08739567635216, 3.56992293236412, 4.08850003121055, 0.689634808279646, -0.471400252602923, 2.91557122109093, 3.34113264916099, 3.50812483666099, 3.15320206520714, -1.30213593604408, -0.378706208489763, -0.649160376809466, 2.46073771355308, -0.245192519448626, -1.72508417727791, -4.76835965277992, -0.256590834878313, -0.252418748162615, 2.82091284630455, -0.233392707131731, -4.74632035853707, -0.186841956399309, -0.344854108117449, 3.06035519478477, -4.57136147382103, -5.16596185328804, -1.29047607543312, -0.226802340768206, -4.7419174826368, 0.176206835486067, -1.5842163479551, 0.0854582869783762, -0.0367193138822195, 1.53171587822593, 2.11590720055259, -4.84130679728829, -0.451010457299578, -0.00751732947670192, -0.160587540887224, -1.60486053588234, -0.294502726815569, -4.78683685424172, -4.22128152015053, -1.12536167266213, -0.3083362496122, 0.740309246756208, -4.78091870906197, -0.487423173211443, 2.79731036064781, -4.51877360703789, -4.71318780543648, -0.277513734124529, -0.347368947289812, -0.376001826547014, -3.05268930556618, -0.0252311146482107, -2.91930716397606, -4.73591219546639, -0.355184308312761, -0.861971846841204, 0.427985199667585, -0.298479548714983, 1.09996558067955, -0.245578519128191, -1.43292998435341, 0.108910568930281, 0.336952694632185, -0.209662429116594, 0.669828423239363, 2.31232977745689, -1.22503518225991, 2.11726237175621, -0.412832490228045, 
0.473070152975691, -0.578250399850237, -4.75037514331185, -0.196096173547136, -0.130308142922747, 1.36480570671715, -0.272620192788469, 3.51678849098839, -0.269890061639178, -0.274938813470232, -0.476217023156511, -0.652882090829241, -0.0501427567227957, -4.7687762892469, -0.448002568505633, -1.66389798285805, -3.66359101521217, -2.33136927249275, -4.7909817612394, -2.53429614665352, -0.942775956414568, -0.399054042123186, -2.44937121035897, -3.29428999783837, -4.81725525023781, -1.55176662566506, -1.55379437568032, -0.212881556771624, -0.125969401620257, 0.964581497885359, 0.0238509261385325, 2.79596520302452, 1.19639159081138, 1.11229992745079, -0.34817098738991, -0.423876515649187, -0.373279324792254, -0.249542228005755, -0.379079333566057, -0.274327985070574, -0.343841305993426, -1.11404203536354, -0.194974652551043, 0.577235707022322, -0.00770377280555934, 0.7233700835482, -0.101436606667864, -0.416516534112322, -0.361040583871233, 1.1137189948336, 2.98849154350914, 0.743658074118269, 0.0169775569216135, -0.46082591178261, 0.390957840658796, -0.374566308282244, -0.421959630273211, -0.466870061181414, -0.191645852349627, -0.362447730325091, 1.14317704079307, -0.372092238686907, 0.371094473578108, 0.00698710320152074, 1.04684258339561, 0.146469601370466, -0.323873511575091, 0.409598358847273, -0.360558024667132, -0.081912270806658, 2.95470524666466, 3.53493262169517, -0.393639317773211, 0.459333428122175, 0.907476433493269, 0.11545205948509, -0.264226189874041, -0.294573537133562, -0.294564238808977, -0.285635939858782, -0.159150830529558, -0.421795836709368, 3.44882679817833, -0.0787765896543142, -0.0108737862332937, 0.180952318884504, 1.02876473305382, 0.00838995812095433, -0.829011908791887, -4.75659834506355, -0.242485753320086, -0.746758929513323, -2.75293116929375, -0.609251490853655, -0.0299913799985525, -3.28620650532089, -0.431553832314837, 0.630290516592634, 0.512053021170271, 0.718517311789167, 0.328461416937483, 0.611931332327497, -0.106153002999651, 
-1.54916476371132, 2.51386023399986, 2.6255130851046, 2.91035128472008, 2.52646637795128, 1.97924519417442, 2.35376978752769, 1.198638924338, 1.80682755348839, 3.0093956076876, -1.94178520800911, 2.222245224692, -4.35300349356972, -0.737226954720843, 4.25545073387779, 2.45605135796226, 2.8155479514376, 1.03238392708458, 3.80362416145958, 2.93301487801231, 4.50513554451622, -0.435025683663714, -0.253192893288958, -0.212964049599993, -0.411410800240862, -0.172416440270769, 0.535430916525495, -0.273459903024065, -0.414951077721941, 0.121569403387678, -0.455059281609881, -0.264105311654436, -0.470468274377215, 1.65037728188194, -0.383880368493426, -0.302979937814104, -0.419511071465838, -0.138471118234026, -4.69355951907479, -0.257550946496355, 0.513857372976911, 0.595844753958357, -0.171005002282488, -0.522448531411516, -0.19066309096657, 1.93...
3: do.call(dbscan::hdbscan, control.dbscan)
4: top2vec(d2v, control.dbscan = list(minPts = 25), control.umap = list(n_neighbors = 100L, n_components = 2, metric = "cosine"), umap = tumap, trace = FALSE)
An irrecoverable exception occurred. R is aborting now ...
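
The traceback shows hdbscan building a full `dist` object from the embedding before it calls `mrd`. The arithmetic below is a rough sketch of how large that object would be for the row count reported above. It does not confirm the cause of the crash: the idea that some compiled code indexes this vector with a 32-bit integer is an assumption, not something stated in this thread, and the helper name `dist_length` is made up for illustration.

```r
# Number of documents reported in the issue.
n_docs <- 137649

# A "dist" object stores the lower triangle: n * (n - 1) / 2 doubles.
# This is computed in double precision on purpose; R integer arithmetic
# would overflow for values of this size.
dist_length <- function(n) n * (n - 1) / 2   # hypothetical helper name

n_pairs <- dist_length(n_docs)
size_gb <- n_pairs * 8 / 1e9                 # 8 bytes per double
int_max <- .Machine$integer.max              # 2^31 - 1

cat(sprintf("pairwise distances: %.0f\n", n_pairs))
cat(sprintf("approx. size:       %.1f GB\n", size_gb))
cat(sprintf("exceeds 32-bit int: %s\n", n_pairs > int_max))

# Same check for a 20% sample, which the report says runs without error.
n_sample <- round(0.2 * n_docs)
cat(sprintf("20%% sample exceeds 32-bit int: %s\n",
            dist_length(n_sample) > int_max))
```

Under these assumptions the full data needs roughly 76 GB for the distance vector, which fits comfortably in 600 GB of RAM, but the number of entries is above the 32-bit integer limit, while the 20% sample stays below it. That pattern would be consistent with more RAM not helping, though it remains a hypothesis.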

Here is the output of sessionInfo():

Matrix products: default
BLAS: /software/free/R/R-4.0.0/lib/R/lib/libRblas.so
LAPACK: /software/free/R/R-4.0.0/lib/R/lib/libRlapack.so

Random number generation:
RNG: L'Ecuyer-CMRG
Normal: Inversion
Sample: Rejection

locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] ranger_0.12.1 vctrs_0.3.7 rlang_0.4.10
[4] mosaicCore_0.9.0 yardstick_0.0.8 workflowsets_0.0.2
[7] workflows_0.2.2 tune_0.1.5 tidyr_1.1.3
[10] tibble_3.1.1 rsample_0.0.9 recipes_0.1.16
[13] purrr_0.3.4 parsnip_0.1.5 modeldata_0.1.0
[16] infer_0.5.4 ggplot2_3.3.3 dplyr_1.0.5
[19] dials_0.0.9 scales_1.1.1 broom_0.7.6
[22] tidymodels_0.1.3 lubridate_1.7.10 gsubfn_0.7
[25] proto_1.0.0 data.table_1.13.6 dbscan_1.1-8
[28] uwot_0.1.10 Matrix_1.3-2 stringr_1.4.0
[31] doc2vec_0.2.0 futile.logger_1.4.3

loaded via a namespace (and not attached):
[1] splines_4.0.0 foreach_1.5.1 here_0.1
[4] prodlim_2019.11.13 assertthat_0.2.1 conflicted_1.0.4
[7] GPfit_1.0-8 globals_0.14.0 ipred_0.9-11
[10] pillar_1.6.0 backports_1.2.0 lattice_0.20-41
[13] glue_1.4.2 pROC_1.17.0.1 digest_0.6.27
[16] pryr_0.1.4 hardhat_0.1.5 colorspace_2.0-0
[19] plyr_1.8.6 timeDate_3043.102 pkgconfig_2.0.3
[22] lhs_1.1.1 DiceDesign_1.9 listenv_0.8.0
[25] RSpectra_0.16-0 gower_0.2.2 lava_1.6.9
[28] generics_0.1.0 ellipsis_0.3.1 withr_2.3.0
[31] furrr_0.2.2 nnet_7.3-14 cli_2.4.0
[34] survival_3.2-7 magrittr_1.5 crayon_1.3.4
[37] memoise_1.1.0 ps_1.4.0 fansi_0.4.1
[40] future_1.21.0 parallelly_1.24.0 MASS_7.3-53
[43] class_7.3-17 tools_4.0.0 formatR_1.7
[46] lifecycle_1.0.0 munsell_0.5.0 lambda.r_1.2.4
[49] compiler_4.0.0 grid_4.0.0 rstudioapi_0.13
[52] iterators_1.0.13 RcppAnnoy_0.0.18 gtable_0.3.0
[55] codetools_0.2-18 DBI_1.1.0 R6_2.5.0
[58] utf8_1.1.4 rprojroot_1.3-2 futile.options_1.0.1
[61] stringi_1.5.3 parallel_4.0.0 Rcpp_1.0.6
[64] rpart_4.1-15 tidyselect_1.1.0

jwijffels commented 3 years ago

Thanks for the report. The function top2vec uses dbscan::hdbscan to cluster the embedding returned by UMAP (a document embedding space which, in your example, has 2 dimensions).

The error is coming from dbscan::hdbscan, so it is probably best raised at that repository (https://github.com/mhahsler/dbscan). To debug it, the authors will need the matrix that is passed on to hdbscan. The following reproduces the error outside of top2vec and gives you embedding_umap, which you can provide in your report at https://github.com/mhahsler/dbscan

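# Document embeddings from the fitted paragraph2vec model
emb <- as.matrix(d2v)
# Same UMAP settings as in the top2vec call above
embedding_umap <- uwot::tumap(emb, n_neighbors = 100L, n_components = 2, metric = "cosine")
# Expected to segfault on the full data
thisshouldfail <- dbscan::hdbscan(embedding_umap, minPts = 25)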
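
Before filing the report, it may help to save the matrix to a file and to check whether the crash depends on the number of rows. The sketch below is only illustrative: the random matrix stands in for the real `embedding_umap`, the helpers `dist_is_int32_safe` and `subsample_rows` are hypothetical names, and the 32-bit threshold is an assumption about where the problem might lie, not a documented limit of dbscan. Note that clustering a subsample gives a different result from clustering all documents, so this is a diagnostic step and not a fix.

```r
# Stand-in for the real `embedding_umap` (random data, so the sketch runs
# on its own; replace it with the matrix returned by uwot::tumap).
set.seed(1)
embedding_umap <- matrix(rnorm(2000 * 2), ncol = 2)

# Hypothetical helper: TRUE when the "dist" vector for n rows has no more
# entries than a 32-bit integer can index.
dist_is_int32_safe <- function(n) n * (n - 1) / 2 <= .Machine$integer.max

# Hypothetical helper: keep at most `max_rows` randomly chosen rows.
subsample_rows <- function(x, max_rows = 65536L) {
  if (nrow(x) <= max_rows) return(x)
  x[sort(sample.int(nrow(x), max_rows)), , drop = FALSE]
}

emb_small <- subsample_rows(embedding_umap)
stopifnot(dist_is_int32_safe(nrow(emb_small)))

# Save the full matrix so it can be attached to a bug report.
path <- file.path(tempdir(), "embedding_umap.rds")
saveRDS(embedding_umap, path)

# Only attempt the clustering when the dbscan package is installed.
if (requireNamespace("dbscan", quietly = TRUE)) {
  cl <- dbscan::hdbscan(emb_small, minPts = 25)
  print(table(cl$cluster))
}
```

If hdbscan succeeds on the reduced matrix but crashes on the full one, that observation is worth including in the report together with the saved file.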

jwijffels commented 2 years ago

Closing as this is coming from dbscan::hdbscan and followed up at https://github.com/mhahsler/dbscan/issues/46