cjerzak / LinkOrgs-software

LinkOrgs: An R package for linking records on organizations using half a billion open-collaborated records from LinkedIn
https://arxiv.org/abs/2302.02533v3
MIT License

integer overflow #6

Open crahal opened 2 weeks ago

crahal commented 2 weeks ago

Whenever my 'x' is greater than ~2750 organisations, I get this error (on all of the different models):

Error in if (machine == "localhost") "localhost" else getClusterOption("master",  : 
  missing value where TRUE/FALSE needed
In addition: Warning message:
In nrow(x) * nrow(y) : NAs produced by integer overflow

Again on Windows, R 3.4.

cjerzak commented 2 weeks ago

What's the dimensionality of 'y' in this case?

crahal commented 2 weeks ago

~700k or so

cjerzak commented 2 weeks ago

There's an expand.grid of 1:2750 against 1:700k, and this is likely causing the overflow. I'll ponder a workaround and run some tests on this case. (So far, we've only tested merges of dimensionality ~100k.) More soon.
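
A minimal sketch of the overflow itself, in base R. The sizes below are illustrative assumptions (the thread only gives approximate dimensions), and the variable names are hypothetical. `nrow()` returns an integer, so the product of two row counts becomes NA once it exceeds `.Machine$integer.max`; an NA that later reaches an `if` condition raises "missing value where TRUE/FALSE needed".

```r
# Illustrative sizes only; the thread reports approximate values.
n_x <- 3100L
n_y <- 700000L

# Integer multiplication past .Machine$integer.max (2147483647)
# returns NA with an "integer overflow" warning.
n_pairs_int <- suppressWarnings(n_x * n_y)
overflowed <- is.na(n_pairs_int)

# Converting one operand to double before multiplying avoids the overflow.
n_pairs_dbl <- as.numeric(n_x) * n_y
exceeds_int_max <- n_pairs_dbl > .Machine$integer.max
```

Computing the pair count in double precision only fixes the arithmetic; materialising every pair at once would still need memory for roughly two billion rows.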

crahal commented 2 weeks ago

How detrimental to linkage performance would it be to iterate through chunks of 1k 'x' at a time? Is any of the training holistic, or are all of the linkages one-shot?

cjerzak commented 2 weeks ago

Linkages are one-shot, so iterating through chunks in the way described should give the same results. One qualification: the acceptable match threshold might be set dynamically from the input data. To disable that, one can set AveMatchNumberPerAlias = NULL and set MaxDist = c for some floating-point constant c.

In general, it's hard to know what c should be, but looking at a histogram of the distances for matched and non-matched pairs, if available, can help.
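
The chunked iteration discussed above could be sketched as follows. This is a generic base-R helper under stated assumptions, not part of LinkOrgs: `link_in_chunks` and `toy_link` are hypothetical names, and `toy_link` is an exact-match stand-in for the real linker so that the example runs on its own.

```r
# Hypothetical helper: link x against y in blocks of at most chunk_size rows.
link_in_chunks <- function(x, y, link_fn, chunk_size = 1000L) {
  # Split the row indices of x into consecutive blocks.
  idx <- seq_len(nrow(x))
  blocks <- split(idx, ceiling(idx / chunk_size))
  # Link each block of x against all of y, then stack the results.
  pieces <- lapply(blocks, function(rows) link_fn(x[rows, , drop = FALSE], y))
  do.call(rbind, unname(pieces))
}

# Toy stand-in for the linker (an assumption): exact match on 'name'.
toy_link <- function(x, y) merge(x, y, by = "name")

x <- data.frame(name = c("a", "b", "c", "d", "e"), stringsAsFactors = FALSE)
y <- data.frame(name = c("b", "d", "e", "z"), id = 1:4, stringsAsFactors = FALSE)

chunked <- link_in_chunks(x, y, toy_link, chunk_size = 2L)
full <- toy_link(x, y)
```

In practice `link_fn` would wrap the actual linkage call, with AveMatchNumberPerAlias = NULL and a fixed MaxDist so that every chunk is held to the same threshold. Each chunk then only forms about chunk_size times nrow(y) candidate pairs, which keeps the pair count well under the integer limit for 1k chunks against ~700k rows.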

You might also want to check out ZoomerJoin for a big matching task like this. It's specifically designed for very large merge tasks and computes matches approximately, using locality-sensitive hashing. Ben (of ZoomerJoin) and I are in the process of adding ZoomerJoin capabilities to LinkOrgs, but in the meantime it wouldn't be too hard to output the machine-learned representations of the organizational aliases, which could then be fed into, e.g., ZoomerJoin.