crahal opened this issue 3 months ago · Open
What's the dimensionality of 'y' in this case?
~700k or so
There's an `expand.grid` of 1:2750 against 1:700k, and this is likely causing the overflow. I'll ponder a workaround and run some tests on this case. (So far, we've only tested merges of dimensionality ~100k.) More soon.
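For a rough sense of the scale involved (back-of-the-envelope arithmetic, not output from the package):

```r
# Size of the full cross of x (~2,750 aliases) against y (~700k aliases)
n_x <- 2750
n_y <- 7e5
n_x * n_y                    # ~1.9 billion candidate pairs
.Machine$integer.max         # 2147483647, the 32-bit limit this index space approaches
n_x * n_y * 2 * 4 / 1024^3   # ~14 GB just to hold a two-column integer index grid
```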
How detrimental to linkage performance would it be to iterate through chunks of 1k 'x' at a time? Is any of the training holistic, or are all of the linkages one-shot?
Linkages are one-shot, so iterating through chunks in the way described should give the same results. One qualification: the acceptable match threshold may be set dynamically given the input data; to disable that, set `AveMatchNumberPerAlias = NULL` and `MaxDist = c` for some floating-point constant `c`. In general it's hard to know what that `c` should be, but looking at a histogram of distances between matched/non-matched points, if available, can help.
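For concreteness, here's a minimal sketch of the chunked approach. The column names, the `MaxDist` value, and the exact `LinkOrgs()` arguments below are placeholders; check them against the LinkOrgs documentation for your version.

```r
library(LinkOrgs)

chunk_size <- 1000
chunk_id   <- ceiling(seq_len(nrow(x)) / chunk_size)

# Run a one-shot linkage for each 1k-row slice of x against the full y
results <- lapply(split(seq_len(nrow(x)), chunk_id), function(idx) {
  LinkOrgs(
    x = x[idx, , drop = FALSE],
    y = y,
    by.x = "org_name",              # placeholder column name in x
    by.y = "org_name",              # placeholder column name in y
    AveMatchNumberPerAlias = NULL,  # disable the data-driven threshold selection
    MaxDist = 0.35                  # fixed constant c; tune via a distance histogram
  )
})

# Because linkages are one-shot, stacking the per-chunk results is equivalent
linked <- do.call(rbind, results)
```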
You might also want to check out ZoomerJoin for a big matching task like this (it's specifically designed for very large merge tasks and computes matches approximately using locality-sensitive hashing). Ben (of ZoomerJoin) and I are in the process of adding ZoomerJoin capabilities to LinkOrgs, but in the meantime it wouldn't be too hard to output the machine-learned representations of the organizational aliases, which could then be fed into, e.g., ZoomerJoin.
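As a rough sketch of what an approximate LSH-based merge with zoomerjoin could look like (the argument names and values below are assumptions to verify against the zoomerjoin documentation):

```r
library(zoomerjoin)

# Approximate fuzzy join on organization names via MinHash/LSH
matches <- jaccard_inner_join(
  x, y,
  by = c("org_name" = "org_name"),  # placeholder join columns
  n_gram_width = 4,                 # size of the character shingles
  n_bands      = 60,                # more bands -> higher recall, more compute
  band_width   = 8,
  threshold    = 0.7                # minimum Jaccard similarity to keep a pair
)
```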
Whenever my 'x' is greater than ~2750 organisations, I get this error (on all different models):
Again, this is on Windows, R 3.4.