I have a dataset that is very sparse. That is, it has many null fields and multiple variations of the same entity.
Essentially,
FN, LN, field1, field2, ... , fieldk, ... fieldN
filled, filled, null, null, ..., Value, Null, ... Null (This is entity 1)
filled (maybe with a typo, or more info than the row above), filled, null, ..., Value (with typo), (maybe this one is filled), ... Null (This is the same entity as 1)
filled (maybe with a typo, or more info than the row above), filled, null, ..., Value2 (different than above), (maybe this one is filled), ... Null (This is the same entity as 1)
Then we have other entities entirely.
I've been using minhashensemble code and have tried a few variants: indexing per column (to deal with nulls better), and concatenating all fields together with the word "null" (or just a space) for empty fields, evaluating different containment scores for each. Each variant seems to perform slightly better in some situations and slightly worse in others. Does anyone know of a better way to approach this type of problem, or can anyone recommend a resource for digging deeper into what may work here?
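To make the per-column idea concrete, here is a rough sketch of what I mean. It is a simplified stand-in, not my actual code: it uses exact Jaccard similarity over field-tagged character trigrams instead of MinHash, and the function names (`record_tokens`, `jaccard`) and the sample records are made up for illustration. The point is only that null fields contribute no tokens, so two empty fields never count as agreement.

```python
def record_tokens(record, n=3):
    """Build a set of (field, n-gram) tokens from a record, skipping null fields.

    `record` is a dict mapping field name -> value (or None for null).
    """
    tokens = set()
    for field, value in record.items():
        if value is None or not str(value).strip():
            continue  # nulls add nothing, so shared emptiness is not a match
        text = str(value).strip().lower()
        if len(text) < n:
            tokens.add((field, text))  # keep very short values whole
            continue
        for i in range(len(text) - n + 1):
            # tagging with the field name keeps columns from matching each other
            tokens.add((field, text[i:i + n]))
    return tokens


def jaccard(a, b):
    """Exact Jaccard similarity of two token sets (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical records: r1 and r2 are the same entity with typos, r3 is different.
r1 = {"FN": "Jonathan", "LN": "Smith", "field1": None, "fieldk": "Acme Corp"}
r2 = {"FN": "Jonathon", "LN": "Smith", "field1": None, "fieldk": "Acme Crop"}
r3 = {"FN": "Maria", "LN": "Lopez", "field1": "x123", "fieldk": None}

sim_same = jaccard(record_tokens(r1), record_tokens(r2))
sim_diff = jaccard(record_tokens(r1), record_tokens(r3))
print(f"same entity: {sim_same:.2f}, different entity: {sim_diff:.2f}")
```

In the concatenation variant, by contrast, every record shares the same "null" tokens, which I suspect inflates similarity between very sparse rows.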