derrickoswald / CIMSpark

Spark access to Common Information Model (CIM) files
MIT License

Dropped Elements #6

Closed by derrickoswald 7 years ago

derrickoswald commented 7 years ago

Testing de-duplication with striped RDF files reveals some missing edges.

Steps to reproduce:

Result: The edge counts are 1700967 (full area conversion) vs. 1700989 (striped area conversion): 22 of about 1.7 million edges are missing from the full area conversion but present in the striped area conversion.

Expected: The two counts should be the same (although the true number may be different again, since even the striped conversion may be missing some edges).

Probable cause: The boundary between InputSplits (default 64MB) is not being handled correctly. Changing to an InputSplit size of 256MB (ch.ninecode.cim.split_maxsize = 256000000) reduces the number of missing features to ten. One of the missing edges (KLE447955) lies very close to the InputSplit boundary between splits 88 and 89.
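
To illustrate the suspected failure mode, here is a minimal Python sketch (not the CIMSpark reader, which is not shown in this issue). It assumes a flat file of hypothetical `<edge>...</edge>` elements and hypothetical helper names. A naive reader that only looks inside its own split drops every element that straddles a boundary, while a reader that owns each element whose opening tag starts in its split, and reads past the split end to finish it, loses nothing.

```python
# Illustrative only: tag names, helper names and sizes are assumptions,
# not taken from the CIMSpark source.
OPEN, CLOSE = "<edge>", "</edge>"


def split_ranges(total, split_size):
    """Return [start, end) byte ranges covering the whole input."""
    return [(s, min(s + split_size, total)) for s in range(0, total, split_size)]


def read_naive(data, start, end):
    """Return only elements that lie entirely inside [start, end)."""
    chunk = data[start:end]
    found, pos = [], 0
    while True:
        begin = chunk.find(OPEN, pos)
        if begin < 0:
            break
        stop = chunk.find(CLOSE, begin)
        if stop < 0:
            break  # element continues in the next split and is dropped
        stop += len(CLOSE)
        found.append(chunk[begin:stop])
        pos = stop
    return found


def read_boundary_aware(data, start, end):
    """Return elements whose opening tag starts in [start, end),
    reading beyond `end` when needed to reach the closing tag."""
    found, pos = [], start
    while True:
        begin = data.find(OPEN, pos)
        if begin < 0 or begin >= end:
            break  # the next element belongs to a later split
        stop = data.find(CLOSE, begin)
        if stop < 0:
            break  # truncated input
        stop += len(CLOSE)
        found.append(data[begin:stop])
        pos = stop
    return found


# Ten 19-character elements, split every 50 characters.
data = "".join(f"{OPEN}KLE{i:03d}{CLOSE}" for i in range(10))
splits = split_ranges(len(data), 50)

naive = [e for s, t in splits for e in read_naive(data, s, t)]
aware = [e for s, t in splits for e in read_boundary_aware(data, s, t)]
print(len(naive), len(aware))
```

In this toy input the elements lost by the naive reader are exactly those crossing a split boundary, so larger splits mean fewer boundaries and fewer losses, which is consistent with the drop from 22 to ten seen above.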

derrickoswald commented 7 years ago

In R, use: diff = (!redges_all$id_equ %in% redges_striped$id_equ)
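
Note that in R `!a %in% b` parses as `!(a %in% b)`, so the expression above flags rows of redges_all whose id_equ does not occur in redges_striped. The same comparison can be sketched in Python (rather than R) with set differences, checking both directions; the function name and the toy IDs below are hypothetical.

```python
def missing_ids(reference_ids, candidate_ids):
    """Return the sorted IDs present in reference_ids but absent from candidate_ids."""
    return sorted(set(reference_ids) - set(candidate_ids))


# Toy data standing in for the id_equ columns of the two conversions.
edges_all = ["KLE1", "KLE2", "KLE4"]
edges_striped = ["KLE1", "KLE2", "KLE3", "KLE4"]

# Edges only in the striped conversion (dropped from the full conversion).
print(missing_ids(edges_striped, edges_all))
# Edges only in the full conversion.
print(missing_ids(edges_all, edges_striped))
```

Checking both directions matters here because the dropped edges are the ones present in the striped conversion but absent from the full one.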