Testing de-duplication with striped RDF files shows that some edges are missing.
Steps to reproduce:
Export a total area as one large area.
Export the same total area as a number of strips.
Compare the number of edges (ch.ninecode.cim.make_edges = true) from the entire area with the number of edges from the striped files (supplied as a comma-separated list in the files parameter) after de-duplication (ch.ninecode.cim.do_deduplication = true).
Result:
The edge counts are 1700967 for the full area vs. 1700989 for the striped files (22 out of 1.7 million edges are missing from the full-area conversion but present in the striped conversion).
Expected:
The numbers should be the same (although the true number may differ from both, since even the striped conversion may be missing some edges).
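To see which edges differ rather than only how many, the two sets of edge identifiers can be compared directly. The following is a minimal sketch, assuming the edge identifiers from each conversion have already been collected into plain lists; the function name missing_edges and all identifiers in the example except KLE447955 are hypothetical.

```python
def missing_edges(full_ids, striped_ids):
    """Compare two collections of edge identifiers.

    Returns a pair of sorted lists:
    (ids only in the striped conversion, ids only in the full conversion).
    """
    full = set(full_ids)
    striped = set(striped_ids)
    return sorted(striped - full), sorted(full - striped)


# Hypothetical sample data; only KLE447955 is taken from this report.
full_area = ["KLE000001", "KLE000002", "KLE000003"]
striped = ["KLE000001", "KLE000002", "KLE000003", "KLE447955"]

only_in_striped, only_in_full = missing_edges(full_area, striped)
print("missing from full area:", only_in_striped)
print("missing from striped:", only_in_full)
```

A non-empty second list would indicate that the striped conversion is itself missing edges, which the count comparison alone cannot show.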
Probable cause:
The boundary between InputSplits (default 64MB) is not being handled correctly. Changing to an InputSplit size of 256MB (ch.ninecode.cim.split_maxsize = 256000000) reduces the number of missing edges to ten. One of the missing edges (KLE447955) lies very close to the InputSplit boundary between splits 88 and 89.
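The kind of boundary error suspected here can be illustrated with a small stand-alone sketch. It is not the reader's actual code: it assumes newline-terminated records instead of RDF elements, and all function names are hypothetical. It follows the usual convention for split readers, where a record belongs to the split in which it starts and the reader continues past the end of its split to finish that record, and contrasts it with a naive reader that drops any record crossing a boundary.

```python
def make_splits(total_size, split_size):
    """Return (start, end) byte ranges covering [0, total_size)."""
    return [(s, min(s + split_size, total_size))
            for s in range(0, total_size, split_size)]


def first_record_start(data, start):
    """Offset of the first record that begins at or after `start`."""
    if start == 0:
        return 0
    # A record begins right after a newline; looking from start - 1
    # also catches a record that begins exactly at `start`.
    idx = data.find(b"\n", start - 1)
    return len(data) if idx < 0 else idx + 1


def read_split(data, start, end, finish_straddling=True):
    """Yield the records that begin inside [start, end).

    With finish_straddling=False, a record that crosses `end` is dropped,
    which is the suspected kind of boundary bug.
    """
    pos = first_record_start(data, start)
    while pos < end and pos < len(data):
        idx = data.find(b"\n", pos)
        stop = len(data) if idx < 0 else idx + 1
        if stop > end and not finish_straddling:
            break  # naive reader: the straddling record is lost
        yield data[pos:stop].rstrip(b"\n")
        pos = stop


def read_all(data, split_size, finish_straddling=True):
    """Read every split in turn and concatenate the records."""
    out = []
    for start, end in make_splits(len(data), split_size):
        out.extend(read_split(data, start, end, finish_straddling))
    return out


# Hypothetical data: 50 records of 9 bytes each, including the newline.
records = [b"edge-%03d" % i for i in range(50)]
data = b"\n".join(records) + b"\n"

correct = read_all(data, 64)
naive_small = read_all(data, 64, finish_straddling=False)
naive_large = read_all(data, 256, finish_straddling=False)
print(len(records), len(correct), len(naive_small), len(naive_large))
```

In this sketch the naive reader loses one record per boundary that falls inside a record, so a larger split size loses fewer records. That is consistent with the drop in missing edges seen when the split size was increased, but it does not by itself confirm the cause.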