bitwalker / libgraph

A graph data structure library for Elixir projects
MIT License

Duplicate vertex ids when adding high number of vertices #44

Open DavidB59 opened 2 years ago

DavidB59 commented 2 years ago

When adding a large number of vertices, some of my vertices weren't added to the graph.

For example

    vertices = 0..250000 |> Enum.map(& &1)
    Graph.add_vertices(Graph.new, vertices)
    #Graph<type: directed, num_vertices: 249997, num_edges: 0>

I fixed it for myself by changing Graph.Utils.vertex_id/1 from def vertex_id(v), do: :erlang.phash2(v, @max_phash) to def vertex_id(v), do: v.

Apparently I had duplicate keys in my graph otherwise.
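For what it's worth, the collisions can be reproduced outside libgraph by hashing the same range with :erlang.phash2/2 and counting the repeats. The 2^32 range below is an assumption, based on the @max_phash discussion further down the thread.

    # Count how many of the 250_001 integers collide when hashed into a 2^32
    # range (assumed to match libgraph's @max_phash).
    vertices = Enum.to_list(0..250_000)
    hashes = Enum.map(vertices, &:erlang.phash2(&1, 4_294_967_296))
    length(vertices) - length(Enum.uniq(hashes))
    # => a small positive number (the vertices lost to shared ids)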

Sgiath commented 2 years ago

@DavidB59 Advent of Code 2021, day 15, am I right? :D

I had a different problem: when adding a huge number of edges (250_000), some edges ended up in the graph which were not present in the edges list. Probably related to ^^

I fixed it by filtering out the bad edges from the graph and then deleting them:

    graph = Graph.add_edges(Graph.new(), edges(points))

    # WTF? I have no idea why those edges are created. They are not returned by edges/1 function
    # but somehow end up in the final graph. Bug in libgraph?
    fail =
      graph
      |> Graph.edges()
      |> Enum.filter(fn %Graph.Edge{v1: {x1, y1}, v2: {x2, y2}} ->
        abs(x1 - x2) > 1 and abs(y1 - y2) > 1
      end)

    graph
    |> Graph.delete_edges(fail)

But your fix is better and works for my case too.

rlipscombe commented 2 years ago

This PR: https://github.com/bitwalker/libgraph/pull/43 allows user-defined vertex IDs, so (knowing that the x,y values don't need hashing, and that the grid has a fixed size) you could just use y * max_x + x instead of an actual hash.
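As a rough sketch, assuming the option from #43 ends up exposed as a vertex_identifier function passed to Graph.new/1 (check the released API for the exact name), that would look something like:

    # Hypothetical sketch: use the grid position itself as the vertex id
    # instead of phash2. The :vertex_identifier option name is assumed here.
    max_x = 100

    graph =
      Graph.new(vertex_identifier: fn {x, y} -> y * max_x + x end)
      |> Graph.add_vertices(for y <- 0..99, x <- 0..(max_x - 1), do: {x, y})

Since each {x, y} maps to a distinct integer, no two vertices can share an id.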

joegiralt commented 2 years ago

I just want to say that I had exactly the same issue. I, too, was trying to figure out why Advent of Code 2021, day 15 was breaking for me, and I'm glad I found this!

bitwalker commented 2 years ago

#43 has been merged, and will be released in 0.15.0, which at least provides a solution for situations in which you can provide a better vertex ID than phash2.

I'm not sure why hashing a set of 250k integers produces duplicates with phash2 - the number of unique hashes is 2^32, so I suspect it is some internal detail of phash2 that is causing issues here. That hash was selected due to its flexibility and efficiency, but there are certainly use cases where a better hash is desirable, so #43 is certainly a welcome addition.

renaudlenne commented 2 years ago

#43 seems to be a good way to solve this issue, but I think there is a problem with its implementation:

Graph.Pathfinding.do_bfs/5 is still using Graph.Utils.vertex_id

https://github.com/bitwalker/libgraph/blob/f033439e2433a106e4104ed4540f67fd9381a53c/lib/graph/pathfinding.ex#L130-L136

and thus all pathfinding will fail when using a custom vertex id.

I've opened a new PR (https://github.com/bitwalker/libgraph/pull/47) to fix this.
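For illustration, here is roughly how the mismatch would surface (a hypothetical repro, not a confirmed test case; the :vertex_identifier option name is assumed from #43):

    # Vertices get stored under the custom ids, but do_bfs/5 still looks them
    # up via Graph.Utils.vertex_id/1 (phash2), so the lookup misses.
    g =
      Graph.new(vertex_identifier: fn v -> v end)
      |> Graph.add_edges([{:a, :b}, {:b, :c}])

    Graph.get_shortest_path(g, :a, :c)
    # expected [:a, :b, :c]; with the mismatch described above this comes back nil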

anttipoi commented 2 years ago

I think that hashing the vertex and not taking collisions into account is an incorrect approach to determining the identity of vertices. An ID needs to be unique, so collisions are a no-no, right?

Since Elixir has nice value-based semantics for equality, the suggested work-around

def vertex_id(v), do: v

looks like a correct approach.

Large graphs expose the issue, but all it takes is a v1 and v2 that hash to the same value to cause the problem.

anttipoi commented 2 years ago

> I'm not sure why hashing a set of 250k integers produces duplicates with phash2 - the number of unique hashes is 2^32, so I suspect it is some internal detail of phash2 that is causing issues here. That hash was selected due to its flexibility and efficiency, but there are certainly use cases where a better hash is desirable, so #43 is certainly a welcome addition.

The birthday paradox raises its ugly head. A set of n elements contains n(n-1)/2 pairs of distinct elements, or in this case 31_249_875_000. All of those pairs need to avoid a collision for the hashing to work.
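A back-of-the-envelope estimate of how many collisions to expect (a rough uniform-hash approximation, nothing libgraph-specific):

    # Expected number of colliding pairs for n values hashed uniformly into a
    # 2^32 space is roughly pairs / 2^32.
    n = 250_000
    pairs = div(n * (n - 1), 2)
    pairs / :math.pow(2, 32)
    # => ~7.3, so losing a handful of vertices is exactly what you'd expect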

rlipscombe commented 2 years ago

IIRC, a 32-bit result only allows for about 2^16 items before you exceed a 50% chance of collision, even with perfect hashing (which phash2 is decidedly not), and 250K > 65K. A pluggable hashing algorithm (#43) is good because it would allow us to use, e.g., 256-bit hashing.
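As a quick sanity check of that rule of thumb, the usual birthday-bound approximation puts the 50% point for a b-bit hash at about 1.1774 * sqrt(2^b):

    # Approximate item count at which a b-bit hash reaches a 50% collision
    # probability: sqrt(2 * ln 2 * 2^b) ≈ 1.1774 * sqrt(2^b).
    n50 = fn bits -> 1.1774 * :math.sqrt(:math.pow(2, bits)) end
    n50.(32)
    # => ~77_163, the same ballpark as the 2^16 = 65_536 rule of thumb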

bitwalker commented 2 years ago

@anttipoi The reason to have a notion of a "vertex id" at all is to avoid having many copies of large vertex values in the graph data structure. By using integer values, we can keep the overall size of the graph as small as possible. Naturally, this benefit doesn't apply when you are using integers for the vertex values to begin with, which is the source of the issue here. So the pluggable hashing available in 0.15.0 provides a solution for this, but if anyone has additional improvements/feedback around this, I'm certainly open to further discussion.
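Tying that back to the original repro, the 0.15.0 escape hatch would look something like this (a sketch assuming the :vertex_identifier option from #43; verify the exact name against the release):

    # Hypothetical sketch: skip hashing entirely when the vertices are already
    # small, unique terms, so ids cannot collide.
    g =
      Graph.new(vertex_identifier: fn v -> v end)
      |> Graph.add_vertices(Enum.to_list(0..250_000))

    Graph.num_vertices(g)
    # => 250_001, with nothing lost to collisions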

jkbbwr commented 2 years ago

I'm struggling with the same Advent of Code problem with libgraph; I can't tell if it's my code that is wonky (or whether I'm hitting a similar but different bug).

https://gist.github.com/jkbbwr/eaec9b319909bd98e17feae54aca8856

I know this isn't the place for getting help with Advent of Code, but I can't tell if I'm being dumb or if this is a legit bug.

stevensonmt commented 5 months ago

I believe this issue can be closed as fixed.

vegabook commented 5 months ago

> #43 has been merged, and will be released in 0.15.0, which at least provides a solution for situations in which you can provide a better vertex ID than phash2.
>
> I'm not sure why hashing a set of 250k integers produces duplicates with phash2 - the number of unique hashes is 2^32, so I suspect it is some internal detail of phash2 that is causing issues here. That hash was selected due to its flexibility and efficiency, but there are certainly use cases where a better hash is desirable, so #43 is certainly a welcome addition.

Selecting 250_000 numbers from 2^32 possibilities is almost certain, probabilistically, to create at least one clash. Using the birthday problem approximate formula:

iex(1)> prob = fn(poss, n) -> 1 - :math.exp(-((n ** 2) / (2 * poss))) end
#Function<43.65746770/2 in :erl_eval.expr/5>
iex(2)> prob.(2 ** 32, 9292)
0.010001099079572806
iex(3)> prob.(2 ** 32, 250000)
0.999308022843814
iex(4)>

99.93% chance.

Even just over 9k items already carries a 1% chance of a hash collision.

You probably want something bigger, e.g. SHA-256:

Enum.map(1..250000, fn(x) -> :crypto.hash(:sha256, :erlang.term_to_binary(x)) end)

With 256 bits, the probability of a clash becomes tiny, and only becomes noticeable after around 10**35 items or so, which is far bigger than anything that would fit in memory.

prob = fn(poss, n) -> 1 - :math.exp(-((n ** 2) / (2 * poss))) end
Enum.map(0..40, fn(p) -> [ten_to_power: p, num_vertices: 10**p, probability_of_clash: prob.(2**256, 10**p)] end)