bitwalker / libgraph

A graph data structure library for Elixir projects
MIT License

Unexpectedly endless pathfinding beyond a certain graph size. #72

Open centuriononon opened 9 months ago

centuriononon commented 9 months ago

Machine: Intel Core i5-8250U (1.60GHz), 16 GB RAM
OS: Ubuntu 22.04
Elixir version: 1.15.0-rc.0
OTP version: 25.0
libgraph version: 0.16.0

I use a graph to accumulate pages while I crawl. The crawler is a stream of pages: an above page (the previous one) and a sub page (the current one). I have to record this relationship so that I can find paths between pages later.

I noticed that beyond a certain number of nodes it becomes practically impossible to compute paths. When my crawler reached its goal and needed to collect the paths from page A to page B, the calculation ran all night without producing a result (it was definitely still running, judging by the load).

I decided to log every thousandth recorded relationship and, at that point, try to find all paths from the starting page to the current page (I record the page to sub page relationship).

The results are as follows (sub page to above page relation -> time to find all paths from the first page to that sub page):

I used this to record each relationship:

# add both pages as vertices (labeled with their structs) and link above -> sub
graph
|> Graph.add_vertex(abv_ref, abv)
|> Graph.add_vertex(sub_ref, sub)
|> Graph.add_edge(abv_ref, sub_ref)

Where abv and sub are structs, and abv_ref and sub_ref are strings.

Tested like this:

IO.puts("Task now: #{x}")
if rem(x, 1000) === 0 do
  IO.puts("Trying to get all paths from the start to current urls...")

  {d, result} = :timer.tc(fn ->
    Graph.get_paths(graph, start_ref, sub_ref)
  end)

  IO.puts("Got result in #{d}µ (#{div(d, 1_000_000)}s): #{inspect(result)}")
end

Where result is a list of strings, and start_ref is a string as well.


I don't know whether my problem is unique or a general property of the graph library. Is there a benchmark that tests such a large number of relations? Please send links if available; if not, my suggestion is to create one.

bitwalker commented 9 months ago

The get_paths function performs a depth-first traversal of the entire graph in order to discover all the paths, so it is expected that it would get increasingly expensive as the size of the graph increases, but 11k nodes should not be a problem. I've certainly worked with larger graphs with this library.
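For intuition: since get_paths/3 has to enumerate every simple path, its cost is bounded by the number of paths rather than the number of vertices. The following is a minimal sketch (not part of the original report; the module name and sizes are made up) of a fully connected layered graph where that count explodes long before the vertex count gets large:

defmodule PathExplosion do
  # Build a DAG :start -> layer 1 -> ... -> layer `depth` -> :goal, where each
  # layer has `breadth` vertices and every vertex is connected to every vertex
  # in the next layer. The number of simple paths from :start to :goal is
  # breadth^depth.
  def layered_graph(depth, breadth) do
    layers = for layer <- 1..depth, do: for(n <- 1..breadth, do: {layer, n})
    groups = [[:start]] ++ layers ++ [[:goal]]

    edges =
      groups
      |> Enum.chunk_every(2, 1, :discard)
      |> Enum.flat_map(fn [from, to] -> for a <- from, b <- to, do: {a, b} end)

    Graph.new()
    |> Graph.add_vertices(List.flatten(groups))
    |> Graph.add_edges(edges)
  end
end

# 5 layers of 10 vertices already give 10^5 = 100_000 simple paths, and
# Graph.get_paths/3 has to materialize every one of them.
g = PathExplosion.layered_graph(5, 10)
g |> Graph.get_paths(:start, :goal) |> length()

A crawl graph where pages cross-link with hundreds of outgoing edges can reach far larger path counts well before 11k vertices, which would explain an all-night run even if the traversal itself is correct.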

That said, it isn't clear to me what the structure of your graph is: How dense is the graph? What is the depth vs breadth? How long are the paths you expect to observe on average? All of this will help discover where the bug here is.

centuriononon commented 9 months ago

> The get_paths function performs a depth-first traversal of the entire graph in order to discover all the paths, so it is expected that it would get increasingly expensive as the size of the graph increases, but 11k nodes should not be a problem. I've certainly worked with larger graphs with this library.

Okay, I see.

> That said, it isn't clear to me what the structure of your graph is: How dense is the graph? What is the depth vs breadth? How long are the paths you expect to observe on average? All of this will help discover where the bug here is.

I am going to reproduce the same conditions I had in that case. The graph was several layers deep, with a large number of nodes in breadth (roughly 200-1000 per node). So I would say we can try to create a similar structure manually, under controlled conditions, within a test.
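A rough sketch of what such a controlled test could look like (the module name, depth, and fan-out are illustrative placeholders, scaled well below the real 200-1000 children per node so the graph can be built inside a unit test):

defmodule GetPathsScalingTest do
  use ExUnit.Case

  # Each vertex in a layer gets `fanout` string-ref children, mirroring the
  # crawler's "above page -> sub page" edges.
  defp build_tree(depth, fanout) do
    Enum.reduce(1..depth, {Graph.add_vertex(Graph.new(), "root"), ["root"]}, fn layer, {g, parents} ->
      Enum.reduce(parents, {g, []}, fn parent, {acc, children} ->
        new = for n <- 1..fanout, do: "#{parent}/#{layer}-#{n}"

        acc =
          Enum.reduce(new, acc, fn child, acc2 ->
            acc2 |> Graph.add_vertex(child) |> Graph.add_edge(parent, child)
          end)

        {acc, new ++ children}
      end)
    end)
    |> elem(0)
  end

  test "get_paths on a deep, wide graph" do
    graph = build_tree(4, 8)

    # pick an arbitrary leaf (a vertex with no outgoing edges) as the target
    leaf =
      graph
      |> Graph.vertices()
      |> Enum.find(fn v -> Graph.out_degree(graph, v) == 0 end)

    {micros, paths} = :timer.tc(fn -> Graph.get_paths(graph, "root", leaf) end)
    IO.puts("get_paths took #{div(micros, 1_000)}ms and found #{length(paths)} path(s)")

    assert length(paths) >= 1
  end
end

The builder above produces a pure tree, so there is exactly one path per leaf; to reproduce the slowdown the test would also need cross-links between layers, since those are what multiply the number of distinct paths.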

It may take a while, but once I get started and have some initial results, I'll let you know!