Caruychen / 42Hive-Lem_in

Integrated smart ant colony transport system

Hashtable or vec + binary search for adjacency list and parsing? #6

Closed · crl-n closed this issue 2 years ago

crl-n commented 2 years ago

As I think you said at one point, a hashtable would be useful in this project. I'm writing this issue to summarize some thoughts I've had on the subject. Feel free to comment; I'd love to know what you think, which approach you prefer, or whether you have some other ideas.

There are at least two points in the program where nodes have to be looked up by their aliases. The first occurs during parsing, the second during Edmonds-Karp. I'll focus here on the first as an example, because it seems to me it will involve far more look-ups (in Edmonds-Karp you probably only have to look up the start node; after that you can just follow the pointers contained in the edges).

After parsing the lines containing information about rooms, we need to parse the links/edges. Here, due to the format of the input, we have to look up each node by its alias (not its index). This means two lookups per link/edge, because the input is in the format [alias of node a]-[alias of node b].

Let's assume the parser saves the nodes straight into the adjacency list. The naive approach would be to iterate over the adjacency list and check each node's alias for a match. This solution has a time complexity of O(n) per lookup, where n is the number of rooms, so it would get slow quickly as graph size increases.
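For illustration, here's a minimal sketch of that naive scan in C (the t_room struct and its field names are made up for the example):

```c
#include <string.h>

typedef struct s_room
{
	char	*name;	/* the room's alias from the input */
	/* ... coordinates, edge list, etc. */
}	t_room;

/* Naive lookup: scan the whole list until the alias matches, O(n). */
static int	find_room_linear(const t_room *rooms, int count, const char *alias)
{
	int	i;

	i = 0;
	while (i < count)
	{
		if (strcmp(rooms[i].name, alias) == 0)
			return (i);
		i++;
	}
	return (-1);
}
```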

A better approach would be to use binary search, giving a time complexity of O(log n). This would probably be sufficient to keep the parsing relatively quick with fairly big graphs. This method has the added bonus that the adjacency list could remain a vec, meaning no other data structure would have to be implemented.
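Sketched against the same hypothetical t_room as above, the binary search variant would look something like this (assuming the vec is kept sorted by name):

```c
/* Binary search over a vec sorted by room name, O(log n) per lookup. */
static int	find_room_binary(const t_room *rooms, int count, const char *alias)
{
	int	lo;
	int	hi;
	int	mid;
	int	cmp;

	lo = 0;
	hi = count - 1;
	while (lo <= hi)
	{
		mid = lo + (hi - lo) / 2;
		cmp = strcmp(alias, rooms[mid].name);
		if (cmp == 0)
			return (mid);
		if (cmp < 0)
			hi = mid - 1;
		else
			lo = mid + 1;
	}
	return (-1);
}
```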

The most efficient way to do the look-ups would probably be to use a hashtable. Hashtable lookup is O(1) at best and O(n) at its worst. The key to achieving O(1) is knowing the data in advance, which lets you minimize the number of hash collisions. In our case we know fairly well what our data will look like, although different maps could have very differently named rooms, which could make the hashtable perform differently from map to map. The hashtable approach obviously also means having to implement a hashtable and an efficient hashing function. Basically it is more work, but if done well it would be the most efficient solution.
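For a sense of scale, the hashing function itself doesn't have to be much work; djb2 is a well-known general-purpose string hash that fits in a few lines (just one common option, not a claim about what this project should use). The bucket index is then hash % capacity:

```c
/* djb2 string hash (Dan Bernstein's classic multiply-by-33 hash). */
static unsigned long	hash_djb2(const char *str)
{
	unsigned long	hash;

	hash = 5381;
	while (*str)
		hash = ((hash << 5) + hash) + (unsigned char)*str++;
	return (hash);
}
```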

Here's a good Stack Overflow discussion on hashtable look-up time complexity.

Caruychen commented 2 years ago

These are really interesting suggestions! I hadn't considered using binary search before; it could potentially work just as well as, or better than, a hash table.

First though, I would say that we only need this hash or binary lookup once, for the parsing stage. It won't be necessary for the Edmonds-Karp algorithm when solving the problem (assuming a valid map). Here's why:

Think about what we actually use the hash table for: as a quick way to map between a name and a unique index integer. Since our graph data structure already stores each room's index number in its corresponding "edge" data structure, Edmonds-Karp can bypass any name-to-number lookups entirely. Hopefully that all makes sense!
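To illustrate the point (field names hypothetical), once each edge stores the destination room's index, the BFS inside Edmonds-Karp never needs a name:

```c
typedef struct s_edge
{
	int				to;		/* index of the destination room in the vec */
	struct s_edge	*next;	/* next edge out of the same room */
}	t_edge;

/* In the BFS, neighbours are reached by index alone, e.g.:
**
**     t_edge *e = rooms[cur].edges;
**     while (e)
**     {
**         enqueue(e->to);   // no name lookup anywhere
**         e = e->next;
**     }
*/
```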

Okay, regarding the binary search: I think it's indeed a great idea, and I'm confident it won't be too difficult to do. So if a hashtable is too much, we can attempt binary search instead. I imagine either one could exist as a modular component that can be easily inserted into or removed from the project.

Regarding hash performance, I think the question of knowing the data in advance has more to do with the size of the data (the number of rooms). From my research, and the link you showed me, the main performance issue arises when the input data exceeds the given capacity of the hash table.

I remember seeing that most simple hashing algorithms need to assume a specific capacity. When the data grows larger, some hashing strategies "re-hash", which you've probably already seen mentioned. It's an expensive process (it's like reallocating memory for a bigger array, but more complicated).
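Roughly, that re-hash step looks like this with separate chaining (struct and function names are invented for the sketch; hash_djb2 is the example hash from earlier). Every bucket index is hash % capacity, so changing the capacity invalidates all of them and every entry has to be re-inserted:

```c
#include <stdlib.h>

typedef struct s_entry
{
	char			*key;
	int				value;
	struct s_entry	*next;	/* chaining for collisions */
}	t_entry;

typedef struct s_htable
{
	t_entry	**buckets;
	size_t	capacity;
}	t_htable;

/* Double the bucket array and re-insert every entry. */
static int	rehash(t_htable *ht)
{
	t_entry	**old;
	t_entry	*e;
	t_entry	*next;
	size_t	old_cap;
	size_t	i;
	size_t	idx;

	old = ht->buckets;
	old_cap = ht->capacity;
	ht->capacity *= 2;
	ht->buckets = calloc(ht->capacity, sizeof(t_entry *));
	if (!ht->buckets)
		return (-1);
	i = 0;
	while (i < old_cap)
	{
		e = old[i];
		while (e)
		{
			next = e->next;
			idx = hash_djb2(e->key) % ht->capacity;
			e->next = ht->buckets[idx];
			ht->buckets[idx] = e;
			e = next;
		}
		i++;
	}
	free(old);
	return (0);
}
```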

I also recall from my last bit of research into hash tables seeing something called "linear probing": https://www.geeksforgeeks.org/hashing-set-3-open-addressing/. I think it's a way of resolving collisions within a fixed-size table (on a collision you just step to the next slot) rather than a way of growing one, but it's been a while, so I'll need to look into it again later.
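For reference, a lookup under linear probing would go something like this sketch (the t_slot layout and names are invented; hash_djb2 is the earlier example hash):

```c
#include <string.h>

typedef struct s_slot
{
	char	*key;	/* NULL marks an empty slot */
	int		value;
}	t_slot;

/* Linear probing: on a collision, step to the next slot (wrapping
** around at the end) until the key or an empty slot is found. */
static int	probe_lookup(const t_slot *slots, size_t cap, const char *key)
{
	size_t	i;
	size_t	start;

	start = hash_djb2(key) % cap;
	i = start;
	while (slots[i].key)
	{
		if (strcmp(slots[i].key, key) == 0)
			return (slots[i].value);
		i = (i + 1) % cap;
		if (i == start)
			break ;
	}
	return (-1);
}
```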

Otherwise, we can always divide and conquer ⚔️.

crl-n commented 2 years ago

Great to hear your thoughts on this. Also, I hope you are having a great vacation! 🏖 You probably know this already, but remember that you don't have to respond to these issues while you are away if you don't want to; they're not urgent at all and can wait until you come back!

Binary search is fairly easy to implement. It does require us to sort the vec as well, so we'd have to implement a sorting algorithm, preferably something more efficient than bubble sort. In any case, it shouldn't be that difficult to do in the end.
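For example, a plain recursive quicksort over the room names would do (reusing the hypothetical t_room from the sketches above); it's O(n log n) on average and only runs once:

```c
#include <string.h>

static void	swap_rooms(t_room *a, t_room *b)
{
	t_room	tmp;

	tmp = *a;
	*a = *b;
	*b = tmp;
}

/* Lomuto-partition quicksort on room names. */
static void	sort_rooms(t_room *rooms, int lo, int hi)
{
	int	i;
	int	j;

	if (lo >= hi)
		return ;
	i = lo - 1;
	j = lo;
	while (j < hi)
	{
		if (strcmp(rooms[j].name, rooms[hi].name) < 0)
			swap_rooms(&rooms[++i], &rooms[j]);
		j++;
	}
	swap_rooms(&rooms[i + 1], &rooms[hi]);
	sort_rooms(rooms, lo, i);
	sort_rooms(rooms, i + 2, hi);
}
```

One thing to keep in mind: sorting changes the rooms' indices, so the sort has to happen after the room lines are parsed but before any edges (which store indices) are created.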

About the question of what the hash table is for: yeah, it makes sense that we don't need it for looking up nodes in Edmonds-Karp. At first I thought we might need it for the source and sink nodes, but obviously that can be avoided by just saving the source and sink indices instead of their aliases.

I'm thinking (and I think you may be suggesting this) that we could build a hash table around the vec. Then we would only need to use the hash table interface during parsing; in Edmonds-Karp we can just use the vec as a vec.
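Something like this interface sketch is what I have in mind (all names hypothetical, with t_room and t_htable as in the earlier sketches): the table only maps aliases to indices into the vec, so the rooms themselves live in one place:

```c
typedef struct s_graph
{
	t_room		*rooms;		/* the vec: owns the room data */
	int			count;
	t_htable	*by_name;	/* alias -> index, used only during parsing */
}	t_graph;

/* Parsing resolves "a-b" link lines through the table; Edmonds-Karp
** never touches by_name and just indexes rooms directly. */
int	ht_set(t_htable *ht, const char *key, int index);
int	ht_get(const t_htable *ht, const char *key);	/* -1 if absent */
```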

So I think divide and conquer could be a good idea. We could implement both and then compare their performance, which would be very interesting. I think I will start with the hash table, because I've actually been wanting to implement one for quite some time now. What do you think: should I add it to the project's libft, or keep it in srcs along with the rest of the project code?

Anyways, I will start looking into the hash table soon-ish, when I get the parser more or less finished.