OpenWaterAnalytics / epanet-dev

Development Repository for the object-oriented implementation of EPANET
MIT License
65 stars 54 forks source link

Does HashTable need to wrap std::map ? #10

Open samhatchett opened 8 years ago

samhatchett commented 8 years ago

so this doesn't get lost - @LRossman had a suggestion after #5 was merged :

I suggest that we keep the HashTable class as an interface and populate it with the STL map instead of the custom hash table. [...] Making HashTable an interface provides more flexibility for future changes without having to modify the Network module code.

Any discussion on this topic?

I will put forward my own thoughts - that this is probably not needed unless there's actually a high likelihood that we will revisit the hashmap implementation. The STL provides some really great performance for this, and it's all internal to the library anyway - these maps are not now part of the public interface. So it seems like such a minor detail to spend too much time obfuscating.

LRossman commented 8 years ago

Sorry, but I just think it's a better design to put the map in the HashTable class rather than directly in the Network class. The extra code added is minimal, performance shouldn't suffer, it makes for a cleaner Network::indexOf() function and makes it easier to optimize the performance of the map in the future should someone care to pursue that (e.g., by switching to an unordered_map with a custom hash function and memory allocator). I realize it's a "six of one, half dozen of another" issue, but I think we should keep our options as open as possible at this time and not hard wire a specific type of hashing routine into the Network class.

samhatchett commented 8 years ago

Please no apologies! We are here to discuss the relative merits of ideas so that the product is better in the long run. @LRossman I understand your reasoning and motivations. These are good things to think about.

A few more good things to think about:

Changing from map to unordered_map is a single-line change since they have almost the exact same interface, and adding a custom hash algo or allocator would just mean referencing those functions in the template declaration. No need for a wrapper class in that case.

Also, unordered_map offers conveniences not currently offered by the HashMap wrapper like iterators and operations like erase, clear, swap as well as being able to reserve space to speed up inserts. To use these container methods, we would have to wrap all of them as well and expose through the HashMap interface. So in my thinking, using a standard container in this way is actually more future-proof than wrapping since the wrapper just becomes a maintenance issue and does not materially improve the usability.

Now, I think a justifiable reason to use a custom container class would be to combine various useful accessor patterns that aren't normally offered together, as indicated earlier by @michaeltryby when he wrote about a python class that has a hybrid map/list interface. And honestly I could see a need for that. @ttaxon just opened issue #11 related to keeping track of element subtypes. This is a piece of network-domain information that is not tracked by the Network class. This also brings up an important point that right now, the Network class does keep track of a vector of elements and also a HashMap of those same elements, and each element is tasked with remembering its index in its containing vector. This level of information sharing between objects with a containment relationship is a potential maintenance problem.

Seems like we can come up with a better solution, and maybe it looks like Michael's suggestion, or maybe settling on either a vector or a map would suffice. We could conceivably get rid of the vector containers since map offers iterators. I would also suggest that there's no need for each element to track its own index, or even to try to access elements in this way (using indices). That usage pattern just indicates to me that we are not using containers to their full potential.

LRossman commented 8 years ago

@samhatchett node and link indexes are needed to implement network topology functions, hydraulic matrix construction, and various pieces of the hydraulic and water quality solvers. For example, how would you modify the code below from diagnostics.cpp that checks for nodes with no connecting links if there was no "index" property associated with a Node?

    int nodeCount = nw->count(Element::NODE);
    vector<char> marked(nodeCount, 0);
    for (Link* link : nw->links)
    {
        marked[link->fromNode->index]++;
        marked[link->toNode->index]++;
    }
samhatchett commented 8 years ago

Ah, making more sense now. Off the cuff, if it's just the solver that needs indexes, then maybe the elements could be blind to that fact to reduce information-sharing between them.

For topology issues, BOOST has a great header-only graph library. I've used it to do some graph-theoretical analyses on networks - could make sense to offer an adapter of some sort to leverage this library since it offers several search and topological sorting algorithms. I've excerpted/modified some code from the RTX project below, which uses a temporary indexing to build an adjacency list - so to build the BGL representation and then find out if there are any disconnected nodes:

// build the graph rep
boost::adjacency_list <boost::vecS, boost::vecS, boost::undirectedS> Graph;
map<shared_pointer<Node>,int> nodeIndexMap;
for (auto node : _nodes) {
  nodeIndexMap[node] = iNode++;
}
for(auto link : _links) {
  int from = nodeIndexMap[link->from()];
  int to = nodeIndexMap[link->to()];
  boost::add_edge(from, to, Graph); // BGL add edge to graph
}
// find the number of connected components and get the membership list
vector<int> componentMap(_nodes.size());
int nGraphs = boost::connected_components(Graph, &componentMap[0]);
bool hasDisconnectedNodes = (nGraphs > 1);
michaeltryby commented 8 years ago

Simpler is better.

samhatchett commented 8 years ago

@michaeltryby I'm not sure how to interpret. Simpler syntax, algorithm, or dependencies? I think using BGL is extremely simple and lets me use many graph theoretical concepts at a pretty high level. But I'm not sure you would agree?

michaeltryby commented 8 years ago

@samhatchett Lew's solution strikes me as simpler by all the measures you suggest. The code is clear and concise. I can immediately tell exactly what it is doing. And it doesn't introduce any dependencies.

By the same token I can appreciate how your code was a good solution to the problem you were solving. Which was quite different than the one Lew is solving. Indexes are widely used for setting up matrices in the solver code.

If I had to choose, I would get rid of the map before I got rid of the vector in the Network class.

samhatchett commented 8 years ago

I figured that mentioning BOOST would get @michaeltryby's attention 😄

Continuing the Tangent... Please, compare the two implementations. I've tested this with a very large network and the performance hit is negligible. The below assumes that we keep the element index ivar.

First, the original code:

int nodeCount = nw->count(Element::NODE);
vector<bool> marked(nodeCount, 0);
for (Link* link : nw->links)
{
  marked[link->fromNode->index] = true;
  marked[link->toNode->index] = true;
}

int unmarkedCount = 0;
for (int i = 0; i < nodeCount; i++)
{
  try
  {
    if ( !marked[i] )
    {
      unmarkedCount++;
      if ( unmarkedCount <= 10 )
        throw NetworkError(
                           NetworkError::UNCONNECTED_NODE, nw->node(i)->name);
    }
  }
  catch (ENerror const& e)
  {
    nw->msgLog << e.msg;
  }
}
if ( unmarkedCount > 10 )
{
  nw->msgLog << "\n\n NETWORK ERROR 233: no links connected to another "
             << (unmarkedCount - 10) << " nodes ";
}
return (unmarkedCount == 0);

... and the below uses BGL. Use your imagination to add some logging.

// build an adjacency list
boost::adjacency_list <boost::vecS, boost::vecS, boost::undirectedS> Graph;
for(auto link : nw->links) {
  int from = link->fromNode->index;
  int to = link->toNode->index;
  boost::add_edge(from, to, Graph); // BGL add edge to graph
}

// find the number of connected components
vector<int> componentMap(nw->nodes.size());
int nGraphs = boost::connected_components(Graph, &componentMap[0]);

if (nGraphs > 1) {
  // do some logging here. componentMap[node_index] is the graph_index
}
return (nGraphs == 1);

Personally, I find the second to be more concise, more readable, and probably more maintainable.

I'm not going to fight anyone on this - i understand how sacred epanet's zero-dependency tradition is. And in no way am I trying to bring about "dependency hell". However, there is a balance to be struck - BOOST is widely used and respected. At a certain point we must look around and notice all the solved problems that we are not leveraging.