fkling / JSNetworkX

Build, process and analyze graphs in JavaScript (port of NetworkX)
https://felix-kling.de/jsnetworkx/

Support generic objects as nodes #3

Closed jamiefolson closed 11 years ago

jamiefolson commented 11 years ago

Since this is one of the main differences with the Python library, I thought I'd start an issue.

jamiefolson commented 11 years ago

There are lots of ways this could be done.

My vote would be to accept a key function on graph creation:

var G = new jsnx.Graph(function(n) { return n.id; });
G.add_node({id: 1});

This would currently have to be accomplished with:

var G = new jsnx.Graph();
var n = {id: 1};
n.toString = function() { return this.id; };
G.add_node(n);

for every node

Or you would make everything a specific object:

var node = function() {};
node.prototype.toString = function() { return this.id; };

Allowing a key function would create one massive problem: the [ accessor could no longer be supported, because you'd need to call the key function, but [ cannot be overridden. IMO, the API should be changed to make jsnx.Graph.node a function for this very reason. In JavaScript, making [ part of the API makes the implementation part of the API, because operators cannot be overloaded. Or I'd at least go for adding and preferring a node() function. Since functions are objects, Graph.prototype.node can be a function that also carries the node data as properties.
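A minimal sketch of this last idea, assuming a hypothetical Graph constructor and a made-up node.set helper (not JSNetworkX's actual API): since functions are objects, the node accessor can be a function and simultaneously carry the node data as properties.

```javascript
// Hypothetical sketch: node() looks a node up via the key function,
// while the same function object doubles as a plain hash of node data.
function Graph(keyfunc) {
  var data = {};
  var node = function(n) {
    return data[keyfunc(n)];
  };
  // Illustrative setter: stores the value under the computed key and
  // also exposes it as a property on the function object itself.
  node.set = function(n, value) {
    data[keyfunc(n)] = value;
    node[keyfunc(n)] = value;
  };
  this.node = node;
}
```

With this, both `G.node({id: 1})` and the old-style `G.node[1]` resolve to the same data.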

fkling commented 11 years ago

I like the idea of a key function.

However, I cannot foresee what impact this has on the rest of the code. Sure, such a fundamental design decision should be made early in the development process, but I hesitate to make any big changes now (I'm more focused on the visualisation right now). I also think you could easily maintain such a mapping outside the graph, or add the corresponding object as data to the graph.

I'm happy about any suggestions in this direction though, and seeing someone implement this on a new branch would be great!

jamiefolson commented 11 years ago

Okay, sounds great. I'm exploring it because it'd simplify some of the things I want to do. I'll play around a bit locally and see what it looks like.

In terms of API changes, the '[' accessors could no longer be used with this, because of the aforementioned inability to overload operators.

jamiefolson commented 11 years ago

As an update, I believe I've got all the basics working, but there are a lot of tests. I've eliminated the hash-style node access for now, since it was hurting my grey cells to store the adjacency in a function. I'm not sure how big of a problem you'd consider this, but it requires changing accesses from:

G.node['Bob']
G.adj['Bob']['Bill']

To

G.node('Bob')
G.adj('Bob').get('Bill')

Which, while somewhat more verbose, I actually prefer, since it decouples the implementation from the API and is at least as clear to my eyes.

fkling commented 11 years ago

Sounds great! I actually just came across a situation in the library itself, where generic object support would be at least a bit cleaner.

I also gave it some thought, and I think the whole change should not be too complicated.

I'm curious about your implementation, are you using goog.structs.Map or something similar?

The goog.structs.Set class has a private method to compute a key for any primitive value or object. I guess it could be very useful:

/**


(Markdown does not seem to work for email responses :-/)

fkling commented 11 years ago

Ok, I was giving the whole thing some thought and came up with this key function (which is mostly based on goog.structs.Struct.getKey_):

function getHash(val, opt_keyfunc) {
  var type = goog.typeOf(val);
  if (type === 'object' || type === 'function') {
    if (val.toString !== Object.prototype.toString) {
      return val.toString();
    }
    else if (opt_keyfunc) {
      return opt_keyfunc(val);
    }
    return goog.getUid(/** @type {Object} */ (val));
  }
  else if (type === 'array') {
    var key = 'a';
    for (var i = 0, l = val.length; i < l; i++) {
      key += getHash(val[i], opt_keyfunc); // recurse with the same key function
    }
    return key;
  }
  else {
    return type.substr(0, 1) + val;
  }
}

The ideas for converting objects and arrays were the following:

In this example, arrays are treated like tuples in Python, i.e. two different arrays with the same elements (e.g. [1,2]) are treated as the same node. This has the advantage that it is easy to work with such data, but it makes it more difficult to distinguish between a list of nodes and a single node. Alternatively, we could provide a tuple implementation, e.g. jsnx.t(1,2), which gets treated in a special way by the graph methods.
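A rough sketch of what such a tuple helper could look like; jsnx.t does not exist, so a standalone t() with a made-up hash format is used here. Two tuples with the same elements produce the same string key, while remaining distinguishable from plain arrays of nodes.

```javascript
// Hypothetical tuple wrapper: hashes by content, but is not an Array,
// so graph methods could tell a tuple node apart from a list of nodes.
function Tuple(elements) {
  this.elements = elements;
}
// Equal elements -> equal string key (the 'tuple(...)' format is illustrative).
Tuple.prototype.toString = function() {
  return 'tuple(' + this.elements.join(',') + ')';
};

function t() {
  return new Tuple(Array.prototype.slice.call(arguments));
}
```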


In addition, I was thinking about how the graph classes would have to be changed internally. I think I came up with a good solution, but I wanted to see first what you've got so far.

jamiefolson commented 11 years ago

Sorry to keep you waiting. I spent an excess of time first making sure I could get things set up for all the graph classes, then trying to update the unit tests. You can find what I've got over in my fork.

My strategy was to use a modified version of goog.structs.Map that supports various ways of specifying a hash function.

The main advantage of using an actual Map class is that goog.structs.Map already contains the mechanics for maintaining an internal list of keys/hashes, which enables faster iteration. I'm also working on a general-purpose hashmap for objects, so it was convenient. Unfortunately, this resulted in an API change. I hadn't realized what a big deal this would be until I saw how many unit tests you've translated.

I thought I'd push what I've got right now to my fork and let you check it out. It could make sense to actually move the Map internals into the graph class(es), since there's really only one set of keys/hashes. Does that sound cleaner? That way you could preserve the [ access by continuing to expose the objects, simply containing them in a function object instead of a {} object.

fkling commented 11 years ago

Thanks for the update and all your effort!

I had a look at your code and it comes quite close to what I imagined. I also would have used maps based on goog.structs.Map. However, after some more research I found out that ES6 will introduce native maps and sets. Since the node maps would be exposed to the client, IMO it makes sense to strive for an interface that is forward compatible with ES6 maps.

As I already said on the other thread, I'd prefer not to restructure the code too much if it does not add missing features.

I think, based on your code, I have a fairly good idea of what the end result could look like without making too many drastic changes. You stirred my interest (thank you), and I now think that this is actually an important issue, so I will actively look into it. Since I'm more familiar with the code base, it might be a bit easier for me to make the necessary changes.

jamiefolson commented 11 years ago

I tend to agree. I had not quite appreciated how much pain there would be in changing the API like that. Since there appears to already be an implementation of ES6 Map, it might make sense to use that. Anyway, I'll go back to translating some of the centrality measures. Hopefully that will be more useful.

I do think you'll want a wrapper around the ES6 Map. Since its definition of equality is limited to identity, in many cases it'd still be useful to map objects to hashes before inserting them. For example, I'm using d3 to draw things, and I'd like to store the actual node data as the JSNetworkX node so that the JSNetworkX nodes are also the d3 data. However, it's my understanding that any time I update the data from the server, I'd have new objects and would be unable to retrieve anything if objects are used as keys in an ES6 Map.
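A minimal sketch of such a wrapper, assuming a user-supplied hash function (the HashMap name and its interface are illustrative, not from the library): keys are hashed to strings before they reach the underlying ES6 Map, so two distinct objects with the same content (e.g. fresh objects from a server response) resolve to the same entry.

```javascript
// Hypothetical wrapper: the ES6 Map only ever sees string hashes,
// so key equality becomes content-based instead of identity-based.
function HashMap(hash) {
  this._hash = hash;
  this._map = new Map();
}
HashMap.prototype.set = function(key, value) {
  this._map.set(this._hash(key), value);
  return this;
};
HashMap.prototype.get = function(key) {
  return this._map.get(this._hash(key));
};
HashMap.prototype.has = function(key) {
  return this._map.has(this._hash(key));
};
```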

fkling commented 11 years ago

So, it's been a while, but I made some progress (I was busy with unrelated things as well). I made a performance comparison of the current version and the generic node version with respect to graph creation. It turns out that with generic nodes, graph creation can be up to 50% slower (tested with various numbers of nodes, up to one million). Of course this is not great, but I expected it. The question now is whether the performance tradeoff is worth it. Here are my thoughts:

Pros:

Cons:

Further thoughts:


Based on this, I tend to continue implementing support for generic nodes. Maybe I'll find a way to create a "faster" implementation of the library for people who care more about speed than about node values (by just changing the map implementation).

Any thoughts on this?

jamiefolson commented 11 years ago

It looks like you haven't committed the changes yet, so I thought I'd just ask: How did you choose to implement it?

I was being really (probably excessively) aggressive about flexibility, which led to using a Map class with get/set methods. This route substantially broke the API and all the tests, requiring massive (but aesthetic) modifications of the codebase. It would, however, seem to be the most straightforward way of being compatible with the Harmony-style native maps. My experience with pure-JS maps is that you'd really want to be able to leverage the native maps when available.

If you're using an actual map class, which one? An alternative would be to switch to array-based adjacency/node properties, with an internal map from node "object" (i.e. Object/string/number) to index.
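To make the alternative concrete, here is a rough sketch under illustrative names (IndexedGraph and its methods are not JSNetworkX's API): node data and adjacency live in arrays, and an internal object maps each node's string key to its array index.

```javascript
// Hypothetical array-based storage: nodes and adjacency are indexed by
// position; a key -> index object translates node keys to positions.
function IndexedGraph() {
  this._index = {}; // node key -> array index
  this._nodes = []; // index -> node value
  this._adj = [];   // index -> set of neighbor keys
}
IndexedGraph.prototype.addNode = function(n) {
  var key = String(n);
  if (!(key in this._index)) {
    this._index[key] = this._nodes.length;
    this._nodes.push(n);
    this._adj.push({});
  }
  return this._index[key];
};
IndexedGraph.prototype.addEdge = function(u, v) {
  var iu = this.addNode(u);
  var iv = this.addNode(v);
  this._adj[iu][String(v)] = true; // undirected: store both directions
  this._adj[iv][String(u)] = true;
};
IndexedGraph.prototype.hasEdge = function(u, v) {
  var iu = this._index[String(u)];
  return iu !== undefined && this._adj[iu][String(v)] === true;
};
```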

Are you doing something similar or did you find a better way to do things?

If your changes decrease performance by 50% when using primitives, it could be worth keeping this as a branch until native maps are a bit more widely adopted. It's my understanding that in Chrome, for example, Map is available only behind a special configuration option.

fkling commented 11 years ago

I was being really (probably excessively) aggressive about flexibility, which led to using a Map class with get/set methods.

That's exactly what I did. Inside the Map class I have just two objects, which map hash -> node and hash -> node value. The interface is actually a super-set of Harmony's interface; I added some handy methods (like getting the size of the map).
Creating a hash for each node is one reason for the performance decrease. The other is, I assume, just the additional overhead of using the Map interface instead of plain objects.

In any case, I like the fact that the storage is now "abstracted away" from the actual graph class. It makes it easy to change implementations in the future (as you already said).

I think I found a good compromise: I implemented another Map class which just distinguishes between numbers and other values. That is, it works like the plain objects before, but is able to keep integer nodes as integers. That's the most important aspect for me, actually. It's still a bit slower (because of the interface overhead, I guess), but closer to the plain-object solution than the generic map.

Now I just have to think about a good way to let the user choose between different "node storage backends".

it could be worth keeping as a branch until native maps are a bit more widely adopted.

I honestly think that maintaining two versions for each upcoming feature (mainly algorithms, but also generators) is too much work. At least I don't want to do that ;)

Thanks for your opinion!


I'm pretty busy with other things right now, so I cannot say how long it will take me to finish this. I also started restructuring the drawing API and I would like to release both breaking changes at about the same time.

fkling commented 11 years ago

OK, I did some more testing. I was able to increase performance simply by changing the way I test for property existence. It's really interesting: it turns out that obj.hasOwnProperty(prop) and prop in obj are a lot(!) slower than typeof obj[prop] !== 'undefined', even if the property does not exist (http://jsperf.com/hasownproperty-vs-in/2). I assume browsers are aggressively optimizing property access.

So my initial map implementation could probably be sped up by changing the property existence test. However, I don't like the fact that every value gets hashed. I still think (without tests though) that this has a performance impact.
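The faster existence test amounts to something like the helper below (the `has` wrapper is illustrative). One caveat worth noting: unlike hasOwnProperty or `in`, it cannot distinguish a missing property from one explicitly set to undefined, and it also sees inherited properties.

```javascript
// typeof-based existence check: fast, but treats an explicit
// `undefined` value the same as an absent property.
function has(obj, prop) {
  return typeof obj[prop] !== 'undefined';
}
```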

My final map implementation for now works as follows: Imagine we want to set the key k with value v. The map distinguishes between three types of keys:

  1. numbers
  2. strings
  3. everything else

Each of these types is stored in its own object (k -> v). Since we have a dedicated object for numbers, we know that we have to cast every property name back to a number when we iterate over the keys. For strings we can just take the property name as is.

For all other types (mostly objects and arrays), there are actually two objects: one holding the value, sk -> v, where sk is the stringified version of k, and one, sk -> k, holding the original value of the key.

So, what's the benefit? Since the string representation of an array is just the concatenation of the string representations of its elements, each array with primitive values basically acts like a tuple.

For objects, one just has to override the toString method to create a unique hash for an object and they can be properly used as nodes.
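The three-bucket scheme described above could be sketched roughly as follows (NodeMap and its method names are illustrative, not the library's actual class):

```javascript
// Hypothetical sketch of the map: numbers, strings, and everything else
// each get their own storage object; for the last bucket both the value
// and the original key are kept under the stringified key.
function NodeMap() {
  this._numbers = {}; // numeric keys (cast back with +key on iteration)
  this._strings = {}; // string keys, used as-is
  this._values = {};  // stringified key -> value
  this._keys = {};    // stringified key -> original key object
}
NodeMap.prototype.set = function(k, v) {
  if (typeof k === 'number') {
    this._numbers[k] = v;
  } else if (typeof k === 'string') {
    this._strings[k] = v;
  } else {
    var sk = String(k); // relies on a meaningful toString for objects
    this._values[sk] = v;
    this._keys[sk] = k;
  }
};
NodeMap.prototype.get = function(k) {
  if (typeof k === 'number') return this._numbers[k];
  if (typeof k === 'string') return this._strings[k];
  return this._values[String(k)];
};
```

Because numbers live in their own bucket, the number 1 and the string '1' stay distinct keys, and arrays with the same primitive elements behave like equal tuples.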

While I liked the idea of having extra mapping functions or a special __hash__ property, by just sticking to native JS behaviour the code can be kept simple while still providing enough flexibility.


Of course this implementation is still slower than the original implementation (for primitives), but I believe it's not noticeable in real-world use cases.

By profiling the code I noticed that the garbage collector was running quite often, so this is probably the next point where I will try to optimize the code.

fkling commented 11 years ago

To give a status update: I updated all the graph classes and their tests. I also used the opportunity to switch to a different testing library (mocha), which should make it easier to write test cases (also with respect to Node.js support). I will now update the code and tests for the generators and algorithms.

fkling commented 11 years ago

Another status update: I updated all files and tests. I'm now finishing the changes to the visualization API and will then publish the changes as a new, breaking release (I could release the generic node changes now, but I don't want to make two breaking releases in succession).

fkling commented 11 years ago

Since I haven't found a lot of time lately to work on the drawing API, I released the changes for generic nodes now.

The documentation has been updated accordingly, including a migration guide: http://felix-kling.de/JSNetworkX/documentation.html.