1.3 Specifications - Githubissues

gephi / gexf

GEXF Format Specifications

https://gexf.net/

Creative Commons Attribution 4.0 International


1.3 Specifications #13

Closed mbastian closed 2 years ago

mbastian commented 2 years ago

Attempt to create clean 1.3 specifications. A changelog is provided, but for discussion purposes I'll post it here too:

Notes:

The only source files are the RNC files; the XSD and RNG files are auto-generated by the build script.
I couldn't find an option in trang to set the XML version to 1.1. I know this was a popular request, but I'm afraid we'll need to stick to 1.0 unless we find a solution.

Possible improvements

I'm not so familiar with modularity concepts in Relax NG. I tried to simplify what we had before, but I'm sure it could be improved further.
Some data types, like hex color, are just strings and could probably use a more specific type (suggestions welcome).

Changelog

Add kind attribute on edge to support multi-graph (i.e. parallel edges)
The edge weight is now a double instead of a float
Add xsd:long as a possible idtype on <graph>
Add new attribute types bigdecimal, biginteger, char, short and byte
Add new list attributes like listboolean or listinteger for each atomic type
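
The kind attribute from the list above can be illustrated with a small Python sketch. It builds two parallel edges between the same pair of nodes and groups them by endpoints. The node ids, kind values and weights are made up for illustration, and the fragment is not validated against the 1.3 schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: two parallel edges between the same nodes,
# told apart by the new `kind` attribute. All values are illustrative.
edges = ET.Element("edges")
ET.SubElement(edges, "edge", source="a", target="b", kind="friend", weight="1.5")
ET.SubElement(edges, "edge", source="a", target="b", kind="coworker", weight="0.25")


def parallel_groups(edges_element):
    """Group the kinds of all edges by their (source, target) pair."""
    groups = {}
    for edge in edges_element.iter("edge"):
        key = (edge.get("source"), edge.get("target"))
        groups.setdefault(key, []).append(edge.get("kind"))
    return groups


groups = parallel_groups(edges)
print(groups)
```

Without kind, a consumer that keys edges on source and target alone would have to collapse or reject the second edge.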

Dynamics

Add a timezone attribute on <graph> to use as a timezone in case it's omitted in the element timestamps
The open-interval attributes startopen and endopen are removed. Use the regular inclusive start and end instead
Remove the mode, start and end attributes on <attributes> as they were redundant with the <graph> attributes

Timestamp support

Add the ability to represent time with single timestamps instead of intervals. We want feature parity between the two time representations but note they can't be mixed.

Add a timerepresentation enum in <graph> with either interval (default) or timestamp to configure the way the time is represented
Add timestamp attribute to <node>, <edge>, <spell> and <attvalue> to support this new time representation
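
Since the two time representations can't be mixed, a consumer might want to flag elements that contradict the graph's declared timerepresentation. The following Python sketch is one possible check, not part of the specification. It assumes that "not mixed" means an interval graph carries no timestamp attribute and a timestamp graph carries no start or end on its elements, and it ignores the list-valued timestamps and intervals attributes.

```python
import xml.etree.ElementTree as ET

# Elements that can carry time information according to the changelog.
TIMED_TAGS = {"node", "edge", "spell", "attvalue"}


def local(tag):
    """Strip an XML namespace prefix such as '{http://...}node'."""
    return tag.rsplit("}", 1)[-1]


def mixed_time_elements(graph):
    """Return elements whose time attributes contradict the graph's setting."""
    representation = graph.get("timerepresentation", "interval")
    # Assumption: each representation forbids the other one's attributes.
    forbidden = {"timestamp"} if representation == "interval" else {"start", "end"}
    return [
        el for el in graph.iter()
        if local(el.tag) in TIMED_TAGS and forbidden & set(el.attrib)
    ]


# Illustrative document: node 1 uses an interval inside a timestamp graph.
sample = ET.fromstring(
    '<graph mode="dynamic" timerepresentation="timestamp">'
    '<nodes>'
    '<node id="0" timestamp="2020"/>'
    '<node id="1" start="2019" end="2021"/>'
    '</nodes>'
    '</graph>'
)
offenders = [el.get("id") for el in mixed_time_elements(sample)]
print(offenders)
```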

Alternative to spell elements

Add a timestamps attribute to <node> and <edge> to represent a list of timestamps without having to use spells
Similarly, add an intervals attribute to <node> and <edge>

New slice mode

The optional mode attribute on <graph> now has an additional slice value, in addition to static and dynamic. With slice, the expectation is that the <graph> also has either a timestamp or a start / end interval.

Add a timestamp attribute on <graph> to characterise the slice this graph represents
Change the meaning of the start and end attributes on <graph> to characterise the slice instead of the time bounds, which should rather be inferred

Viz

Add hex attribute on <color> so it can support values like #FF00FF
The z position is no longer required
Dynamic attributes like start, end or child elements <spells> are no longer supported for viz attributes. To represent viz attributes over time, an alternative is to create multiple graphs each representing a slice
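
Because the hex value is typed as a plain string, a parser may want to validate it itself. Here is a minimal Python sketch, assuming only the six-digit #RRGGBB form shown above is accepted. The function name is hypothetical, and the schema may allow other forms.

```python
import re

# Assumption: only the six-digit "#RRGGBB" form is valid.
HEX_COLOR = re.compile(r"#[0-9A-Fa-f]{6}")


def parse_hex_color(value):
    """Convert '#RRGGBB' into an (r, g, b) tuple of integers in 0..255."""
    if HEX_COLOR.fullmatch(value) is None:
        raise ValueError(f"not a #RRGGBB color: {value!r}")
    return tuple(int(value[i:i + 2], 16) for i in (1, 3, 5))


print(parse_hex_color("#FF00FF"))  # (255, 0, 255)
```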

paulgirard commented 2 years ago

Thank you @mbastian for moving this forward.

Off the top of my head, I have two issues with the GEXF format. I am sharing them here without any opinion on whether they should be included in the new spec or not. I am just sharing what I experienced.

viz attributes' origin

In my use cases, 90% of the time viz parameters are actually derived from attributes. It would be very nice to be able to declare something like an origin in viz properties, not only to avoid repetition (partition color) but also to document where those viz parameters come from. One important difficulty with dynamic viz attributes concerns quantitative parameters (ranking in Gephi), which are often applied through a transformation/scale method (spline in Gephi). For qualitative parameters (partition in Gephi) it's much easier: the viz parameter could be added to the origin attribute declaration, and the node/edge viz attribute could be declared as a reference to the viz property of the origin node/edge attribute. Another, simpler possibility would be to drop the dynamic declaration and add a viz attribute, meant more for documentation, that simply indicates whether and which node/edge attributes were used to compute a viz value.

To sum up, the purpose of such a feature is to help draw legends in any GEXF visualization.

I am aware that this feature would probably require changing some internal Gephi features too.

optional ID for edges

A long time ago I stumbled upon an issue with the networkx implementation of Gexf: https://github.com/networkx/networkx/issues/1296. This was actually linked to ids on edges, which raises this question: why are edge ids mandatory? I totally understand that some use cases require identifying edges, but a gexf does not actually need ids on edges, as edges are never referred to inside a GEXF (I think?). Wouldn't dropping the requirement for edge ids avoid issues related to id collisions when updating a GEXF, while still allowing the use cases that need ids?

These are rough comments that would need more in-depth work. I would be glad to contribute in these directions if the maintainers and the community think there might be a place/need for it.

mbastian commented 2 years ago

Thank you @mbastian for moving this forward.

My pleasure, long overdue!

viz attributes' origin

In my use cases, 90% of the time viz parameters are actually derived from attributes. It would be very nice to be able to declare something like an origin in viz properties, not only to avoid repetition (partition color) but also to document where those viz parameters come from. One important difficulty with dynamic viz attributes concerns quantitative parameters (ranking in Gephi), which are often applied through a transformation/scale method (spline in Gephi). For qualitative parameters (partition in Gephi) it's much easier: the viz parameter could be added to the origin attribute declaration, and the node/edge viz attribute could be declared as a reference to the viz property of the origin node/edge attribute. Another, simpler possibility would be to drop the dynamic declaration and add a viz attribute, meant more for documentation, that simply indicates whether and which node/edge attributes were used to compute a viz value.

To sum up, the purpose of such a feature is to help draw legends in any GEXF visualization.

I am aware that this feature would probably require changing some internal Gephi features too.

Very interesting idea, thanks for sharing. I had not appreciated how important it might be to connect the viz attributes to the element attributes, and what it could enable. Having the ability to eventually draw a legend just based on the GEXF seems like an attractive idea indeed. I would love to hear more opinions on how we might implement this.

An "origin" attribute could indeed be appropriate. I have a bit of trouble following the rest of your suggestion, so maybe you can paste some examples of how you would see this implemented.

optional ID for edges

A long time ago I stumbled upon an issue with the networkx implementation of Gexf: networkx/networkx#1296. This was actually linked to ids on edges, which raises this question: why are edge ids mandatory? I totally understand that some use cases require identifying edges, but a gexf does not actually need ids on edges, as edges are never referred to inside a GEXF (I think?). Wouldn't dropping the requirement for edge ids avoid issues related to id collisions when updating a GEXF, while still allowing the use cases that need ids?

Makes a lot of sense. In fact, the GEXF importer in Gephi doesn't even throw a warning if you omit the edge ids. It creates them dynamically if they are missing. The processor also seems to rely on source+target+type when merging so even in those more complex cases the edge id seems superfluous. I'll need to double check that but if that's the case I agree it could be made optional in the spec.
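
To make the idea concrete, here is a Python sketch of how a parser could derive a key for an edge whose id is omitted, using source, target and type. This is only an illustration of the principle, not Gephi's actual implementation. The function name and the "auto"/"id" key layout are hypothetical.

```python
def edge_key(edge, default_type="directed"):
    """Return a hashable key for an edge given as a mapping of XML attributes.

    An explicit id wins; otherwise the key is derived from
    source + target + type (an assumption for this sketch).
    """
    edge_id = edge.get("id")
    if edge_id is not None:
        return ("id", edge_id)
    edge_type = edge.get("type", default_type)
    source, target = edge.get("source"), edge.get("target")
    if edge_type == "undirected":
        # An undirected edge a-b is the same edge as b-a.
        source, target = sorted((source, target))
    return ("auto", source, target, edge_type)


print(edge_key({"source": "b", "target": "a", "type": "undirected"}))
print(edge_key({"id": "e1", "source": "a", "target": "b"}))
```

With parallel edges distinguished by kind, such a derived key would presumably also need to include the kind value.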

Yomguithereal commented 2 years ago

Hello @mbastian, one specific thing I don't completely understand about gexf (it might just be that I did not find the specification clearly stating this): is there an enforced order of tags in a gexf file? What I mean is, is it stated somewhere that the model must be declared before the nodes, and then the edges? I am asking because I have already found gexf files in the wild where the order was shuffled, and I remember this breaking Gephi, for instance (this might not be the case anymore).

One other pain point, off the top of my head, was the occasional confusion between the id, the title and the for of attributes, which sometimes makes it difficult to choose, from the parser's perspective, between the attribute id and title. For instance, when converting to JSON, one would typically like a useful named id, not an incremental integer; this can sometimes be found in id, sometimes in title, and sometimes not at all, because title is a human-readable label and id is incremental.

mbastian commented 2 years ago

Hello @mbastian, one specific thing I don't completely understand about gexf (it might just be that I did not find the specification clearly stating this): is there an enforced order of tags in a gexf file? What I mean is, is it stated somewhere that the model must be declared before the nodes, and then the edges? I am asking because I have already found gexf files in the wild where the order was shuffled, and I remember this breaking Gephi, for instance (this might not be the case anymore).

Sorry what do you mean by model here?

One other pain point, off the top of my head, was the occasional confusion between the id, the title and the for of attributes, which sometimes makes it difficult to choose, from the parser's perspective, between the attribute id and title. For instance, when converting to JSON, one would typically like a useful named id, not an incremental integer; this can sometimes be found in id, sometimes in title, and sometimes not at all, because title is a human-readable label and id is incremental.

Makes sense. Normally the title shouldn't be used as a reference at all. To address this problem, I propose this alternative way of representing attribute values, which I think would be more JSON friendly:

In 1.2, we do require attributes to be defined:

<graph defaultedgetype="directed">
  <attributes class="node">
    <attribute id="0" title="url" type="string"/>
  </attributes>
  <nodes>
    <node id="0" label="Gephi">
      <attvalues>
        <attvalue for="0" value="http://gephi.org"/>
      </attvalues>
    </node>
  </nodes>
</graph>

In 1.3, we could support an alternative that omits the attributes declaration and simply lists the id and type on the <attvalue> element:

<node id="42" label="node A">
  <attvalues>
    <attvalue id="url" type="string" value="http://gephi.org"/>
  </attvalues>
</node>

What do you think? For the parsers it most likely wouldn't be a large change and it would be a lot more JSON friendly.

Yomguithereal commented 2 years ago

Sorry what do you mean by model here?

By model I mean the attributes definition. What I mean is that implicitly a gexf file should be ordered thusly:

graph
  attributes
  nodes
  edges

Which makes sense, especially if you need to stream the xml file for some reason. But I am unsure whether this order is enforced by the specs. And I have already seen weird things in the wild such as:

graph
  edges
  attributes
  nodes

for instance, produced by some xml writers that work on unordered key-value structure conversion. I think this order was breaking Gephi import at some point (it might still be the case).
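
A consumer can detect the shuffled case before importing. Below is a Python sketch that checks whether the direct children of a graph element appear in the attributes, nodes, edges order. The helper names are hypothetical, and elements other than these three are simply ignored.

```python
import xml.etree.ElementTree as ET

EXPECTED = ["attributes", "nodes", "edges"]


def local(tag):
    """Strip an XML namespace prefix such as '{http://...}nodes'."""
    return tag.rsplit("}", 1)[-1]


def sections_in_order(graph):
    """True if the graph's children follow attributes -> nodes -> edges."""
    ranks = [
        EXPECTED.index(local(child.tag))
        for child in graph
        if local(child.tag) in EXPECTED
    ]
    return ranks == sorted(ranks)


ordered = ET.fromstring("<graph><attributes/><nodes/><edges/></graph>")
shuffled = ET.fromstring("<graph><edges/><attributes/><nodes/></graph>")
print(sections_in_order(ordered), sections_in_order(shuffled))  # True False
```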

What do you think? For the parsers it most likely wouldn't be a large change and it would be a lot more JSON friendly.

Does this mean the attributes declaration at the top of the file would no longer be mandatory? In that case it sounds like a bad idea, especially for parsers that need to allocate a static amount of memory beforehand based on knowledge of the attributes, no?

mbastian commented 2 years ago

Which makes sense, especially if you need to stream the xml file for some reason. But I am unsure whether this order is enforced by the specs. And I have already seen weird things in the wild such as:

Good point. As far as I can see, the order of nodes and edges is enforced, but that of attributes is not, given that it's defined in a separate data.rnc file that is included in gexf.rnc. After browsing the Relax NG documentation I'm not sure we can easily fix this, but I will need to look deeper.

Does this mean the attributes declaration at the top of the file would no longer be mandatory? In that case it sounds like a bad idea, especially for parsers that need to allocate a static amount of memory beforehand based on knowledge of the attributes, no?

Yes that's what I had in mind. How bad is it really? The only difference would be to allocate when you first see a new id versus at the beginning. For which parser do you think that could be an issue?

Yomguithereal commented 2 years ago

For which parser do you think that could be an issue?

I don't think I currently know or use such a parser, but any low-level language parser that defines some kind of static-size struct for nodes and edges based on the attributes declaration would probably have issues, since attributes could now be defined on the fly while reading nodes or edges. What's more, some attributes might not exist on the first nodes but only on subsequent ones, if the attribute can be undefined in the source data or language representation, as is the case with undefined in JS -> JSON. This would not be an issue if you allocate based on a column representation (like a dataframe, for instance), except that you don't know the number of nodes/edges beforehand in the gexf format.

Another argument is the added complexity of parser implementations, because now you have to consider two different methods of attribute declaration.

Another question would also be: what should happen if I have a node:

<node id="42" label="node A">
  <attvalues>
    <attvalue id="url" type="string" value="http://gephi.org"/>
  </attvalues>
</node>

and another one:

<node id="43" label="node B">
  <attvalues>
    <attvalue id="url" type="double" value="4.5"/>
  </attvalues>
</node>

with different types for the same attribute?

Should this be ok/tolerated? Should this raise some kind of validation error?
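
If such a conflict were treated as a validation error, detecting it would take one pass over the values. The Python sketch below uses the inline id+type form discussed in this thread, which is only a proposal and not part of any released schema. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def conflicting_types(nodes):
    """Return the attribute ids that are declared with more than one type."""
    first_seen = {}
    conflicts = set()
    for attvalue in nodes.iter("attvalue"):
        attr_id, attr_type = attvalue.get("id"), attvalue.get("type")
        # setdefault records the first type seen and returns it afterwards.
        if first_seen.setdefault(attr_id, attr_type) != attr_type:
            conflicts.add(attr_id)
    return sorted(conflicts)


# The two nodes from the question above, in the proposed inline form.
nodes = ET.fromstring(
    '<nodes>'
    '<node id="42" label="node A"><attvalues>'
    '<attvalue id="url" type="string" value="http://gephi.org"/>'
    '</attvalues></node>'
    '<node id="43" label="node B"><attvalues>'
    '<attvalue id="url" type="double" value="4.5"/>'
    '</attvalues></node>'
    '</nodes>'
)
print(conflicting_types(nodes))  # ['url']
```

Note that the conflict only becomes visible at node 43, so a streaming parser would already have committed to the string type by then.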

mbastian commented 2 years ago

Thanks @Yomguithereal !

Another argument is the added complexity of parser implementations, because now you have to consider two different methods of attribute declaration.

That's right. Let's leave this out of version 1.3 then, to avoid overcomplicating the parsers. It was a nice-to-have anyway.

gvegayon commented 2 years ago

@Yomguithereal

Should this be ok/tolerated? Should this raise some kind of validation error?

I think it should be a plain error. Also, although defining the attr type on the fly is possible, I think it is better to be consistent, you know, KISS.

@mbastian @Yomguithereal

Which makes sense, especially if you need to stream the xml file for some reason. But I am unsure whether this order is enforced by the specs. And I have already seen weird things in the wild such as:

Defining the nodes at the beginning has one big practical benefit: identifying errors faster. For example, I have often captured errors in some network datasets with ties to undeclared nodes. This may not be as important in small networks, but if you are analyzing a very large file, it could have some performance benefits. I imagine the parser processing an edge, and before continuing, checking against a hash table (not sure how the parser of Gephi is implemented) and making sure the nodes were declared; if not, then throw an error.
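
The check described above can be sketched in Python with a streaming parser and a set of declared node ids. This is an illustration only and says nothing about how the Gephi importer works. It relies on nodes appearing before edges, and the function name is hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def dangling_edges(stream):
    """Stream a GEXF document and report edges that use undeclared nodes.

    Returns a list of (edge id, missing node id) pairs. Assumes that all
    nodes are declared before the first edge is read.
    """
    declared = set()
    problems = []
    for _, el in ET.iterparse(stream, events=("end",)):
        tag = el.tag.rsplit("}", 1)[-1]  # drop a namespace prefix, if any
        if tag == "node":
            declared.add(el.get("id"))
        elif tag == "edge":
            for endpoint in (el.get("source"), el.get("target")):
                if endpoint not in declared:
                    problems.append((el.get("id"), endpoint))
    return problems


document = io.BytesIO(
    b'<gexf><graph>'
    b'<nodes><node id="a"/><node id="b"/></nodes>'
    b'<edges>'
    b'<edge id="0" source="a" target="b"/>'
    b'<edge id="1" source="a" target="c"/>'
    b'</edges>'
    b'</graph></gexf>'
)
problems = dangling_edges(document)
print(problems)  # [('1', 'c')]
```

If edges could appear before nodes, the same check would have to buffer every edge until the end of the file, which is where the performance benefit of a fixed order comes from.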

mbastian commented 2 years ago

@gvegayon Agreed. I changed the specs to make sure the order is enforced: attributes -> nodes -> edges.

mbastian commented 2 years ago

After today's discussion over Zoom and the latest tweaks, I'm confident the specification and the primer are ready to be shipped. We can make some fixes in the documentation later.