Reading and writing graphs

nfoti commented 11 years ago

I was going to implement reading and writing graphML and gexf files. Before I created a dependency on an external package I wanted to see what people thought of XML.jl and libExpat.jl. It seems like both packages can only parse XML, so writing these file formats may be a non-starter right now. Please correct me if I am mistaken in which case I can start.

Alternatively, I have Matlab code to read and write gml files that I could port to Julia. The code handles a useful subset of the gml specification (and can be extended to handle more of it) in order to get graphs in and out of Julia until XML facilities are more complete.

Thanks.

johnmyleswhite commented 11 years ago

I must be confused, but why would XML.jl be able to parse something that's not XML?

Porting the Matlab code sounds like a good start if GML requires large changes to the XML libraries.

lindahua commented 11 years ago

Implementing graph file read/write functions in this package may introduce a lot of external dependencies (e.g. XML.jl). In many applications, users may only use the graph types and algorithms.

What about separating file read/write things to a new package: GraphIO.jl, which can depend on Graphs.jl

cc: @rsofaer

rsofaer commented 11 years ago

I'm pretty sure GraphML and gexf are both XML. I had a day of confusion not long ago when I though GML was an abbreviation of GraphML, rather than a totally different format. http://graphml.graphdrawing.org/specification.html

I've been thinking about expanding dot.jl to read as well as write, which would probably depend on a parser library. I see your point about external dependencies, but I think on balance, the convenience of always being able to read, write and plot (which uses dot.jl for serialization) is pretty big. On the other hand, a doubled loading time is pretty big also:

With using LibExpat in src/Graphs.jl:

julia> @time require("Graphs.jl")
elapsed time: 4.555009312 seconds

Without using LibExpat:

julia> @time require("Graphs.jl")
elapsed time: 2.52399976 seconds

I tried both of these a few times and it was consistent.

nfoti commented 11 years ago

I don't have a problem with having a separate GraphIO package to avoid the dependencies.

GML is an entirely different format that is not XML-based. The networkx package uses pyparsing to construct a parser for the format that they can call from python. My code implements a simple state-machine that supports a subset of the format.

Also, to clarify John's comment, the current XML libraries don't seem to provide a way to construct XML trees that can be written to file. So WRITING graphML or gexf files doesn't seem possible right now (without a lot of work).

rsofaer commented 11 years ago

On the subject of GML, I've been thinking about writing a parser library that works with formal grammars. GML and DOT both provide CFGs, which makes parsing easy if you have a parser combinator.

The GML grammar is in here: http://www.fim.uni-passau.de/fileadmin/files/lehrstuhl/brandenburg/projekte/gml/gml-technical-report.pdf The DOT grammar is here: http://www.graphviz.org/doc/info/lang.html

Anyone have any experience or recommended reading for me if I do that?

nfoti commented 11 years ago

Pyparsing does this in python, lex and yacc are the standard tools in C. A wrapper for lex and yacc generated parsers could be possible.

Sent from my iPhone

On Jul 16, 2013, at 11:40 AM, Raphael Sofaer notifications@github.com wrote:

On the subject of GML, I've been thinking about writing a parser library that works with formal grammars. GML and DOT both provide CFGs, which makes parsing easy if you have a parser combinator.

The GML grammar is in here: http://www.fim.uni-passau.de/fileadmin/files/lehrstuhl/brandenburg/projekte/gml/gml-technical-report.pdf The DOT grammar is here: http://www.graphviz.org/doc/info/lang.html

Anyone have any experience or recommended reading for me if I do that?

— Reply to this email directly or view it on GitHub.

nfoti commented 11 years ago

I forgot how tedious it is to write parsers with lex/flex and yacc/bison. However, it does seem possible to generate a parser as a shared library with these tools and call them from Julia. That does seem like it would be very tedious to handle a new grammar and would require people to learn about lex and yacc.

The pyparsing library that I mentioned may be useful to look at. It provides constructs to specify the CFG in python code and to break input into tokens (it might provide more functionality as well, but this is as much of it as I've seen). The Python module networkx uses pyparsing to read and write gml files, so that might be worth taking a look at just to see it in action. However, implementing something similar to pyparsing in Julia would be a lot of work.

That programming language guys may have better suggestions than these.

ViralBShah commented 11 years ago

Why not just use the pyparsing library through PyCall?

nfoti commented 11 years ago

Using pyparsing through PyCall is an option. It just seemed a little unsatisfying to depend on Python for this. If people don't mind though we can also do the same thing to read and write graphML and gexf files using lxml (the Python wrapper for libxml2 which is much more complete right now). I can take a first pass at this if people think it is an acceptable solution for now.

Should we move this discussion over to a new repository GraphIO.jl who's purpose it is to deal with the common graph file formats?

ViralBShah commented 11 years ago

@lindahua If we just depend on PyCall.jl - would it be ok to have the IO stuff in Graphs.jl? I worry that we may end up fragmenting this nascent package too much if we also have GraphIO. In general, I think it is ok to depend on packages that are pure julia.

lindahua commented 11 years ago

Generally, I don't mind if things don't get very huge. People can explore IO stuff in this package.

This aside, I think a Parser package would be an important infrastructure -- a variety of domains can benefit from this (not only graphs). Currently, using PyCall may be a reasonable solution. In the long run, we should not rely on another language to do such things.

I think people who are interested in file parsing can create a Parser.jl package to start the exploration. If we want to get the job done soon, a temporary solution using PyCall is fine.

rsofaer commented 11 years ago

I think the question of whether to create a new repository will depend on the empirical increase in loading time. We should just develop in Graphs.jl and split it if loading time becomes uncomfortable.

I'm interested in creating pure Julia a Parser.jl package, so I'm going to try it. If the performance is terrible it can go unused.

On Tue, Jul 16, 2013 at 3:18 PM, Dahua Lin notifications@github.com wrote:

Generally, I don't mind if things don't get very huge. People can explore IO stuff in this package.

This aside, I think a Parser package would be an important infrastructure -- a variety of domains can benefit from this (not only graphs). Currently, using PyCall may be a reasonable solution. In the long run, we should not rely on another language to do such things.

I think people who are interested in file parsing can create a Parser.jl package to start the exploration. If we want to get the job done soon, a temporary solution using PyCall is fine.

— Reply to this email directly or view it on GitHubhttps://github.com/JuliaLang/Graphs.jl/issues/37#issuecomment-21066287 .

ViralBShah commented 11 years ago

Sounds like a plan!

sbromberger commented 9 years ago

Has any progress been made with respect to reading/writing graphs? If not,

1) is it ok if I try my hand at a GraphML interface? 2) is there any interest in a quick and dirty simple_graph persistence capability? (I've actually already created this - the question is whether I should a) scrap it in favor of a standard, b) include it along with a standard, and/or c) publish it now while working on the standard) 3) should this be a separate package or should I plan on integrating it into Graphs.jl? 4) Is there another format folks would recommend I start with, other than GraphML? (Is there one that would be more useful?)

Thanks!

timholy commented 9 years ago

Presumably there's room for a standard interface, but it's also worth pointing out that you could presumably use JLD for this. Depending on the storage format, you might have to tackle https://github.com/timholy/HDF5.jl/issues/85, but if nodes are not objects then I think this won't affect you. If it doesn't, it's likely you can just save("mygraph.jld", G) and it should Just Work.

sbromberger commented 9 years ago

it's also worth pointing out that you could presumably use JLD for this

That's actually a very neat idea, and I'd like to try that. My immediate use case is sharing graph information between Julia/Graphs.jl and Python/NetworkX so I can do equivalence and performance testing. I've written a Q&D persistence function (and, of course, its Python equivalent) for GraphCentrality, but it seemed to me that it'd be more appropriate to use some sort of standard graph storage, and I came across this old open issue.

Given that JLD is (I think?) Julia-only, unless there's a strong consensus that it should take priority, I'd prefer to write something that would interface among other libraries. What are your thoughts on that approach?

PS: I don't think we can count on nodes not being objects, in which case - if I understand https://github.com/timholy/HDF5.jl/issues/85 correctly - we run into that bug if there are cycles.

timholy commented 9 years ago

JLD writes in HDF5 format, and so in principle anything that can read HDF5 (Python, Matlab, many others) could open the files. But, that doesn't mean it's easy: any complex object is going to require annotation & metadata, and JLD defines its own format for that. Consequently, if cross-language compatibility is important to you, you're definitely better off adopting some standard.

Yes, if the graph is specified by Arrays of integers, you'd don't have to worry about cycles (because they are "interpreted cycles"), but if objects actually point to each other, then you have to face this. FYI it should not be very hard (use an ObjectIdDict to keep track of which objects have already been written), I've just never needed it and so haven't implemented it myself.

sbromberger commented 9 years ago

JLD works brilliantly, but the file sizes (even with compress=true) are a bit problematic for simple graphs (these are, respectively, a 500-node, 50k-edge and a 1m-node, 10m-edge randomly created graphs). Note that the JLD save of the 1m-10m graph ran for over an hour before I stopped it; it had written 1.1 gigs in that time:

-rw-r--r--  1 seth  staff  2982936 Jan 11 16:03 graph-500-50000-compressed.jld
-rw-r--r--   1 seth  staff  1159429095 Jan 11 17:09 graph-1m-10m-c.jld  (note: truncated)

vs

-rw-r--r--@ 1 seth  staff  161498 Jan  8 21:12 graph-500-50000.csv.gz
-rw-r--r--@ 1 seth  staff  69811801 Jan  8 23:06 graph-1m-10m.csv.gz

It might pay off for more complex graph types with large amounts of node and edge data, though.

Should further discussion about HDF5 / JLD continue here or over on the HDF5 issues list? (Is further discussion warranted?)

timholy commented 9 years ago

Sure, feel free to open an issue over there. FYI in principle it should be possible to do far better than csv; it may need a serialization, though. But for example:

julia> A = rand(10^6);

julia> @time writecsv("/tmp/test.csv", A);
elapsed time: 1.768478306 seconds (198428148 bytes allocated, 22.54% gc time)

julia> @time @save "/tmp/test.jld" A
elapsed time: 0.013694702 seconds (7572 bytes allocated)

There must be something about the graph format that makes it store lots of little objects, and make it less efficient. If, for instance, you packed the node/edge information in much the same way that a SparseMatrixCSC packs it, I suspect you'd get much faster performance.

jpfairbanks commented 9 years ago

@sbromberger: To address your question 3. I think that there will be enough formats and enough graph implementations that having this reading and writing infrastructure independent of specific implementation of the graph data structure will be useful. For instance LightGraphs.jl and GraphLayout.jl would also benefit from having a unified GraphIO.jl package. The graph formats are diverse so encompassing them all will take some good design.

Here are some lists of commonly supported graph formats:

From Gephi
- GEXF
- GDF
- GML
- GraphML
- Pajek NET
- GraphViz DOT
- CSV
- UCINET DL
- Tulip TPL
- Netdraw VNA
- Spreadsheet
GraphStream-Project.org
R's igraph
- edgelist
- pajek
- ncol
- lgl
- graphml
- dimacs
- graphdb
- gml
- dl
SNAP
Any Matrix format which contains the adjacency matrix or incidence matrix

nil-8 commented 7 years ago

I have written some code which can read simple Pajek files (currently only with Vertices and Edges headings). It shouldn't take a lot of effort to adapt it to the full Pajek standard. Is there any interest in this and how can I make this usefully available to other users? I am a completely github newbie.

dehann commented 7 years ago

Hi @nil-8, best practice is to fork the Graphs.jl repo to your local github, develop in a local topic branch and then do pull requests back upstream to this Graphs.jl repo. We keep the master branch on this repo running and clean for general use, but bug fixes and new features always welcome. Please keep the pull requests to Graphs.jl/master as isolated to a single bug or feature addition. There are automated tests using Travis which run at each pull request or update to master, so check

Pkg.test("Graphs")

locally by you before creating the pull request. I'll get to merging the requests as fast I can.

best, dehann

ViralBShah commented 7 years ago

Note that this package is no longer maintained, and LightGraphs.jl is what is best to target. Of course, this particular capability should be its own package and certainly welcome.

sbromberger commented 7 years ago

Just FYI - we have support for Pajek in LightGraphs already.

dehann commented 7 years ago

Hi @ViralBShah, please note some of us are using Graphs.jl because LightGraphs.jl does not cover all the use cases. I am going to help with Graphs as best I can. Graphs is running on 0.5, so I'm sure we can Keep it going till upcoming 1.0 with not too much effort.

dehann commented 7 years ago

Actually, one idea was to maybe make Graphs depend on LightGraphs and remove bad code from Graphs over time.

sbromberger commented 7 years ago

@dehann I'm happy to help out as far as I can, but given my experience with LG since 0.3, I think you may be underestimating (perhaps seriously) the amount of work it will take to keep Graphs.jl error-free across even minor versions. The 0.3- to 0.4 transition was a huge one, as was (to a lesser extent, perhaps) the 0.4- to 0.5 transition. Fundamental things broke, and I think it's safe to assume this will be the case until the Julia API stabilizes with 1.0.

JuliaAttic / OldGraphs.jl

Reading and writing graphs #37