Closed nickeubank closed 7 years ago
~yes, you're misusing~ :) What happens if you just do
g = loadgraph("/users/nick/desktop/anon_voz6_sms18_0.75_months.graphml", "graph", GraphMLFormat())
?
Actually, you should be able to pass an IO object in there. Can you share your graphml file?
Hmmm...
Can't share that file (both due to size and data sharing agreement), but will see if I can make minimal replicable version in coming days.
FWIW, using iGraph in python, read in and exported as edgelist and works without problem.
Them's fightin' words.
(I'll see if I can't figure out the problem here, but I need a MWE.)
Will do in coming days. Gotta run now.
Working on MWE -- a small sample from that graph (just 3 random nodes) doesn't cause problems:
<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
  <key id="v_id" for="node" attr.name="id" attr.type="string"/>
  <key id="v_pt" for="node" attr.name="pt" attr.type="boolean"/>
  <graph id="G" edgedefault="undirected">
    <node id="n0">
      <data key="v_id">n3432</data>
      <data key="v_pt">false</data>
    </node>
    <node id="n1">
      <data key="v_id">n3433</data>
      <data key="v_pt">true</data>
    </node>
    <node id="n2">
      <data key="v_id">n3434</data>
      <data key="v_pt">false</data>
    </node>
    <edge source="n0" target="n1">
    </edge>
    <edge source="n1" target="n2">
    </edge>
  </graph>
</graphml>
Loads fine:
julia> a=loadgraph("/users/nick/desktop/mwe.graphml", "G", GraphMLFormat())
WARNING: Skipping unknown XML element 'key'
WARNING: Skipping unknown XML element 'key'
{3, 2} undirected simple Int64 graph
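For context on the warnings above: the `key` elements declare the vertex attributes that each `data` element refers to, so a loader that skips them still gets the graph structure but appears to drop the attributes. Below is a minimal Python sketch (standard library only) of how those declarations map onto per-node values. The names `read_node_attributes` and `SAMPLE` are made up for illustration and are not part of GraphIO or igraph.

```python
import xml.etree.ElementTree as ET

# GraphML elements live in this XML namespace.
NS = {"g": "http://graphml.graphdrawing.org/xmlns"}

# Hypothetical two-node document in the same shape as the sample in this thread.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <key id="v_id" for="node" attr.name="id" attr.type="string"/>
  <key id="v_pt" for="node" attr.name="pt" attr.type="boolean"/>
  <graph id="G" edgedefault="undirected">
    <node id="n0"><data key="v_id">n3432</data><data key="v_pt">false</data></node>
    <node id="n1"><data key="v_id">n3433</data><data key="v_pt">true</data></node>
    <edge source="n0" target="n1"/>
  </graph>
</graphml>
"""


def read_node_attributes(xml_bytes):
    """Return {node id: {attribute name: value}} for a small GraphML document."""
    root = ET.fromstring(xml_bytes)
    # Each <key for="node"> declares one vertex attribute: key id -> (name, type).
    keys = {
        k.get("id"): (k.get("attr.name"), k.get("attr.type"))
        for k in root.findall("g:key", NS)
        if k.get("for") == "node"
    }
    attrs = {}
    for node in root.findall("g:graph/g:node", NS):
        values = {}
        for data in node.findall("g:data", NS):
            # Fall back to the raw key id if the key was never declared.
            name, typ = keys.get(data.get("key"), (data.get("key"), "string"))
            text = data.text or ""
            values[name] = (text == "true") if typ == "boolean" else text
        attrs[node.get("id")] = values
    return attrs
```

Calling `read_node_attributes(SAMPLE)` gives one dictionary per node. This reads the whole document into memory, so it only suits small files like the sample.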
It's a huge graph -- 10 million nodes? -- could that be related?
Maybe, but I doubt it. If we can get a MWE that fails, I can see what's going on. Sorry that it's not working.
OK, I think I got it -- I think it's a size issue. A file over 2gb causes an overflow in the EzXML library.
If you want a MWE, run this in python (not worth uploading a 2gb file is it?):
import igraph as ig
import os
os.chdir('/users/nick/desktop')
g = ig.Graph.Erdos_Renyi(n=10000000, m=200000000)
g.vs['astring'] = 'test_string'
g.vs['abool'] = False
g.write('mwe_2.graphml', format='graphml')
Reading that file in gets:
julia> g = loadgraph("/users/nick/desktop/mwe_2.graphml", "G", GraphMLFormat())
ERROR: InexactError()
Stacktrace:
[1] macro expansion at /Users/Nick/.julia/v0.6/EzXML/src/error.jl:50 [inlined]
[2] parsexml(::String) at /Users/Nick/.julia/v0.6/EzXML/src/document.jl:91
[3] loadgraphml(::GZip.GZipStream, ::String) at /Users/Nick/.julia/v0.6/GraphIO/src/graphml.jl:30
[4] gzopen(::LightGraphs.##109#110{String,GraphIO.GraphMLFormat}, ::String, ::String) at /Users/Nick/.julia/v0.6/GZip/src/GZip.jl:268
[5] loadgraph(::String, ::String, ::GraphIO.GraphMLFormat) at /Users/Nick/.julia/v0.6/LightGraphs/src/persistence/common.jl:14
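Julia raises an InexactError when a value cannot be represented exactly in the target type. One plausible reading of the trace above is that the byte length of the XML string no longer fits in a signed 32-bit integer once the file passes 2gb. This is an assumption based on the "over 2gb" observation and is not confirmed in this thread. Here is a small Python sketch for checking whether a file crosses that limit before trying to load it (the helper names `fits_in_int32` and `exceeds_int32` are made up for illustration):

```python
import os

# Largest value a signed 32-bit integer can hold (2147483647).
INT32_MAX = 2**31 - 1


def fits_in_int32(n):
    """True if n can be stored in a signed 32-bit integer without loss."""
    return -(2**31) <= n <= INT32_MAX


def exceeds_int32(path):
    """True if the file's size in bytes cannot be stored in a signed 32-bit integer."""
    return not fits_in_int32(os.path.getsize(path))
```

If `exceeds_int32("mwe_2.graphml")` returns `True`, the file is in the size range where the error above was seen.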
@bicycle1885 kindly suggests the way to fix may be:
I think replacing xdoc = parsexml(readstring(io)) with xdoc = readxml(io) in https://github.com/JuliaGraphs/GraphIO.jl/blob/master/src/graphml.jl will solve the problem.
Ah, the InexactError sort of makes sense in that context. Can you try replacing that line and seeing if it works? If it does then you can submit a quick PR and I can merge it today.
kind of... It won't accept a file path string it looks like:
julia> g = loadgraph("/users/nick/desktop/mwe_2.graphml", "G", GraphMLFormat())
ERROR: MethodError: no method matching nb_available(::GZip.GZipStream)
Closest candidates are:
nb_available(::Base.Filesystem.File) at filesystem.jl:162
nb_available(::BufferStream) at stream.jl:1155
nb_available(::IOStream) at iostream.jl:185
...
Stacktrace:
[1] (::EzXML.##9#10)(::Ptr{Void}, ::Ptr{UInt8}, ::Int32) at /Users/Nick/.julia/v0.6/EzXML/src/document.jl:225
[2] macro expansion at /Users/Nick/.julia/v0.6/EzXML/src/error.jl:50 [inlined]
[3] readxml(::GZip.GZipStream) at /Users/Nick/.julia/v0.6/EzXML/src/document.jl:163
[4] loadgraphml(::GZip.GZipStream, ::String) at /Users/Nick/.julia/v0.6/GraphIO/src/graphml.jl:30
[5] gzopen(::LightGraphs.##109#110{String,GraphIO.GraphMLFormat}, ::String, ::String) at /Users/Nick/.julia/v0.6/GZip/src/GZip.jl:268
[6] loadgraph(::String, ::String, ::GraphIO.GraphMLFormat) at /Users/Nick/.julia/v0.6/LightGraphs/src/persistence/common.jl:14
But it'll take a file handle, I think. I'm running out of RAM before it finishes, but it does get as far as parsing. I can't check for sure on my main work computer (which has lots of RAM) for a week, but I think this is sufficient to show it's working. I do wish it didn't have so much memory overhead, though:
julia> f = open("/users/nick/desktop/mwe_2.graphml", "r")
IOStream(<file /users/nick/desktop/mwe_2.graphml>)
julia> g = loadgraph(f, "G", GraphMLFormat())
WARNING: Skipping unknown XML element 'key'
WARNING: Skipping unknown XML element 'key'
Haven't wrestled much with IO in Julia yet -- is there a convenience function for this kind of thing that takes "io" in lots of forms and returns an IOStream?
I'm fixing the persistence logic in LightGraphs so that the individual loadgraph functions don't require IOStreams. This will take a while.
ugh, sorry! thanks for looking into it.
If you rewrite the current GraphML parser, I highly recommend using streaming APIs. It will significantly reduce memory consumption and be faster.
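To illustrate the streaming idea independently of EzXML, here is a rough Python sketch using the standard library's `xml.etree.ElementTree.iterparse`. It is not the GraphIO parser and not the code from any pull request in this thread. The function name `stream_graphml_edges` is made up, and it assumes only node ids and edge endpoints are wanted. Each `node` or `edge` element is detached from its parent as soon as it has been processed, so the XML tree never holds the whole file.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def stream_graphml_edges(source):
    """Read GraphML from a path or binary file object.

    Returns (node_index, edges): node_index maps GraphML node ids to
    consecutive integers starting at 0, and edges is a list of
    (source_index, target_index) pairs.
    """
    node_index = {}
    edges = []
    stack = []  # currently open elements, so the parent is always known

    def index_of(node_id):
        # Assign the next free integer the first time an id is seen.
        return node_index.setdefault(node_id, len(node_index))

    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem)
            continue
        stack.pop()  # elem has just been closed
        name = local_name(elem.tag)
        if name == "node":
            index_of(elem.get("id"))
        elif name == "edge":
            edges.append((index_of(elem.get("source")), index_of(elem.get("target"))))
        else:
            continue
        # Detach the finished element so the tree does not grow with the file.
        if stack:
            stack[-1].remove(elem)
    return node_index, edges
```

The result still holds every edge in memory, as any in-memory graph must, but the intermediate XML tree stays small. Vertex attributes (`key` and `data` elements) are ignored in this sketch.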
@bicycle1885 yeah, this is probably something we should do. I played around with the stream reader yesterday and am still unsure the best way to proceed. If you have time, perhaps you could show some skeleton code based on the existing parser?
I made two pull requests shown above.
@bicycle1885 does the GraphIO PR require CodecZlib? If so we need to bump the LightGraphs dependency in REQUIRES to 0.9.5.
@nickeubank - can you retry graphml load now that #8 has been merged? (Make sure you check out the latest master.) If it works I can go ahead and tag a new version of GraphIO.
@bicycle1885 - thank you very much for #8 !
Thanks @bicycle1885 for the PR!
@sbromberger looks good. Loaded with string file path (not IO object), and no memory explosion like last time (when it hit 30gb to load a 2gb file)!
It did print out the warning:
WARNING: Skipping unknown node 'data'
endlessly (I assume 20 million times -- twice for each node), BUT none of that is first order. :)
Thanks all!
Anyway - now that this is working, I'll close this out - but feel free to open up a new issue if there's something else that needs fixing.
@sbromberger
does the GraphIO PR require CodecZlib? If so we need to bump the LightGraphs dependency in REQUIRES to 0.9.5.
Yes, it would be needed to support reading gzip files. That's why I replaced GZip.jl with CodecZlib.jl on LightGraphs.jl.
The GraphML parser I implemented is very naive because I'm almost ignorant of the file format. I recommend that you verify and modify it with more realistic data sets before releasing it.
Trouble loading graphml. File has vertex attributes, which may be cause of problem?