JuliaGraphs / GraphIO.jl

Graph IO functionality for various formats.

Not sure if mis-using or bug: ERROR: InexactError() #6

Closed nickeubank closed 7 years ago

nickeubank commented 7 years ago

Trouble loading graphml. File has vertex attributes, which may be cause of problem?

julia> f = open("/users/nick/desktop/anon_voz6_sms18_0.75_months.graphml", "r")
IOStream(<file /users/nick/desktop/anon_voz6_sms18_0.75_months.graphml>)

julia> g = loadgraph(f, "graph", GraphMLFormat())
ERROR: InexactError()
Stacktrace:
 [1] macro expansion at /Users/Nick/.julia/v0.6/EzXML/src/error.jl:50 [inlined]
 [2] parsexml(::String) at /Users/Nick/.julia/v0.6/EzXML/src/document.jl:91
 [3] loadgraphml(::IOStream, ::String) at /Users/Nick/.julia/v0.6/GraphIO/src/graphml.jl:30
 [4] loadgraph(::IOStream, ::String, ::GraphIO.GraphMLFormat) at /Users/Nick/.julia/v0.6/GraphIO/src/graphml.jl:114

sbromberger commented 7 years ago

~yes, you're misusing~ :) What happens if you just do

g = loadgraph("/users/nick/desktop/anon_voz6_sms18_0.75_months.graphml", "graph", GraphMLFormat())

?

Actually, you should be able to pass an IO object in there. Can you share your graphml file?

nickeubank commented 7 years ago

Hmmm...

[screenshot, 2017-07-21 2:13 pm]

Can't share that file (both due to size and data sharing agreement), but will see if I can make minimal replicable version in coming days.

FWIW, using iGraph in python, read in and exported as edgelist and works without problem.

sbromberger commented 7 years ago

> FWIW, using iGraph in python, read in and exported as edgelist and works without problem.

Them's fightin' words.

(I'll see if I can't figure out the problem here, but I need a MWE.)

nickeubank commented 7 years ago

Will do in coming days. Gotta run now.

nickeubank commented 7 years ago

Working on MWE -- a small sample from that graph (just 3 random nodes) doesn't cause problems:

<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">

  <key id="v_id" for="node" attr.name="id" attr.type="string"/>
  <key id="v_pt" for="node" attr.name="pt" attr.type="boolean"/>
  <graph id="G" edgedefault="undirected">
    <node id="n0">
      <data key="v_id">n3432</data>
      <data key="v_pt">false</data>
    </node>
    <node id="n1">
      <data key="v_id">n3433</data>
      <data key="v_pt">true</data>
    </node>
    <node id="n2">
      <data key="v_id">n3434</data>
      <data key="v_pt">false</data>
    </node>
    <edge source="n0" target="n1">
    </edge>
    <edge source="n1" target="n2">
    </edge>
  </graph>
</graphml>

Loads fine:

julia> a=loadgraph("/users/nick/desktop/mwe.graphml", "G",  GraphMLFormat())
WARNING: Skipping unknown XML element 'key'
WARNING: Skipping unknown XML element 'key'
{3, 2} undirected simple Int64 graph

It's a huge graph -- 10 million nodes? -- could that be related?

sbromberger commented 7 years ago

> It's a huge graph -- 10 million nodes? -- could that be related?

Maybe, but I doubt it. If we can get a MWE that fails, I can see what's going on. Sorry that it's not working.

nickeubank commented 7 years ago

OK, I think I got it -- I think it's a size issue. A file over 2gb causes an overflow in the EzXML library.

If you want a MWE, run this in python (not worth uploading a 2gb file is it?):

import igraph as ig
import os
os.chdir('/users/nick/desktop')
g = ig.Graph.Erdos_Renyi(n=10000000, m=200000000)
g.vs['astring'] = 'test_string'
g.vs['abool'] = False
g.write('mwe_2.graphml', format='graphml')

Reading that file in gets:

julia> g = loadgraph("/users/nick/desktop/mwe_2.graphml", "G", GraphMLFormat())
ERROR: InexactError()
Stacktrace:
 [1] macro expansion at /Users/Nick/.julia/v0.6/EzXML/src/error.jl:50 [inlined]
 [2] parsexml(::String) at /Users/Nick/.julia/v0.6/EzXML/src/document.jl:91
 [3] loadgraphml(::GZip.GZipStream, ::String) at /Users/Nick/.julia/v0.6/GraphIO/src/graphml.jl:30
 [4] gzopen(::LightGraphs.##109#110{String,GraphIO.GraphMLFormat}, ::String, ::String) at /Users/Nick/.julia/v0.6/GZip/src/GZip.jl:268
 [5] loadgraph(::String, ::String, ::GraphIO.GraphMLFormat) at /Users/Nick/.julia/v0.6/LightGraphs/src/persistence/common.jl:14

@bicycle1885 kindly suggests the way to fix may be:

> I think replacing xdoc = parsexml(readstring(io)) with xdoc = readxml(io) in https://github.com/JuliaGraphs/GraphIO.jl/blob/master/src/graphml.jl will solve the problem.

sbromberger commented 7 years ago

Ah, the InexactError sort of makes sense in that context. Can you try replacing that line and seeing if it works? If it does then you can submit a quick PR and I can merge it today.
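A rough illustration of why a file just over 2 GB would trip this, sketched in Python to match the MWE generator above. It assumes the parser receives the buffer length as a signed 32-bit integer, which the thread suggests but does not confirm; `fits_in_int32` is a made-up helper name, not part of EzXML or GraphIO.

```python
import struct

# Largest value a signed 32-bit integer can hold (2,147,483,647).
INT32_MAX = 2**31 - 1


def fits_in_int32(n_bytes: int) -> bool:
    """Return True if n_bytes can be stored in a signed 32-bit integer."""
    try:
        struct.pack("<i", n_bytes)  # raises struct.error when out of range
        return True
    except struct.error:
        return False


one_gib = 1 * 1024**3  # comfortably below the limit
two_gib = 2 * 1024**3  # exactly INT32_MAX + 1

print(fits_in_int32(one_gib), fits_in_int32(two_gib))
```

Reading the whole file into one string and handing its length to such an interface fails at the conversion step, before any XML is parsed, which would be consistent with the error showing up inside parsexml.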

nickeubank commented 7 years ago

kind of... It won't accept a file path string it looks like:

julia> g = loadgraph("/users/nick/desktop/mwe_2.graphml", "G", GraphMLFormat())
ERROR: MethodError: no method matching nb_available(::GZip.GZipStream)
Closest candidates are:
  nb_available(::Base.Filesystem.File) at filesystem.jl:162
  nb_available(::BufferStream) at stream.jl:1155
  nb_available(::IOStream) at iostream.jl:185
  ...
Stacktrace:
 [1] (::EzXML.##9#10)(::Ptr{Void}, ::Ptr{UInt8}, ::Int32) at /Users/Nick/.julia/v0.6/EzXML/src/document.jl:225
 [2] macro expansion at /Users/Nick/.julia/v0.6/EzXML/src/error.jl:50 [inlined]
 [3] readxml(::GZip.GZipStream) at /Users/Nick/.julia/v0.6/EzXML/src/document.jl:163
 [4] loadgraphml(::GZip.GZipStream, ::String) at /Users/Nick/.julia/v0.6/GraphIO/src/graphml.jl:30
 [5] gzopen(::LightGraphs.##109#110{String,GraphIO.GraphMLFormat}, ::String, ::String) at /Users/Nick/.julia/v0.6/GZip/src/GZip.jl:268
 [6] loadgraph(::String, ::String, ::GraphIO.GraphMLFormat) at /Users/Nick/.julia/v0.6/LightGraphs/src/persistence/common.jl:14

But it'll take a file handle, I think. I'm running out of RAM before it finishes, but it gets as far as parsing. I can't check for sure on my main work computer (the one with lots of RAM) for a week, but I think this is sufficient to show it's working. Though I wish it didn't have so much memory overhead!:

julia> f = open("/users/nick/desktop/mwe_2.graphml", "r")
IOStream(<file /users/nick/desktop/mwe_2.graphml>)

julia> g = loadgraph(f, "G", GraphMLFormat())
WARNING: Skipping unknown XML element 'key'
WARNING: Skipping unknown XML element 'key'

Haven't wrestled much with IO in Julia yet -- there a convenience function for this kind of thing that takes "io" in lots of forms and returns an IOStream?

sbromberger commented 7 years ago

I'm fixing the persistence logic in LightGraphs so that the individual loadgraph functions don't require IOStreams. This will take a while.

nickeubank commented 7 years ago

ugh, sorry! thanks for looking into it.

bicycle1885 commented 7 years ago

If you rewrite the current GraphML parser, I highly recommend using streaming APIs. It will significantly reduce memory consumption and be faster.
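To make the streaming idea concrete, here is a minimal sketch in Python using the standard library's `xml.etree.ElementTree.iterparse`. It is not the GraphIO parser and not the EzXML stream reader; the function name `read_graphml_edges` and the choice to ignore `<key>` and `<data>` elements are assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET

# GraphML elements live in this XML namespace.
NS = "{http://graphml.graphdrawing.org/xmlns}"


def read_graphml_edges(source):
    """Stream a GraphML document; return (id -> index map, list of index pairs)."""
    index = {}      # node id string -> 0-based vertex index
    raw_edges = []  # (source id, target id) pairs as they appear in the file
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == NS + "node":
            index[elem.get("id")] = len(index)
            # Drop the node's <data> children and attributes right away.
            # The emptied element stays attached to its parent, so memory
            # use is reduced rather than constant.
            elem.clear()
        elif elem.tag == NS + "edge":
            raw_edges.append((elem.get("source"), elem.get("target")))
            elem.clear()
    # Resolve ids after the pass so edges may precede their nodes.
    return index, [(index[s], index[t]) for s, t in raw_edges]


sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph id="G" edgedefault="undirected">
    <node id="n0"><data key="v_id">n3432</data></node>
    <node id="n1"><data key="v_id">n3433</data></node>
    <edge source="n0" target="n1"/>
  </graph>
</graphml>"""

index, edges = read_graphml_edges(io.BytesIO(sample))
print(index, edges)
```

The point of the design is that only the small id map and the edge list persist; the attribute payload of each node is discarded as soon as its closing tag is seen, instead of being held in a full document tree.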

sbromberger commented 7 years ago

@bicycle1885 yeah, this is probably something we should do. I played around with the stream reader yesterday and am still unsure the best way to proceed. If you have time, perhaps you could show some skeleton code based on the existing parser?

bicycle1885 commented 7 years ago

I made two pull requests shown above.

sbromberger commented 7 years ago

@bicycle1885 does the GraphIO PR require CodecZlib? If so we need to bump the LightGraphs dependency in REQUIRES to 0.9.5.

sbromberger commented 7 years ago

@nickeubank - can you retry graphml load now that #8 has been merged? (Make sure you check out the latest master.) If it works I can go ahead and tag a new version of GraphIO.

@bicycle1885 - thank you very much for #8 !

nickeubank commented 7 years ago

Thanks @bicycle1885 for the PR!

@sbromberger looks good. Loaded with a string file path (not an IO object), and no memory explosion like last time (when it hit 30gb to load a 2gb file)!

It did print out the warning:

WARNING: Skipping unknown node 'data'

endlessly (I assume 20 million times -- twice for each node), BUT none of that is first order. :)

Thanks all!

sbromberger commented 7 years ago

That's actually expected. We don't parse the data nodes, and warn about it. You can comment out the three lines that emit the warning (linked as 1, 2, 3 in the original comment) to suppress it.
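An alternative to commenting the lines out would be to warn once per element name and count the rest. The Python sketch below only illustrates that pattern; `UnknownElementLog` is a hypothetical helper and nothing like it is claimed to exist in GraphIO.

```python
import sys
from collections import Counter


class UnknownElementLog:
    """Print a warning the first time a tag is skipped; count later ones silently."""

    def __init__(self, stream=sys.stderr):
        self.counts = Counter()  # tag name -> number of times it was skipped
        self.stream = stream

    def skip(self, tag):
        self.counts[tag] += 1
        if self.counts[tag] == 1:
            print(
                f"WARNING: Skipping unknown node '{tag}' "
                "(further occurrences are counted, not printed)",
                file=self.stream,
            )


log = UnknownElementLog()
for _ in range(1000):
    log.skip("data")  # only the first call prints
```

After loading, `log.counts` still reports how many elements were skipped, so the information in the repeated warnings is kept without the flood of output.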

sbromberger commented 7 years ago

Anyway - now that this is working, I'll close this out - but feel free to open up a new issue if there's something else that needs fixing.

bicycle1885 commented 7 years ago

@sbromberger

> does the GraphIO PR require CodecZlib? If so we need to bump the LightGraphs dependency in REQUIRES to 0.9.5.

Yes, it would be needed to support reading gzip files. That's why I replaced GZip.jl with CodecZlib.jl in LightGraphs.jl.
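For readers unfamiliar with why a compression library is involved at all: the loader is expected to accept both plain and gzip-compressed graph files from the same call. A minimal Python sketch of that behaviour, using only the standard library, is below. `open_maybe_gzip` is a hypothetical name, and detecting gzip by its two magic bytes is an assumption about how one might do it, not a description of CodecZlib.jl or LightGraphs.jl.

```python
import gzip

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def open_maybe_gzip(path):
    """Open path for binary reading, decompressing transparently if it is gzip."""
    with open(path, "rb") as probe:
        magic = probe.read(2)
    if magic == GZIP_MAGIC:
        return gzip.open(path, "rb")  # streaming decompression
    return open(path, "rb")
```

Either way the caller gets a readable binary stream, so a streaming parser downstream does not need to know whether the file on disk was compressed.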

The GraphML parser I implemented is very naive because I'm almost ignorant of the file format. I recommend that you verify and modify it with more realistic data sets before releasing it.