RDFLib / rdflib-sqlalchemy

RDFLib store using SQLAlchemy dbapi as back-end

Working with huge graphs #62

Closed sharpaper closed 4 years ago

sharpaper commented 4 years ago

"Huge" as in "too big to fit into memory". I don't think I've seen this mentioned in the readme, so please excuse me if it's a dumb question. But considering that rdflib works in-memory, I was wondering whether this plugin also needs to load the whole graph into memory in order to work.

mwatts15 commented 4 years ago

Great question, @sharpaper. rdflib-sqlalchemy can work with graphs larger than available memory, but there are caveats. rdflib-sqlalchemy allocates memory proportional to the number of triples you add in a single addN call, so it's possible to outstrip your memory capacity that way. Also, for a call to triples(), we allocate memory proportional to the number of triples returned.

As a work-around for adding triples, it's somewhat inconvenient, but you can split the inserts into chunks that do fit in memory. There isn't a great work-around for triples() memory usage other than finding queries with smaller result sets. That said, now that I see that memory usage is a problem, I don't think it would be too difficult to limit memory usage here. I created a couple of issues (#63 and #64) to address this.
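For anyone landing here, the chunked-insert work-around could be sketched roughly as below. This is only an illustration, not code from rdflib-sqlalchemy: `chunked`, `add_in_chunks`, and the default `chunk_size` are hypothetical names and values, and the sketch assumes the quads come from a lazy iterable (for example a generator reading a file) so that only one chunk is materialized at a time.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Tuple

# (subject, predicate, object, context), the shape that addN expects
Quad = Tuple[object, object, object, object]


def chunked(items: Iterable[Quad], size: int) -> Iterator[List[Quad]]:
    """Yield successive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def add_in_chunks(add_batch: Callable[[List[Quad]], object],
                  quads: Iterable[Quad],
                  chunk_size: int = 10_000) -> int:
    """Pass `quads` to `add_batch` one chunk at a time.

    `chunk_size` is an arbitrary illustrative default; tune it to your memory.
    Returns the number of chunks sent.
    """
    count = 0
    for batch in chunked(quads, chunk_size):
        add_batch(batch)
        count += 1
    return count
```

With a graph backed by this store, the call would presumably look like `add_in_chunks(graph.addN, my_quads())`, where `my_quads` is your own generator. Whether you also need to commit between chunks depends on your setup, so treat that as something to check rather than a given.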

sharpaper commented 4 years ago

> There isn't a great work-around to triples() memory usage other than finding queries with smaller result sets

This makes sense; the client should limit the range of possible results. However, have you considered Python iterators? It could be a nice addition.

mwatts15 commented 4 years ago

Using iterators isn't really the problem in this case. Rather, there's a dictionary that's accumulating triples that needn't do that. At least that's the first thing I see.
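To make the distinction concrete, here is a toy sketch of why returning an iterator does not help on its own. It is not the actual rdflib-sqlalchemy code: the function names, the row shape, and the idea that rows are grouped by context are all assumptions made for illustration. Both functions are generators, but the first fills a dictionary with every result before yielding anything, so its memory still grows with the result set. The second keeps only one group at a time, assuming the rows arrive sorted by (subject, predicate, object).

```python
from typing import Dict, Iterable, Iterator, Optional, Set, Tuple

Triple = Tuple[str, str, str]
Row = Tuple[str, str, str, str]  # hypothetical row: (subject, predicate, object, context)


def triples_accumulating(rows: Iterable[Row]) -> Iterator[Tuple[Triple, Set[str]]]:
    """A generator that still holds every result: it fills a dict before yielding."""
    seen: Dict[Triple, Set[str]] = {}
    for s, p, o, ctx in rows:
        seen.setdefault((s, p, o), set()).add(ctx)
    yield from seen.items()


def triples_streaming(rows: Iterable[Row]) -> Iterator[Tuple[Triple, Set[str]]]:
    """Yield each group as soon as it ends.

    Assumes rows arrive sorted by (subject, predicate, object), so only the
    current group has to be kept in memory.
    """
    current: Optional[Triple] = None
    contexts: Set[str] = set()
    for s, p, o, ctx in rows:
        triple = (s, p, o)
        if triple != current:
            if current is not None:
                yield current, contexts
            current, contexts = triple, set()
        contexts.add(ctx)
    if current is not None:
        yield current, contexts
```

On sorted input the two produce the same pairs; the difference is only in how much is held in memory while iterating.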
