RDFLib / rdflib

RDFLib is a Python library for working with RDF, a simple yet powerful language for representing information.
https://rdflib.readthedocs.org
BSD 3-Clause "New" or "Revised" License

serialize function (serialize()) struggling with larger data/unoptimized for it #2135

Open peterphilips opened 1 year ago

peterphilips commented 1 year ago

I've been attempting to build a graph with several million triples and serialize it to a Turtle file. However, the serialization time behaves oddly when using .serialize(destination="file.ttl"). As an example, one file (28.4 MB, approx. 450k lines/triples) took about 3.5 minutes, while a file roughly 5x as large (142 MB, 2.2 million lines/triples) took 86.5 minutes. Timings were taken with Python's time.time() before and after the call.
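For reference, the timing was done along these lines (a sketch; the input file name and parse step are placeholders for however the graph is actually built):

import time
from rdflib import Graph

g = Graph()
g.parse("input.nt", format="nt")   # placeholder: build the graph however you normally do

start = time.time()
g.serialize(destination="file.ttl", format="turtle")
print(f"serialize() took {time.time() - start:.1f} s")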

I would expect serialization to scale at least roughly linearly, but it does not seem to for larger graphs. It also suggests to me there may be a problem with the I/O calls; perhaps they are unoptimised or the serializer is making too many of them. In theory, serialization should only be limited by hardware. Any solutions/potential fixes would be great.

As a reference, here is the serialize call for the 28.4 MB file run under a profiler (cProfile):

2151663231 function calls (2151492056 primitive calls) in 493.452 seconds

   Ordered by: cumulative time
   List reduced from 332 to 15 due to restriction <15>

   ncalls      tottime  percall  cumtime  percall filename:lineno(function)
        1        0.001    0.001  493.452  493.452 graph.py:1137(serialize)
        1        0.067    0.067  493.427  493.427 turtle.py:226(serialize)
  1844631        2.419    0.000  479.026    0.000 turtle.py:272(getQName)
  1659054        0.781    0.000  473.365    0.000 graph.py:1047(compute_qname)
  1659054        2.130    0.000  472.393    0.000 __init__.py:489(compute_qname)
   119998      126.341    0.001  440.080    0.004 __init__.py:842(get_longest_namespace)
        1        0.197    0.197  383.829  383.829 turtle.py:100(preprocess)
   400029        1.088    0.000  382.030    0.001 turtle.py:257(preprocessTriple)
955226858      239.332    0.000  314.192    0.000 term.py:223(startswith)
    58983        0.051    0.000  108.685    0.002 turtle.py:318(statement)
    58983        0.101    0.000  108.576    0.002 turtle.py:322(s_default)
838640/776999    0.382    0.000  104.461    0.000 turtle.py:337(path)
   828239        0.507    0.000  103.381    0.000 turtle.py:344(p_default)
   828239        1.219    0.000  101.938    0.000 turtle.py:350(label)
1005472237      79.328    0.000   79.328    0.000 {method 'startswith' of 'str' objects}
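(For completeness, a profile like the one above can be produced with something along these lines; a sketch, assuming the graph g from the timing snippet above:)

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
g.serialize(destination="file.ttl", format="turtle")
profiler.disable()

# Sort by cumulative time and keep the top 15 rows, matching the restriction above.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(15)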

Cheers,

Peter

ghost commented 1 year ago

I'm not sure I agree with the expectation that Turtle serialization time should scale linearly because, unlike N-Triples, there isn't a straight triple-to-I/O relationship: there are lists and qnames to be handled ¹.

I used SP²Bench to generate some datasets to investigate the turtle serialization wall-clock times (in nanoseconds) for Jena, RDFLib and pymantic:

triples       Jena         RDFLib        pymantic
20ktriples    56089250     754599581     218089092
40ktriples    125826657    1414003788    506332458
80ktriples    384700945    2959297056    873459954
160ktriples   408564609    5659717174    1875179030
320ktriples   721129630    11587010883   3427364813
640ktriples   1039394962   24082981725   7387675250

[plot: Turtle serialization wall-clock time vs. triple count for Jena, RDFLib and pymantic]
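The RDFLib column was collected with something along these lines (a sketch, not the exact harness; it assumes the SP²Bench dumps are named as in the riot script further down):

import time
from rdflib import Graph

for n in (20, 40, 80, 160, 320, 640):
    g = Graph()
    g.parse(f"{n}ktriples.nt", format="nt")
    t0 = time.perf_counter_ns()                           # wall-clock in nanoseconds
    g.serialize(destination="/tmp/dump.ttl", format="turtle")
    print(f"{n}ktriples,{time.perf_counter_ns() - t0}")   # serialization time only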


I see the same profile as you do, but I don't think it suggests I/O issues kicking in as it scales; it's just a lot of qname processing, which becomes more significant as the size of the graph increases:

20k triples

$ python3 test_turtle_serializer_profiling.py 
         3560973 function calls (3548620 primitive calls) in 1.639 seconds

   Ordered by: internal time, cumulative time, primitive call count
   List reduced from 273 to 14 due to restriction <0.05>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   365566    0.136    0.000    0.136    0.000 term.py:160(__eq__)
    86550    0.112    0.000    0.685    0.000 plugins/serializers/turtle.py:281(getQName)
    23894    0.111    0.000    0.157    0.000 namespace/__init__.py:789(split_uri)
    66619    0.088    0.000    0.365    0.000 namespace/__init__.py:489(compute_qname)
    41997    0.081    0.000    0.423    0.000 plugins/serializers/turtle.py:359(label)
    20342    0.058    0.000    0.590    0.000 plugins/serializers/turtle.py:266(preprocessTriple)
    42794    0.050    0.000    0.130    0.000 plugins/serializers/turtle.py:197(addNamespace)
    45774    0.047    0.000    0.181    0.000 plugins/stores/memory.py:422(triples)
   282778    0.047    0.000    0.092    0.000 {built-in method builtins.isinstance}
    93138    0.040    0.000    0.066    0.000 plugins/serializers/turtle.py:171(write)
    45774    0.036    0.000    0.235    0.000 graph.py:2237(triples)
    40684    0.034    0.000    0.071    0.000 plugins/stores/memory.py:658(__get_context_for_triple)
4048/3008    0.032    0.000    0.827    0.000 plugins/serializers/turtle.py:430(predicateList)
    66619    0.032    0.000    0.397    0.000 graph.py:460(compute_qname)

640k triples

$ python3 test_turtle_serializer_profiling.py 
         118398209 function calls (118066889 primitive calls) in 48.427 seconds

   Ordered by: internal time, cumulative time, primitive call count
   List reduced from 273 to 14 due to restriction <0.05>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 12025702    4.244    0.000    4.244    0.000 term.py:160(__eq__)
   732739    3.369    0.000    4.776    0.000 namespace/__init__.py:789(split_uri)
  2735998    3.218    0.000   20.349    0.000 plugins/serializers/turtle.py:281(getQName)
  2077227    2.468    0.000   11.298    0.000 namespace/__init__.py:489(compute_qname)
  1318032    2.241    0.000   11.910    0.000 plugins/serializers/turtle.py:359(label)
   640052    1.679    0.000   18.084    0.000 plugins/serializers/turtle.py:266(preprocessTriple)
  1344566    1.473    0.000    3.792    0.000 plugins/serializers/turtle.py:197(addNamespace)
  8886367    1.372    0.000    2.565    0.000 {built-in method builtins.isinstance}
  1426958    1.355    0.000    5.179    0.000 plugins/stores/memory.py:422(triples)
  2903079    1.156    0.000    2.042    0.000 plugins/serializers/turtle.py:171(write)
  1426958    0.972    0.000    6.618    0.000 graph.py:2237(triples)
  1280104    0.960    0.000    2.065    0.000 plugins/stores/memory.py:658(__get_context_for_triple)
  2776053    0.937    0.000    0.937    0.000 term.py:205(__lt__)
   640052    0.874    0.000    1.868    0.000 plugins/serializers/turtle.py:106(preprocessTriple)

pymantic uses a different approach to implementing curie/qname processing but it remains a significant component, as evidenced by its profile:

640k triples

$ python test/test_wip.py 
         36372495 function calls (36244385 primitive calls) in 17.507 seconds

   Ordered by: internal time, cumulative time, primitive call count
   List reduced from 158 to 8 due to restriction <0.05>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    2.391    2.391   17.473   17.473 pymantic/serializers.py:124(serialize_turtle)
  1208636    2.355    0.000    3.185    0.000 pymantic/primitives.py:62(parse_curie)
   867752    1.799    0.000    2.469    0.000 pymantic/primitives.py:101(to_curie)
  1371426    1.372    0.000    7.702    0.000 pymantic/serializers.py:76(turtle_repr)
  2771685    1.310    0.000    2.043    0.000 pymantic/primitives.py:426(match)
   760207    0.810    0.000    3.930    0.000 pymantic/rdf.py:569(is_list)
   261898    0.789    0.000    1.256    0.000 pymantic/serializers.py:67(turtle_string_escape)
   128547    0.528    0.000    0.756    0.000 pymantic/serializers.py:60(escape_prefix_local)

> Any solutions/potential fixes would be great.

All I can suggest, pro tem, is to serialize as N-Triples and use Jena's riot command-line utility to re-serialize as Turtle. Here's the bash script I used to establish the Jena serialization times; a Python sketch of the workaround itself follows it:

#!/bin/bash
for i in \
    20ktriples.nt \
    40ktriples.nt \
    80ktriples.nt \
    160ktriples.nt \
    320ktriples.nt \
    640ktriples.nt
do
# save time taken to parse
t0=`date +%s%N`
~/jena/bin/riot --sink $i
t1=`date +%s%N`
t2=`date +%s%N`
~/jena/bin/riot --output=Turtle $i > /tmp/dump
t3=`date +%s%N`
# subtract time taken to parse
tx=$(($t1 - $t0))
ty=$(($t3 - $t2))
echo "${i%%.nt},$(($ty - $tx))"
done
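And the workaround in Python (a sketch; it assumes riot is on the PATH, otherwise use the full path as in the script above, and the input file is a placeholder):

import subprocess
from rdflib import Graph

g = Graph()
g.parse("input.nt", format="nt")                 # placeholder: build the graph as usual

# N-Triples is a flat, one-triple-per-line format, so RDFLib emits it quickly.
g.serialize(destination="/tmp/dump.nt", format="nt")

# Re-serialize as Turtle with Jena's riot.
with open("file.ttl", "wb") as out:
    subprocess.run(["riot", "--output=Turtle", "/tmp/dump.nt"], stdout=out, check=True)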

¹ Yeah, okay, Jena's performance on this is sobering but it was developed by a mature and very highly-skilled team, some being ex-colleagues of mine from HPLabs in Bristol.

peterphilips commented 1 year ago

@gjhiggins Thanks for this detailed response. I will attempt the Jena workaround pro tem.

Are there any recommendations for optimising the qname processing? I can see that, in the long term, optimising this could prove significant.

Cheers,

Peter

ghost commented 1 year ago

> Are there any recommendations for optimising the qname processing? I can see that, in the long term, optimising this could prove significant.

Not that I can see straight away. The getQName call is crucial for compactness (pymantic seems to skip it and just output full URIs, afaict) and is eventually resolved in NamespaceManager, which has already received attention w.r.t. performance. As this gprof2dot¹ image shows, the computational effort is fairly well spread:

[gprof2dot call graph of the serialization profile (output.png)]

¹

$ python -m cProfile -o test.pstats test.py
$ gprof2dot -f pstats test.pstats | dot -Tpng -o output.png && eog output.png

tgbugs commented 1 year ago

The qname processing already uses a trie internally and is about as optimal as you can get. If you are converting to ttl from another format that does not populate qnames at parse time, it will probably be slow because all the qnames have to be computed when serializing. If you know the qnames in advance you might be able to prepopulate the trie. That said, looking at the profiling, it seems like someone might have mucked about with the serialization code lately and caused a regression?
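A minimal sketch of what prepopulating might look like; the example.org namespace is purely illustrative, so bind whichever prefixes your data actually uses:

from rdflib import Graph, Namespace

g = Graph()
g.parse("input.nt", format="nt")          # placeholder input

# Hypothetical vocabulary: binding known prefixes up front (as suggested above)
# may mean fewer qnames have to be worked out during serialization.
EX = Namespace("http://example.org/vocab#")
g.bind("ex", EX)

g.serialize(destination="output.ttl", format="turtle")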