RDFLib / rdflib

RDFLib is a Python library for working with RDF, a simple yet powerful language for representing information.
https://rdflib.readthedocs.org
BSD 3-Clause "New" or "Revised" License

serialize function (serialize()) struggling with larger data/unoptimized for it #2135

Open peterphilips opened 1 year ago

peterphilips commented 1 year ago

I've been attempting to build a graph with several million triples and serialize it to a Turtle file. However, the serialization time behaves oddly when using .serialize(destination="file.ttl"). As an example, one file (28.4 MB, approx. 450k lines/triples) took about 3.5 minutes, while a file roughly 5x as large (142 MB, 2.2 million lines/triples) took 86.5 minutes. Timings were taken with Python's time.time() before and after the call.
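For reference, the timing was done along these lines (a sketch; the input file name and parse step are placeholders for however the graph is actually built):

import time
from rdflib import Graph

g = Graph()
g.parse("input.nt", format="nt")   # placeholder: build the graph however you normally do

start = time.time()
g.serialize(destination="file.ttl", format="turtle")
print(f"serialize() took {time.time() - start:.1f} s")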

I would expect serialization to scale at least roughly linearly, but it does not seem to for larger graphs. It also suggests to me there may be a problem with the I/O calls; perhaps they are unoptimised or the serializer is making too many of them. In theory, serialization should only be limited by hardware. Any solutions/potential fixes would be great.

As a reference, here is the serialize call for the 28.4 MB file run under a profiler (cProfile):

2151663231 function calls (2151492056 primitive calls) in 493.452 seconds

   Ordered by: cumulative time
   List reduced from 332 to 15 due to restriction <15>

   ncalls      tottime  percall  cumtime  percall filename:lineno(function)
        1        0.001    0.001  493.452  493.452 graph.py:1137(serialize)
        1        0.067    0.067  493.427  493.427 turtle.py:226(serialize)
  1844631        2.419    0.000  479.026    0.000 turtle.py:272(getQName)
  1659054        0.781    0.000  473.365    0.000 graph.py:1047(compute_qname)
  1659054        2.130    0.000  472.393    0.000 __init__.py:489(compute_qname)
   119998      126.341    0.001  440.080    0.004 __init__.py:842(get_longest_namespace)
        1        0.197    0.197  383.829  383.829 turtle.py:100(preprocess)
   400029        1.088    0.000  382.030    0.001 turtle.py:257(preprocessTriple)
955226858      239.332    0.000  314.192    0.000 term.py:223(startswith)
    58983        0.051    0.000  108.685    0.002 turtle.py:318(statement)
    58983        0.101    0.000  108.576    0.002 turtle.py:322(s_default)
838640/776999    0.382    0.000  104.461    0.000 turtle.py:337(path)
   828239        0.507    0.000  103.381    0.000 turtle.py:344(p_default)
   828239        1.219    0.000  101.938    0.000 turtle.py:350(label)
1005472237      79.328    0.000   79.328    0.000 {method 'startswith' of 'str' objects}
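(For completeness, a profile like the one above can be produced with something along these lines; a sketch, assuming the graph g from the timing snippet above:)

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
g.serialize(destination="file.ttl", format="turtle")
profiler.disable()

# Sort by cumulative time and keep the top 15 rows, matching the restriction above.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(15)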

Cheers,

Peter

ghost commented 1 year ago

I'm not sure I agree with the expectation that Turtle serialization time should scale linearly because, unlike N-Triples, there isn't a straight triple-to-I/O relationship: there are lists and qnames to be handled ¹.

I used SP²Bench to generate some datasets to investigate the turtle serialization wall-clock times (in nanoseconds) for Jena, RDFLib and pymantic:

triples       Jena         RDFLib        pymantic
20ktriples    56089250     754599581     218089092
40ktriples    125826657    1414003788    506332458
80ktriples    384700945    2959297056    873459954
160ktriples   408564609    5659717174    1875179030
320ktriples   721129630    11587010883   3427364813
640ktriples   1039394962   24082981725   7387675250

[plot: Turtle serialization wall-clock time vs. triple count for Jena, RDFLib and pymantic]
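The RDFLib column was collected with something along these lines (a sketch, not the exact harness; it assumes the SP²Bench dumps are named as in the riot script further down):

import time
from rdflib import Graph

for n in (20, 40, 80, 160, 320, 640):
    g = Graph()
    g.parse(f"{n}ktriples.nt", format="nt")
    t0 = time.perf_counter_ns()                           # wall-clock in nanoseconds
    g.serialize(destination="/tmp/dump.ttl", format="turtle")
    print(f"{n}ktriples,{time.perf_counter_ns() - t0}")   # serialization time only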


I see the same profile as you do, but I don't think it suggests I/O issues kicking in as it scales; it's just a lot of qname processing, which becomes more significant as the size of the graph increases:

20k triples

$ python3 test_turtle_serializer_profiling.py 
         3560973 function calls (3548620 primitive calls) in 1.639 seconds

   Ordered by: internal time, cumulative time, primitive call count
   List reduced from 273 to 14 due to restriction <0.05>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   365566    0.136    0.000    0.136    0.000 term.py:160(__eq__)
    86550    0.112    0.000    0.685    0.000 plugins/serializers/turtle.py:281(getQName)
    23894    0.111    0.000    0.157    0.000 namespace/__init__.py:789(split_uri)
    66619    0.088    0.000    0.365    0.000 namespace/__init__.py:489(compute_qname)
    41997    0.081    0.000    0.423    0.000 plugins/serializers/turtle.py:359(label)
    20342    0.058    0.000    0.590    0.000 plugins/serializers/turtle.py:266(preprocessTriple)
    42794    0.050    0.000    0.130    0.000 plugins/serializers/turtle.py:197(addNamespace)
    45774    0.047    0.000    0.181    0.000 plugins/stores/memory.py:422(triples)
   282778    0.047    0.000    0.092    0.000 {built-in method builtins.isinstance}
    93138    0.040    0.000    0.066    0.000 plugins/serializers/turtle.py:171(write)
    45774    0.036    0.000    0.235    0.000 graph.py:2237(triples)
    40684    0.034    0.000    0.071    0.000 plugins/stores/memory.py:658(__get_context_for_triple)
4048/3008    0.032    0.000    0.827    0.000 plugins/serializers/turtle.py:430(predicateList)
    66619    0.032    0.000    0.397    0.000 graph.py:460(compute_qname)

640k triples

$ python3 test_turtle_serializer_profiling.py 
         118398209 function calls (118066889 primitive calls) in 48.427 seconds

   Ordered by: internal time, cumulative time, primitive call count
   List reduced from 273 to 14 due to restriction <0.05>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 12025702    4.244    0.000    4.244    0.000 term.py:160(__eq__)
   732739    3.369    0.000    4.776    0.000 namespace/__init__.py:789(split_uri)
  2735998    3.218    0.000   20.349    0.000 plugins/serializers/turtle.py:281(getQName)
  2077227    2.468    0.000   11.298    0.000 namespace/__init__.py:489(compute_qname)
  1318032    2.241    0.000   11.910    0.000 plugins/serializers/turtle.py:359(label)
   640052    1.679    0.000   18.084    0.000 plugins/serializers/turtle.py:266(preprocessTriple)
  1344566    1.473    0.000    3.792    0.000 plugins/serializers/turtle.py:197(addNamespace)
  8886367    1.372    0.000    2.565    0.000 {built-in method builtins.isinstance}
  1426958    1.355    0.000    5.179    0.000 plugins/stores/memory.py:422(triples)
  2903079    1.156    0.000    2.042    0.000 plugins/serializers/turtle.py:171(write)
  1426958    0.972    0.000    6.618    0.000 graph.py:2237(triples)
  1280104    0.960    0.000    2.065    0.000 plugins/stores/memory.py:658(__get_context_for_triple)
  2776053    0.937    0.000    0.937    0.000 term.py:205(__lt__)
   640052    0.874    0.000    1.868    0.000 plugins/serializers/turtle.py:106(preprocessTriple)

pymantic uses a different approach to implementing curie/qname processing but it remains a significant component, as evidenced by its profile:

640k triples

$ python test/test_wip.py 
         36372495 function calls (36244385 primitive calls) in 17.507 seconds

   Ordered by: internal time, cumulative time, primitive call count
   List reduced from 158 to 8 due to restriction <0.05>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    2.391    2.391   17.473   17.473 pymantic/serializers.py:124(serialize_turtle)
  1208636    2.355    0.000    3.185    0.000 pymantic/primitives.py:62(parse_curie)
   867752    1.799    0.000    2.469    0.000 pymantic/primitives.py:101(to_curie)
  1371426    1.372    0.000    7.702    0.000 pymantic/serializers.py:76(turtle_repr)
  2771685    1.310    0.000    2.043    0.000 pymantic/primitives.py:426(match)
   760207    0.810    0.000    3.930    0.000 pymantic/rdf.py:569(is_list)
   261898    0.789    0.000    1.256    0.000 pymantic/serializers.py:67(turtle_string_escape)
   128547    0.528    0.000    0.756    0.000 pymantic/serializers.py:60(escape_prefix_local)

> Any solutions/potential fixes would be great.

All I can suggest, pro tem, is to serialize as N-Triples and use Jena's riot command-line utility to re-serialize as Turtle. Here's the bash script I used to establish the Jena serialization times; a Python sketch of the workaround itself follows it:

#!/bin/bash
for i in \
    20ktriples.nt \
    40ktriples.nt \
    80ktriples.nt \
    160ktriples.nt \
    320ktriples.nt \
    640ktriples.nt
do
# save time taken to parse
t0=`date +%s%N`
~/jena/bin/riot --sink $i
t1=`date +%s%N`
t2=`date +%s%N`
~/jena/bin/riot --output=Turtle $i > /tmp/dump
t3=`date +%s%N`
# subtract time taken to parse
tx=$(($t1 - $t0))
ty=$(($t3 - $t2))
echo "${i%%.nt},$(($ty - $tx))"
done
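And the workaround in Python (a sketch; it assumes riot is on the PATH, otherwise use the full path as in the script above, and the input file is a placeholder):

import subprocess
from rdflib import Graph

g = Graph()
g.parse("input.nt", format="nt")                 # placeholder: build the graph as usual

# N-Triples is a flat, one-triple-per-line format, so RDFLib emits it quickly.
g.serialize(destination="/tmp/dump.nt", format="nt")

# Re-serialize as Turtle with Jena's riot.
with open("file.ttl", "wb") as out:
    subprocess.run(["riot", "--output=Turtle", "/tmp/dump.nt"], stdout=out, check=True)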

¹ Yeah, okay, Jena's performance on this is sobering but it was developed by a mature and very highly-skilled team, some being ex-colleagues of mine from HPLabs in Bristol.

peterphilips commented 1 year ago

@gjhiggins Thanks for this detailed response. I will attempt the Jena workaround pro tem.

Are there any recommendations for optimising the qname processing? I can see that, in the long term, optimising this could prove significant.

Cheers,

Peter

ghost commented 1 year ago

> Are there any recommendations for optimising the qname processing? I can see that, in the long term, optimising this could prove significant.

Not that I can see straight away. The getQName call is crucial for compactness (pymantic seems to skip it and just output full URIs, afaict) and is eventually resolved in NamespaceManager, which has already received attention w.r.t. performance. As this gprof2dot¹ image shows, the computational effort is fairly well spread:

[gprof2dot call graph of the serialization profile (output.png)]

¹

$ python -m cProfile -o test.pstats test.py
$ gprof2dot -f pstats test.pstats | dot -Tpng -o output.png && eog output.png

tgbugs commented 1 year ago

The qname processing already uses a trie internally and is about as optimal as you can get. If you are converting to ttl from another format that does not populate qnames at parse time, it will probably be slow because all the qnames have to be computed when serializing. If you know the qnames in advance you might be able to prepopulate the trie. That said, looking at the profiling, it seems like someone might have mucked about with the serialization code lately and caused a regression?
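A minimal sketch of what prepopulating might look like; the example.org namespace is purely illustrative, so bind whichever prefixes your data actually uses:

from rdflib import Graph, Namespace

g = Graph()
g.parse("input.nt", format="nt")          # placeholder input

# Hypothetical vocabulary: binding known prefixes up front (as suggested above)
# may mean fewer qnames have to be worked out during serialization.
EX = Namespace("http://example.org/vocab#")
g.bind("ex", EX)

g.serialize(destination="output.ttl", format="turtle")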