peterphilips opened 1 year ago
I'm not sure I agree with your expectation that Turtle serialization time should scale linearly because, unlike N-Triples, there isn't a straight triple-to-I/O relationship; there are lists and qnames to be handled¹.

I used SP²Bench to generate some datasets and investigated the Turtle serialization wall-clock times (in nanoseconds) for Jena, RDFLib and pymantic:
| triples | Jena | RDFLib | pymantic |
| --- | --- | --- | --- |
| 20ktriples | 56089250 | 754599581 | 218089092 |
| 40ktriples | 125826657 | 1414003788 | 506332458 |
| 80ktriples | 384700945 | 2959297056 | 873459954 |
| 160ktriples | 408564609 | 5659717174 | 1875179030 |
| 320ktriples | 721129630 | 11587010883 | 3427364813 |
| 640ktriples | 1039394962 | 24082981725 | 7387675250 |
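For reference, wall-clock numbers like the table above can be collected with a small stdlib-only harness. This is a sketch, not the exact script used for the table; the `serialize` callable and the stand-in datasets are placeholders:

```python
import time

def bench(serialize, datasets):
    """Time serialize(data) in wall-clock nanoseconds for each named dataset."""
    results = {}
    for name, data in datasets:
        t0 = time.perf_counter_ns()
        serialize(data)
        t1 = time.perf_counter_ns()
        results[name] = t1 - t0
    return results

# Stand-in "serializer" over doubling input sizes: if serialization
# scales linearly, each timing should roughly double with size.
datasets = [(f"{n}ktriples", ["<s> <p> <o> ."] * (n * 1000))
            for n in (20, 40, 80)]
times = bench(lambda triples: "\n".join(triples), datasets)
for name, ns in times.items():
    print(name, ns)
```

Swapping the lambda for a real `graph.serialize(format="turtle")` call gives directly comparable per-dataset timings.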
I see the same profile as you do, but I don't think it suggests that I/O issues are kicking in as it scales; it's just a lot of qname processing, which becomes more significant as the size of the graph increases:
```
$ python3 test_turtle_serializer_profiling.py
3560973 function calls (3548620 primitive calls) in 1.639 seconds
Ordered by: internal time, cumulative time, primitive call count
List reduced from 273 to 14 due to restriction <0.05>
ncalls tottime percall cumtime percall filename:lineno(function)
365566 0.136 0.000 0.136 0.000 term.py:160(__eq__)
86550 0.112 0.000 0.685 0.000 plugins/serializers/turtle.py:281(getQName)
23894 0.111 0.000 0.157 0.000 namespace/__init__.py:789(split_uri)
66619 0.088 0.000 0.365 0.000 namespace/__init__.py:489(compute_qname)
41997 0.081 0.000 0.423 0.000 plugins/serializers/turtle.py:359(label)
20342 0.058 0.000 0.590 0.000 plugins/serializers/turtle.py:266(preprocessTriple)
42794 0.050 0.000 0.130 0.000 plugins/serializers/turtle.py:197(addNamespace)
45774 0.047 0.000 0.181 0.000 plugins/stores/memory.py:422(triples)
282778 0.047 0.000 0.092 0.000 {built-in method builtins.isinstance}
93138 0.040 0.000 0.066 0.000 plugins/serializers/turtle.py:171(write)
45774 0.036 0.000 0.235 0.000 graph.py:2237(triples)
40684 0.034 0.000 0.071 0.000 plugins/stores/memory.py:658(__get_context_for_triple)
4048/3008 0.032 0.000 0.827 0.000 plugins/serializers/turtle.py:430(predicateList)
66619 0.032 0.000 0.397 0.000 graph.py:460(compute_qname)
```
```
$ python3 test_turtle_serializer_profiling.py
118398209 function calls (118066889 primitive calls) in 48.427 seconds
Ordered by: internal time, cumulative time, primitive call count
List reduced from 273 to 14 due to restriction <0.05>
ncalls tottime percall cumtime percall filename:lineno(function)
12025702 4.244 0.000 4.244 0.000 term.py:160(__eq__)
732739 3.369 0.000 4.776 0.000 namespace/__init__.py:789(split_uri)
2735998 3.218 0.000 20.349 0.000 plugins/serializers/turtle.py:281(getQName)
2077227 2.468 0.000 11.298 0.000 namespace/__init__.py:489(compute_qname)
1318032 2.241 0.000 11.910 0.000 plugins/serializers/turtle.py:359(label)
640052 1.679 0.000 18.084 0.000 plugins/serializers/turtle.py:266(preprocessTriple)
1344566 1.473 0.000 3.792 0.000 plugins/serializers/turtle.py:197(addNamespace)
8886367 1.372 0.000 2.565 0.000 {built-in method builtins.isinstance}
1426958 1.355 0.000 5.179 0.000 plugins/stores/memory.py:422(triples)
2903079 1.156 0.000 2.042 0.000 plugins/serializers/turtle.py:171(write)
1426958 0.972 0.000 6.618 0.000 graph.py:2237(triples)
1280104 0.960 0.000 2.065 0.000 plugins/stores/memory.py:658(__get_context_for_triple)
2776053 0.937 0.000 0.937 0.000 term.py:205(__lt__)
640052 0.874 0.000 1.868 0.000 plugins/serializers/turtle.py:106(preprocessTriple)
```
pymantic uses a different approach to implementing curie/qname processing but it remains a significant component, as evidenced by its profile:
```
$ python test/test_wip.py
36372495 function calls (36244385 primitive calls) in 17.507 seconds
Ordered by: internal time, cumulative time, primitive call count
List reduced from 158 to 8 due to restriction <0.05>
ncalls tottime percall cumtime percall filename:lineno(function)
1 2.391 2.391 17.473 17.473 pymantic/serializers.py:124(serialize_turtle)
1208636 2.355 0.000 3.185 0.000 pymantic/primitives.py:62(parse_curie)
867752 1.799 0.000 2.469 0.000 pymantic/primitives.py:101(to_curie)
1371426 1.372 0.000 7.702 0.000 pymantic/serializers.py:76(turtle_repr)
2771685 1.310 0.000 2.043 0.000 pymantic/primitives.py:426(match)
760207 0.810 0.000 3.930 0.000 pymantic/rdf.py:569(is_list)
261898 0.789 0.000 1.256 0.000 pymantic/serializers.py:67(turtle_string_escape)
128547 0.528 0.000 0.756 0.000 pymantic/serializers.py:60(escape_prefix_local)
```
> Any solutions/potential fixes would be great.
All I can suggest, pro tem, is to serialize as N-Triples and use Jena's riot command-line utility to re-serialize as Turtle. Here's the bash script I used to establish the Jena serialization times:
```bash
#!/bin/bash
for i in \
    20ktriples.nt \
    40ktriples.nt \
    80ktriples.nt \
    160ktriples.nt \
    320ktriples.nt \
    640ktriples.nt
do
    # save time taken to parse
    t0=`date +%s%N`
    ~/jena/bin/riot --sink $i
    t1=`date +%s%N`
    t2=`date +%s%N`
    ~/jena/bin/riot --output=Turtle $i > /tmp/dump
    t3=`date +%s%N`
    # subtract time taken to parse
    tx=$(($t1 - $t0))
    ty=$(($t3 - $t2))
    echo "${i%%.nt},$(($ty - $tx))"
done
```
¹ Yeah, okay, Jena's performance on this is sobering but it was developed by a mature and very highly-skilled team, some being ex-colleagues of mine from HPLabs in Bristol.
@gjhiggins Thanks for this detailed response. I will attempt the Jena workaround pro tem.
Are there any recommendations for optimising the qname processing? I can see that, in the long term, optimising this could prove significant.
Cheers,
Peter
> Are there any recommendations in optimising the qname processing - I can see in the long term optimising this can prove to be significant.
Not that I can see straight away. The `getQName` call is crucial for compactness (pymantic seems to skip it and just output full URIs, afaict) and is eventually resolved in NamespaceManager, which has received attention w.r.t. performance. As this gprof2dot image¹ shows, the computational effort is fairly well spread.

¹ Generated with:

```
$ python -m cProfile -o test.pstats test.py
$ gprof2dot -f pstats test.pstats | dot -Tpng -o output.png && eog output.png
```
The qname processing already uses a trie internally and is about as optimal as you can get. If you are converting to ttl from another format that does not populate qnames at parse time it will probably be slow because it will have to compute all the qnames when serializing. If you know the qnames in advance you might be able to prepopulate the trie, but looking at the profiling it seems like someone might have mucked about with the serialization code lately and caused a regression?
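To illustrate why knowing the prefixes up front helps (a toy model, not rdflib's actual trie): with the namespaces prepopulated, splitting a URI into `prefix:local` is a bounded longest-prefix match rather than an open-ended search.

```python
def make_qname_splitter(prefixes):
    """prefixes: dict of prefix label -> namespace URI.
    Returns a function mapping a full URI to 'label:local',
    trying longer namespaces first so the longest match wins."""
    ordered = sorted(prefixes.items(), key=lambda kv: -len(kv[1]))
    def qname(uri):
        for label, ns in ordered:
            if uri.startswith(ns):
                return f"{label}:{uri[len(ns):]}"
        return uri  # no known namespace: fall back to the full URI
    return qname

qname = make_qname_splitter({
    "foaf": "http://xmlns.com/foaf/0.1/",
    "dc": "http://purl.org/dc/elements/1.1/",
})
print(qname("http://xmlns.com/foaf/0.1/name"))  # foaf:name
```

The fallback branch is where the cost lives in practice: every URI that matches no prepopulated namespace forces the serializer to compute and register a new prefix.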
I've been attempting to create a graph with multiple millions of triples and serialize it into a Turtle file. However, the time it takes to serialize is very strange when using `.serialize(destination="file.ttl")`. As an example, one file (28.4 MB, approx. 450k lines/triples) took approximately 3.5 minutes, yet a file 5x as large (142 MB, 2.2 million lines/triples) took 86.5 minutes. This was timed with Python's `time.time()` function before and after execution.
I would expect serialization to scale at least linearly, but it does not seem to at larger sizes. This suggests to me there may be a problem with the I/O calls; perhaps they are not optimised, or too many are being made. In theory, serialization should only be limited by hardware. Any solutions/potential fixes would be great.
As a reference, here is the output of running the serialize code with the 28.4 MB file under a profiler (cProfile):
```
2151663231 function calls (2151492056 primitive calls) in 493.452 seconds
Ordered by: cumulative time
List reduced from 332 to 15 due to restriction <15>
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.001 0.001 493.452 493.452 graph.py:1137(serialize)
1 0.067 0.067 493.427 493.427 turtle.py:226(serialize)
1844631 2.419 0.000 479.026 0.000 turtle.py:272(getQName)
1659054 0.781 0.000 473.365 0.000 graph.py:1047(compute_qname)
1659054 2.130 0.000 472.393 0.000 __init__.py:489(compute_qname)
119998 126.341 0.001 440.080 0.004 __init__.py:842(get_longest_namespace)
1 0.197 0.197 383.829 383.829 turtle.py:100(preprocess)
400029 1.088 0.000 382.030 0.001 turtle.py:257(preprocessTriple)
955226858 239.332 0.000 314.192 0.000 term.py:223(startswith)
58983 0.051 0.000 108.685 0.002 turtle.py:318(statement)
58983 0.101 0.000 108.576 0.002 turtle.py:322(s_default)
838640/776999 0.382 0.000 104.461 0.000 turtle.py:337(path)
828239 0.507 0.000 103.381 0.000 turtle.py:344(p_default)
828239 1.219 0.000 101.938 0.000 turtle.py:350(label)
1005472237 79.328 0.000 79.328 0.000 {method 'startswith' of 'str' objects}
```
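The profile above shows `get_longest_namespace` accounting for most of the time via roughly a billion `startswith` calls, yet only ~1.66M qnames are computed, so the same URIs are being rescanned many times. A generic illustration of the usual mitigation (this is my sketch, not rdflib's code, and it assumes the set of distinct URIs is much smaller than the number of lookups): memoize the URI-to-qname result so each distinct URI is scanned once.

```python
from functools import lru_cache

NAMESPACES = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def compute_qname(uri):
    """The slow path: one linear startswith scan over all namespaces."""
    for label, ns in sorted(NAMESPACES.items(), key=lambda kv: -len(kv[1])):
        if uri.startswith(ns):
            return f"{label}:{uri[len(ns):]}"
    return uri

# Memoized wrapper: repeated lookups of the same URI become dict hits.
cached_qname = lru_cache(maxsize=None)(compute_qname)

for _ in range(3):
    cached_qname("http://xmlns.com/foaf/0.1/name")
print(cached_qname.cache_info().hits)  # 2 hits after 3 calls
```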
Cheers,
Peter