Speed up parsing of annotations (XMLs)

leandervaneekelen commented 3 weeks ago

Hi @martvanrijthoven,

I have some huge XML files in which I store my slide-level detection inference (>100.000 cells/detections). I noticed that loading these from ASAP-formatted XMLs can take a very long time and that loading them from JSONs takes about equally as long (I was under the impression that converting the XMLs to JSON via scripts/convert_asapxml_to_json.py would result in a speed-up, but perhaps not at the scales I am working at).

Here is a code snippet for making a large annotations file:

# Generate random whole slide annotation file
import random
from shapely.geometry import Point
from wholeslidedata.annotation.types import PointAnnotation
from wholeslidedata.annotation.labels import Label
from wholeslidedata.interoperability.asap.annotationwriter import write_asap_annotation

seed = 0
random.seed(seed) # Fixed seed

n_points = int(1e5)
points = ((Point(random.randint(0, 100), random.randint(0, 100))) for _ in range(n_points))
label = Label("0", 0)
annotations = [PointAnnotation(i, label, p) for i, p in enumerate(points)]
write_asap_annotation(annotations, "/tmp/random_annotations.xml")

I did some profiling and found out that during the initiation of the WholeSlideAnnotation object a significant time is spent on inserting points into the rtree used for WholeSlideAnnotation.select_annotation calls:

from wholeslidedata.annotation.wholeslideannotation import WholeSlideAnnotation
import cProfile
import pstats
from pstats import SortKey
cProfile.run('WholeSlideAnnotation("/tmp/random_annotations.xml")', filename='profile_baseline')
p = pstats.Stats('./profile_baseline')
p.strip_dirs().sort_stats(SortKey.TIME).print_stats()


         14714081 function calls (14714075 primitive calls) in 27.252 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   161690   11.174    0.000   14.100    0.000 index.py:415(insert)
   161690    2.093    0.000    2.130    0.000 types.py:76(__init__)
   161690    1.354    0.000    1.438    0.000 point.py:196(geos_point_from_py)
   161690    1.178    0.000    1.939    0.000 index.py:339(get_coordinate_pointers)
   161690    0.829    0.000    1.894    0.000 coords.py:69(__getitem__)
   323380    0.518    0.000    0.555    0.000 coords.py:44(_update)
   323380    0.511    0.000    0.723    0.000 index.py:1558(get_dimension)
   161690    0.492    0.000    4.250    0.000 types.py:128(bounds)
   161690    0.436    0.000    0.581    0.000 predicates.py:23(__call__)
   161691    0.387    0.000    0.538    0.000 index.py:1537(get_index_type)
        1    0.379    0.379   18.730   18.730 selector.py:21(__init__)
   161690    0.379    0.000    0.632    0.000 coords.py:48(__len__)
   485070    0.364    0.000    0.364    0.000 base.py:224(empty)
   161690    0.360    0.000    0.360    0.000 {built-in method numpy.array}
        1    0.344    0.344    7.294    7.294 parser.py:101(parse)
   323380    0.321    0.000    0.321    0.000 base.py:69(geometry_type_name)
   161690    0.321    0.000    1.010    0.000 base.py:696(is_empty)
        6    0.306    0.051    0.333    0.056 parser.py:108(<listcomp>)

With some digging I found out that you can instantiate an rtree.index from a stream and that this offers a significant speed-up:

Thu Oct 24 14:26:58 2024    [./profile](https://file+.vscode-resource.vscode-cdn.net/home/leander/Desktop/profile)

         13743958 function calls (13743952 primitive calls) in 13.411 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   161690    1.269    0.000    1.346    0.000 point.py:196(geos_point_from_py)
   161690    1.094    0.000    1.111    0.000 types.py:76(__init__)
   161691    0.924    0.000    5.288    0.000 index.py:1239(py_next_item)
   161690    0.671    0.000    1.519    0.000 coords.py:69(__getitem__)
        2    0.583    0.291    5.871    2.935 index.py:1410(__init__)
   161696    0.429    0.000    0.495    0.000 labels.py:25(__init__)
   323380    0.416    0.000    0.451    0.000 coords.py:44(_update)
   323381    0.406    0.000    0.406    0.000 __init__.py:506(cast)
        1    0.377    0.377    6.469    6.469 parser.py:101(parse)
   161690    0.353    0.000    3.281    0.000 types.py:128(bounds)
   161690    0.347    0.000    0.421    0.000 index.py:1182(deinterleave)
   323380    0.323    0.000    0.323    0.000 base.py:69(geometry_type_name)
   161690    0.315    0.000    0.438    0.000 predicates.py:23(__call__)
   161690    0.309    0.000    0.805    0.000 labels.py:133(_label_from_dict)
        6    0.304    0.051    0.330    0.055 parser.py:108(<listcomp>)
   161690    0.292    0.000    5.321    0.000 types.py:64(create)
   161690    0.288    0.000    0.507    0.000 coords.py:48(__len__)
   485070    0.283    0.000    0.283    0.000 base.py:224(empty)

Now, the biggest timesink is initiating GEOS objects in Shapely.

Some benchmarking with timeit confirms the optimization boost:

from wholeslidedata.annotation.wholeslideannotation import WholeSlideAnnotation
%timeit -r 5 -n 5 WholeSlideAnnotation("/tmp/random_annotations.xml")

original: 11.3 s ± 457 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)
optimized: 6.71 s ± 136 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)

Cool crisp 40% speedup :sunglasses:

Let me know what you think! Btw, I read that shapely 2.0 contains a lot of optimizations, so we can probably get this time down even further. Do you know what would be the biggest hurdle for moving to shapely 2.0 in WSD? Is this even feasible/desirable?

coveralls commented 3 weeks ago

Pull Request Test Coverage Report for Build 11500271529

Details

1 of 1 (100.0%) changed or added relevant line in 1 file are covered.
No unchanged relevant lines lost coverage.
Overall coverage decreased (-0.02%) to 72.611%

Totals
Change from base Build 11386276311:	-0.02%
Covered Lines:	2219
Relevant Lines:	3056

💛 - Coveralls

martvanrijthoven commented 3 weeks ago

Hi Leander,

This is amazing, very nice speedup. Thank you so much! I think there are no hurdles for moving to shapely2.0 and i think it should already work.

DIAGNijmegen / pathology-whole-slide-data