acl-org / acl-anthology

Data and software for building the ACL Anthology.
https://aclanthology.org
Apache License 2.0

Faster loading of Anthology class #835

Open mjpost opened 4 years ago

mjpost commented 4 years ago

Instantiating the Anthology class in the Python API takes 30 seconds or so, while all the files are loaded and parsed. This is inconvenient when testing. It would be nice if the expensive data structures could be built lazily, or cached in some manner.

mbollmann commented 4 years ago

Caching

I've tried adding serialization functionality to the Anthology class, so it could potentially be instantiated from a cached file instead of loading from the XML/YAML. This adds quite a bit of overhead that also needs to be maintained if the classes change; you can see this in commit 7604717.

Results on my system: instantiating from the cache takes 6–7 seconds, compared to 22 seconds for a full load from the XML/YAML. That's a lot less, but not quite the performance I was hoping for. I've used msgpack as the data format and lz4 for fast compression; there may be faster options, but my timings suggest that instantiating the Python objects is the larger bottleneck anyway.
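For reference, the approach boils down to something like this minimal sketch (save_cache/load_cache and the cache file name are made up here; the real code in the commit serializes the actual Anthology data structures):

```python
# Minimal sketch: serialize plain data with msgpack, compress with lz4,
# and rebuild from the cache file on the next run.
import msgpack
import lz4.frame

CACHE_FILE = "anthology.cache"  # hypothetical cache location

def save_cache(data):
    # data: e.g. a dict mapping Anthology IDs to plain-dict representations
    payload = msgpack.packb(data, use_bin_type=True)
    with open(CACHE_FILE, "wb") as f:
        f.write(lz4.frame.compress(payload))

def load_cache():
    with open(CACHE_FILE, "rb") as f:
        payload = lz4.frame.decompress(f.read())
    return msgpack.unpackb(payload, raw=False)
```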

Lazy loading

This could certainly be an alternative; however, I tried the caching route first because it would also cut down on regular build times (the Anthology object is instantiated by several scripts during the build). Lazy loading, on the other hand, would only help scripts outside the build chain (which don't need to access all the info) or during development.

Access to anthology.volumes or anthology.papers could easily be made to lazily load the respective XML files (as long as their name can be uniquely inferred from the ID, which is currently the case). The same goes for anthology.venues and anthology.sigs, which are instantiated from YAML files.
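As a rough sketch of what that could look like (LazyVolumeIndex is a hypothetical name, and the real code would build Volume objects rather than return the parsed tree):

```python
import os
from lxml import etree

class LazyVolumeIndex:
    """Resolves a volume ID to its parsed XML only on first access."""

    def __init__(self, xml_dir):
        self.xml_dir = xml_dir
        self._cache = {}

    def __getitem__(self, volume_id):
        if volume_id not in self._cache:
            # The collection's file name is inferable from the ID,
            # e.g. "2020.acl-main" lives in 2020.acl.xml.
            collection_id, _, _ = volume_id.rpartition("-")
            path = os.path.join(self.xml_dir, f"{collection_id}.xml")
            self._cache[volume_id] = etree.parse(path)
        return self._cache[volume_id]
```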

The only thing that can't really be instantiated lazily is the person index (anthology.pindex), as accessing information about a given person requires loading all the XML files to see where that name appears. A possible compromise would therefore be to pre-compute this index as an XML or YAML file. I'm not sure whether that should be a file we keep in the repo (and update with a script when XML files change) or a cached file managed by the AnthologyIndex.
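The pre-computation itself could be as simple as the following sketch (the schema handling is simplified, and people-index.yaml is a made-up file name):

```python
# Scan all collection XML files once, record where each name appears,
# and dump the resulting index to YAML for later lazy lookups.
import glob
import os
import yaml
from lxml import etree

def build_person_index(xml_dir):
    index = {}  # "Last, First" -> list of collection file stems
    for path in sorted(glob.glob(os.path.join(xml_dir, "*.xml"))):
        stem = os.path.splitext(os.path.basename(path))[0]
        tree = etree.parse(path)
        for person in tree.iter("author", "editor"):
            first = person.findtext("first") or ""
            last = person.findtext("last") or ""
            index.setdefault(f"{last}, {first}", []).append(stem)
    return index

with open("people-index.yaml", "w") as f:
    yaml.safe_dump(build_person_index("data/xml"), f)
```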

mjpost commented 4 years ago

Thanks for this investigation!

Caching to improve the build time makes sense, but I worry about adding maintenance. It looks like you have to manually specify how to serialize and unserialize each object. Let's say I add a class member and then forget to add it to the serialization code—will the mistake be obvious?
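One way to make such a mistake hard to miss might be a round-trip test (a sketch only; serialize/deserialize are hypothetical stand-ins for whatever the caching code exposes):

```python
# Round-trip an object and compare its full __dict__, so any member the
# serialization code misses fails loudly instead of silently disappearing.
def test_paper_roundtrip(paper):
    restored = deserialize(serialize(paper))  # hypothetical functions
    assert vars(restored) == vars(paper), "serialization dropped a member"
```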

My main interest was really speed for external use of the API, so the lazy loading is appealing. But why can't the person index be lazily loaded? I am thinking of use cases where I instantiate Anthology but then never access the person index. It would be nice not to have to pay that overhead until I try to create a person object.

I like the idea of lazy loading and caching the person index only, which would presumably add only a bit of complexity (versus caching for everything).

mbollmann commented 3 years ago

Some more thoughts (mainly for myself :)) and investigation:

As a first step, I now think the most promising route is to start optimizing the code we already have; profiling the code with pyinstrument gives some interesting insights (full profiler output here):

[pyinstrument output for instantiating the Anthology class]
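For reference, a profile like this can be produced roughly as follows (the import path and the importdir argument are assumptions about the local setup):

```python
from pyinstrument import Profiler
from anthology import Anthology

profiler = Profiler()
profiler.start()
anthology = Anthology(importdir="data")  # the expensive step
profiler.stop()
print(profiler.output_text(unicode=True, color=True))
```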

This reveals that 30% of the time spent instantiating the Anthology class – 7 seconds out of 23 total on my machine – goes into resolving names, so our current code for this is super costly! There's also a lot of cumulative time going into building full Anthology IDs, formatting titles, and inferring URLs, all of which seem like things that could be done lazily on demand.

Second, to be able to compute more things lazily, I'd like to refactor the code to get rid of the universal .attrib dict of metadata that many objects hold. As it stands, the .attrib dict is filled with all kinds of information needed for building the Anthology website, and this happens immediately upon parsing the objects from the XML; the build scripts then simply copy .attrib as a starting point. This effectively prevents computing these attributes lazily when needed, and also obscures in the build scripts what metadata attributes there actually are (something that has annoyed me in the past).
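A minimal sketch of that direction, using functools.cached_property so each derived attribute is computed on first access instead of eagerly at parse time (the field names and URL pattern are illustrative):

```python
from functools import cached_property

class Paper:
    def __init__(self, collection_id, volume_id, paper_id):
        self.collection_id = collection_id
        self.volume_id = volume_id
        self.paper_id = paper_id

    @cached_property
    def full_id(self):
        # Computed once, on demand, then memoized.
        return f"{self.collection_id}-{self.volume_id}.{self.paper_id}"

    @cached_property
    def pdf_url(self):
        # Inferred lazily rather than stored in a universal .attrib dict.
        return f"https://aclanthology.org/{self.full_id}.pdf"
```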

At that point, it's probably best to look at some profiler graphs again to see what bottlenecks remain and how to address them.

mbollmann commented 3 years ago

Here's an updated report from pyinstrument after merging #1473 (caveat: I don't remember which device I ran the previous test on :)):

[updated pyinstrument profile]

Besides already looking much faster, I suspect that refactoring the name handling (re #1530) might shave off another second or so.

mjpost commented 3 years ago

Fantastic—thanks for the report and the merge.