SynBioDex / pySBOL3

Native python implementation of SBOL 3.0 specification
MIT License
37 stars 16 forks source link

Accelerate loading #381

Closed jakebeal closed 2 years ago

jakebeal commented 2 years ago

Loading large documents is currently very slow. We should diagnose it and determine if it can be accelerated.

Example large file: https://github.com/iGEM-Engineering/iGEM-distribution/blob/develop/distribution.nt

tcmitchell commented 2 years ago

I've been digging into this. The test file @jakebeal provided is very useful. It takes in the neighborhood of 7 minutes to load on my development machine. Looking only at Document._parse_graph, it takes about 3 seconds to create the objects and assign their attributes. After that, it takes a whopping 6 minutes and 38 seconds finding orphans.

What are orphans? Orphans are objects in a SBOL3 file that are not part of the SBOL3 hierarchy. An SBOL3 document consists of Top Level objects, dependent or child objects, and orphans. A Document maintains a list of orphans so they are written out to file when the Document is serialized. They are otherwise not used by pySBOL3. In this particular document there are no orphans. So the 6+ minute search is fruitless. But it must be done so that documents are properly round-tripped. Otherwise these objects will be lost on serialization.

Caching doesn't have an impact because the list of URIs used in the search are by definition unique. At the top level we don't search for the same URI twice. There is effectively nothing to cache in the orphan search phase.

A couple of possibilities come to mind, and I'll be exploring those. We may be able to take advantage of the hierarchical naming in typical SBOL3 documents to narrow where we search for URIs to see if they are orphans. And we may be able to narrow the set of possible orphan URIs for which we search.

tcmitchell commented 2 years ago

@jakebeal had the insight that we should not do orphan checks on objects that we have added to parents as child objects. The optimization was to keep a dictionary mapping URI -> object for all objects that are added to parent objects as owned (or child) objects. We have all of that information handy at the time we add the child object to the parent. This dictionary is now returned from Document._parse_attributes. It was simply a matter of recording the objects as they are assigned. Additionally, I added a similar dictionary containing the top level objects in the document, keyed by URI.

This gave us two dictionaries with URI keys. Now the search for orphans is greatly sped up by doing membership tests in those dictionaries using the URI of the possible orphan. If those tests fail we do the full search through the document. With this optimization in place, document load time decreases by more than 90%. A file that used to take 7+ minutes to load now takes 7 seconds to load.