LionWeb-io / lionweb-repository

Reference implementation of LionWeb repository
Apache License 2.0

Introducing MetaPointers Table #86

Closed ftomassetti closed 2 months ago

ftomassetti commented 2 months ago

With this Pull Request we introduce support for using a MetaPointers Table.

MetaPointers are immutable. We store them in a table, and never change them.

We assume that the number of distinct metapointers is limited, so we cache them.

For insert operations, we first collect all the metapointers the operation refers to and ensure that each of them is already present in the MetaPointers Table. If any are missing, we insert them and record the ids they receive in the MetaPointers Table. We then use those ids when building the main insertion queries.
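The insertion flow described above can be sketched as follows. This is a minimal illustration, not the repository's code: it uses an in-memory SQLite database for self-containment, and the table names, column names, and the functions `resolve_ids` and `insert_nodes` are all hypothetical. A metapointer is modelled as a `(language, version, key)` tuple.

```python
import sqlite3

# Hypothetical schema: names are illustrative, not taken from the repository.
SCHEMA = """
CREATE TABLE metapointers (
    id       INTEGER PRIMARY KEY,
    language TEXT NOT NULL,
    version  TEXT NOT NULL,
    key      TEXT NOT NULL,
    UNIQUE (language, version, key)
);
CREATE TABLE nodes (
    id         TEXT PRIMARY KEY,
    classifier INTEGER NOT NULL REFERENCES metapointers(id)
);
"""

def resolve_ids(conn, cache, metapointers):
    """Return {metapointer: id}, touching the DB only for uncached metapointers."""
    missing = {mp for mp in metapointers if mp not in cache}
    for mp in missing:
        # Rows are never updated or deleted, so a cached id stays valid.
        conn.execute(
            "INSERT OR IGNORE INTO metapointers (language, version, key) "
            "VALUES (?, ?, ?)",
            mp,
        )
        row = conn.execute(
            "SELECT id FROM metapointers "
            "WHERE language = ? AND version = ? AND key = ?",
            mp,
        ).fetchone()
        cache[mp] = row[0]
    return {mp: cache[mp] for mp in metapointers}

def insert_nodes(conn, cache, nodes):
    """Insert (node_id, classifier_metapointer) pairs using metapointer ids."""
    ids = resolve_ids(conn, cache, {mp for _, mp in nodes})
    conn.executemany(
        "INSERT INTO nodes (id, classifier) VALUES (?, ?)",
        [(node_id, ids[mp]) for node_id, mp in nodes],
    )

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
cache = {}
concept = ("my-language", "1", "MyConcept")
insert_nodes(conn, cache, [("n1", concept), ("n2", concept)])
```

Both nodes end up referring to the same row of the hypothetical `metapointers` table, and a second call with the same metapointers would be served from `cache` without querying that table.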

For retrieval operations we just use joins.
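As a rough illustration of retrieval through a join, the sketch below resolves each node's classifier id back to a full metapointer. It again assumes an in-memory SQLite database and a hypothetical schema and function name (`retrieve_nodes`); the actual queries in the repository may look quite different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical tables and sample data, for illustration only.
conn.executescript("""
CREATE TABLE metapointers (
    id INTEGER PRIMARY KEY, language TEXT, version TEXT, key TEXT
);
CREATE TABLE nodes (
    id TEXT PRIMARY KEY, classifier INTEGER REFERENCES metapointers(id)
);
INSERT INTO metapointers (id, language, version, key)
    VALUES (1, 'my-language', '1', 'MyConcept');
INSERT INTO nodes (id, classifier) VALUES ('n1', 1), ('n2', 1);
""")

def retrieve_nodes(conn):
    """Return nodes with their classifier expanded into a full metapointer."""
    rows = conn.execute("""
        SELECT n.id, mp.language, mp.version, mp.key
        FROM nodes AS n
        JOIN metapointers AS mp ON mp.id = n.classifier
        ORDER BY n.id
    """).fetchall()
    return [
        {"id": node_id,
         "classifier": {"language": lang, "version": ver, "key": key}}
        for node_id, lang, ver, key in rows
    ]

nodes = retrieve_nodes(conn)
```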

Performance comparison on normal store and retrieve

To measure performance, I ran an application based on normal store operations (not bulk imports). The application reads files, invokes a parser, stores the AST, and retrieves it to check that it is unchanged. Note that the application produces Kolasu trees and converts them to and from the LionWeb format. The application processes files until a certain threshold is met.

With a threshold of 250,000 nodes, the entire application ran in 828704 ms with the code in main and in 1006888 ms with this PR (121% w.r.t. main).

The total time spent on inserting ASTs is 684519 ms with the code in main and 864692 ms with this PR (126% w.r.t. main).

The total time spent on retrieving ASTs is 133676 ms with the code in main and 131838 ms with this PR (98% w.r.t. main).

So with this code we have more or less the same performance on retrieval and worse performance on insertion using the normal insert.

Performance comparison on bulk import

I then ran a variant of that application that only performs insertions and does not retrieve ASTs. It does that using the bulk import operation. It stores 500,000 nodes in batches of 100,000 nodes at a time, and it uses FlatBuffers as the format.

The run took 406349 ms with the code in main and 265839 ms with the code from this PR (65% w.r.t. main).

In this case, we have a clear gain in performance. If we consider that the application performs other operations in addition to the bulk imports, the pure improvement in the performance of bulk imports is probably a bit higher.

These “performance tests” are intended just to provide a rough indication, and they should not be considered precise measurements.
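For reference, the percentages quoted above can be recomputed from the raw timings. The quoted figures match when the ratio is truncated to a whole number (with rounding, the first and third would come out as 122% and 99%). The helper name `relative_percent` is just for illustration.

```python
# Timings in milliseconds, copied from the measurements above: (main, this PR).
timings = {
    "store and retrieve, total": (828704, 1006888),
    "store and retrieve, insertion": (684519, 864692),
    "store and retrieve, retrieval": (133676, 131838),
    "bulk import": (406349, 265839),
}

def relative_percent(main_ms, pr_ms):
    """Time with the PR as a whole-number percentage of main, truncated."""
    return pr_ms * 100 // main_ms

percentages = {name: relative_percent(*pair) for name, pair in timings.items()}
for name, pct in percentages.items():
    print(f"{name}: {pct}% w.r.t. main")
```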

ftomassetti commented 2 months ago

Some observations/questions:

  • There is only one meta-pointer table, combining classifier, property, containment and reference meta-pointers. Did you think about having separate tables for them all, or maybe an extra column to denote what kind of meta-pointer the row represents?

No, I did not think about that

This would make it possible, e.g., to check whether a meta-pointer is used (incorrectly) for different kinds of meta-pointers. Not even sure we ever need this, but it's an idea.

I wonder whether, during development, one could change a language so that a property becomes a containment without changing the language version, and whether this might be a problem. Also, do we prohibit using the same key for a classifier and a feature? These are very much corner cases we can ignore, I think. Honestly, I kept a single table because i) I did not think of having different tables and ii) I am lazy :D

  • Am I correct that the meta-pointer table never gets any row removed or updated, and that there are only insertions?

Yes

This does make sense, because meta-pointers can be temporarily not used. Also, I assume that the meta-pointer table might be pre-filled when we get to the point where we add language definitions to the repository.

Absolutely. Currently preloading the metapointers table is not required, but it would help. Right now we make a first call to get or insert metapointers in case we do not have them in our cache. However, if we already know the indexes of all the metapointers we need, that call is skipped. If we preload the metapointers table, we will still need to make the call to read the index of each metapointer from the DB, but no insert will be made.

  • The meta-pointer table does not participate in the history, because entries cannot be changed or deleted, so history is not useful. Is this correct?

Exactly

ftomassetti commented 2 months ago

I have updated the PR: when discussing the corresponding PR on LW Java, we agreed on changing some of the binary formats (splitting into two files, changing namespaces: nothing "substantial"), so I needed to reflect that here too.

ftomassetti commented 2 months ago

Thank you @joswarmer! I will then press the Merge button.