digitalbazaar / pyld

JSON-LD processor written in Python
https://json-ld.org/

`JsonLdProcessor._compare_rdf_triples()` is a massive performance hog in `parse_nquads` #169

Open RinkeHoekstra opened 2 years ago

RinkeHoekstra commented 2 years ago

When building a dataset from N-Quads, the JsonLdProcessor checks whether each triple is unique. It does so through a pairwise comparison against the triples already parsed, using `JsonLdProcessor._compare_rdf_triples()`.

This means that the number of triple comparisons grows quadratically with the size of the dataset (or at least, of the graph).
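To make the growth concrete, here is a rough sketch of the cost, assuming each new triple is checked with a linear scan over all triples already stored for its graph and that the file contains no duplicates (the worst case). The function names and the tuple stand-ins for triples are hypothetical, for illustration only; this is not pyld's code.

```python
def pairwise_comparisons(n: int) -> int:
    """Comparisons needed when triple i is checked against the i-1 earlier ones."""
    return n * (n - 1) // 2


def count_linear_scan_comparisons(n: int) -> int:
    """Simulate a linear-scan uniqueness check on n distinct triples and count comparisons."""
    seen = []
    comparisons = 0
    for i in range(n):
        triple = ("s", "p", i)  # stand-in for a parsed triple; all distinct
        unique = True
        for existing in seen:
            comparisons += 1
            if existing == triple:
                unique = False
                break
        if unique:
            seen.append(triple)
    return comparisons


# Roughly the size of the file mentioned in this issue: 14,000 triples in one graph.
print(pairwise_comparisons(14_000))  # 97993000
```

Under these assumptions, 14k triples in a single graph need close to 98 million comparisons, and doubling the file size roughly quadruples that number.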

https://github.com/digitalbazaar/pyld/blob/316fbc2c9e25b3cf718b4ee189012a64b91f17e7/lib/pyld/jsonld.py#L1634

To give some metrics: for a 14k-line N-Quads file, all in a single graph, parsing time on my M1 Mac drops from 18.8s with the comparison to 0.7s without it.

Given the limited occurrence and impact of duplicate triples/quads in N-Quads files, this is really way too expensive.

At the very least, the parser could build an index (a hash map, i.e. a Python dict or set) to speed up this comparison. But given that the JSON-LD builder that usually follows this step does this too, the comparison could be dropped entirely.
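A minimal sketch of the index-based option, assuming triples are nested dicts with `subject`, `predicate` and `object` terms that each carry `type`, `value` and optionally `datatype` and `language`. The helpers `triple_key` and `add_unique` are hypothetical names and not part of pyld.

```python
from typing import Any, Dict, Hashable, List, Set

Triple = Dict[str, Dict[str, Any]]


def triple_key(triple: Triple) -> Hashable:
    """Turn a nested triple dict into a hashable tuple (assumed term layout)."""
    def term_key(term: Dict[str, Any]) -> tuple:
        return (
            term.get("type"),
            term.get("value"),
            term.get("datatype"),
            term.get("language"),
        )
    return tuple(term_key(triple[part]) for part in ("subject", "predicate", "object"))


def add_unique(triples: List[Triple], seen: Set[Hashable], triple: Triple) -> bool:
    """Append triple only if its key is new; one average O(1) set lookup per triple."""
    key = triple_key(triple)
    if key in seen:
        return False
    seen.add(key)
    triples.append(triple)
    return True


# Usage: one (list, set) pair per graph.
graph: List[Triple] = []
index: Set[Hashable] = set()
t = {
    "subject": {"type": "IRI", "value": "http://example.org/s"},
    "predicate": {"type": "IRI", "value": "http://example.org/p"},
    "object": {"type": "literal", "value": "o", "datatype": "http://www.w3.org/2001/XMLSchema#string"},
}
add_unique(graph, index, t)
add_unique(graph, index, dict(t))  # duplicate, skipped
```

With a set per graph, the uniqueness check costs one hash lookup per triple instead of a scan over everything parsed so far, at the price of keeping the keys in memory.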