When building a dataset from N-Quads, the JsonLdProcessor checks for every triple whether it is unique. This is done through a pairwise comparison in JsonLdProcessor._compare_rdf_triples().
This means that the number of comparisons grows quadratically with the size of the dataset (or at least, of the graph).
To give some metrics: for a 14k-line N-Quads file, all in a single graph, parsing on my M1 Mac takes 18.8s with the comparison and 0.7s without it.
Given the limited occurrence and impact of duplicate triples/quads in N-Quads files, this is really way too expensive.
At the very least, the parser could build an index (HashMap or dict) to speed up this comparison; but given that the JSON-LD builder that usually follows this step deduplicates as well, the comparison could be dropped altogether.
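To illustrate the index idea, here is a minimal sketch of set-based deduplication for the triples of a single graph. It is not pyld's code: the helper names (`term_key`, `triple_key`, `dedupe_triples`) are hypothetical, and the triple layout (dicts with `subject`, `predicate` and `object` terms, each carrying `type`, `value` and optionally `datatype` and `language`) is an assumption about what the parser produces.

```python
def term_key(term):
    """Turn one RDF term (assumed to be a dict) into a hashable tuple."""
    return (
        term.get("type"),
        term.get("value"),
        term.get("datatype"),   # only present on literals (assumption)
        term.get("language"),   # only present on language-tagged literals
    )


def triple_key(triple):
    """Build a hashable key covering subject, predicate and object."""
    return tuple(
        term_key(triple[part]) for part in ("subject", "predicate", "object")
    )


def dedupe_triples(triples):
    """Return the triples in first-seen order, skipping exact repeats.

    Each triple costs one set lookup on average, so the whole pass is
    roughly linear in the number of triples rather than quadratic.
    """
    seen = set()
    unique = []
    for triple in triples:
        key = triple_key(triple)
        if key not in seen:
            seen.add(key)
            unique.append(triple)
    return unique
```

With something along these lines, the uniqueness check for each incoming triple becomes a single hash lookup instead of a scan over every triple already in the graph.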
https://github.com/digitalbazaar/pyld/blob/316fbc2c9e25b3cf718b4ee189012a64b91f17e7/lib/pyld/jsonld.py#L1634