Open Markus92 opened 1 week ago
This PR implements a label resolver in the harvester import stage to resolve URIs and find translations during data harvesting. The implementation resolves labels for a fixed set of fields using the SKOS prefLabel, RDF Schema label (rdfs:label), and Schema.org name properties. The code includes optimizations to prevent excessive database queries by filtering duplicates and only querying unresolved labels.
sequenceDiagram
participant Harvester
participant LabelResolver
participant ExternalURI
participant Database
Harvester->>LabelResolver: Request label resolution for URIs
LabelResolver->>ExternalURI: Fetch RDF data for each URI
ExternalURI-->>LabelResolver: Return RDF data
LabelResolver->>LabelResolver: Extract labels (SKOS prefLabel, RDFS label, Schema.org name)
LabelResolver->>Database: Check for unresolved labels
Database-->>LabelResolver: Return unresolved labels
LabelResolver->>Database: Update database with resolved labels
LabelResolver-->>Harvester: Return resolved labels
classDiagram
class resolvable_label_resolver {
+Graph label_graph
+literal_dict_from_graph(subject: str | URIRef) dict
+load_graph(uri: str | URIRef, empty_graph: bool = False) Graph
+load_and_translate_uri(subject_uri: str | URIRef) list[dict[str, str]]
}
class resolve_labels {
+resolve_labels(package_dict: dict) int
}
class get_list_unresolved_terms {
+get_list_unresolved_terms(terms: list[str], languages=RESOLVE_LANGUAGES) list[str]
}
resolvable_label_resolver --> resolve_labels : uses
resolvable_label_resolver --> get_list_unresolved_terms : uses
Change | Details | Files |
---|---|---|
Implemented a new label resolver system | | ckanext/fairdatapoint/resolver.py |
Modified harvester to integrate label resolution | | ckanext/fairdatapoint/harvesters/civity_harvester.py, ckanext/fairdatapoint/harvesters/fair_data_point_civity_harvester.py |
Added comprehensive test coverage | | ckanext/fairdatapoint/tests/test_resolver.py, ckanext/fairdatapoint/tests/test_data/fdp_profile.ttl, ckanext/fairdatapoint/tests/test_data/wikidata_data_catalog_entry.ttl |
Code quality improvements | | ckanext/fairdatapoint/profiles.py, ckanext/fairdatapoint/plugin.py, ckanext/fairdatapoint/processors.py, setup.py |
@sourcery-ai review
🚀 Pull Request Checklist
Title:
[ ] A brief, descriptive title for the changes.
Description:
This PR implements a label resolver in the harvester import stage.
The scope is quite limited for now: it attempts to resolve any URI it finds in the six fields that are translated in the gdi userportal extension plugin (the fields are hardcoded for now, same as in the other plugin). If the URI resolves and returns something RDFlib can understand (e.g. if the URI exists and supports content negotiation), it digs through the graph to find translations. Currently three properties are understood: SKOS prefLabel, RDF Schema label (rdfs:label), and Schema.org name.
There is some logic in there to prevent too much hammering of a database, namely only unresolved labels are queried and duplicates in the same dataset are filtered out. However, if multiple datasets refer to the same label, it has no problems hammering the external URI every single time on every single (re-)harvest.
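The two safeguards described above (dropping duplicates within a dataset and only looking up labels that are still unresolved) can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's `get_list_unresolved_terms`: the helper names are hypothetical, and an in-memory set of `(term, language)` pairs stands in for the database lookup.

```python
from typing import Iterable


def unique_terms(terms: Iterable[str]) -> list[str]:
    """Drop duplicate terms while keeping first-seen order."""
    return list(dict.fromkeys(terms))


def filter_unresolved(
    terms: Iterable[str],
    known: set[tuple[str, str]],
    languages: Iterable[str],
) -> list[str]:
    """Return terms lacking a stored label in at least one language.

    `known` is a stand-in for the database: a set of (term, language)
    pairs that already have a translation (assumption for this sketch).
    """
    wanted = tuple(languages)
    return [
        term
        for term in unique_terms(terms)
        if any((term, lang) not in known for lang in wanted)
    ]


terms = ["http://a", "http://b", "http://a", "http://c"]
known = {("http://a", "en"), ("http://a", "nl"), ("http://b", "en")}
unresolved = filter_unresolved(terms, known, ("en", "nl"))
```

Here `http://a` is skipped because both languages are already stored, while `http://b` (missing `nl`) and `http://c` (missing everything) would still be fetched. Note that this only limits database queries and per-dataset duplicates; it does nothing about repeated requests to the same external URI across datasets or re-harvests.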
Context: Right now, labels are very much hardcoded.
Changes: There is a separate commit in here that does a whole bunch of linting. My IDE does it automatically and it is time to conform to PEP8. All functional changes are in separate commits.
Testing: Unit testing to be implemented (that's why this PR is a draft). Testing performed against a few FDPs.
Screenshots (if applicable): See front-end.
Additional Information: N/A
Checklist:
[X] I have checked that my code adheres to the project's style guidelines and that my code is well-commented.
[X] I have performed self-review of my own code and corrected any misspellings.
[X] I have made corresponding changes to the documentation (if applicable).
[X] My changes generate no new warnings or errors.
[ ] TBD I have added tests that prove my fix is effective or that my feature works.
[ ] TBD New and existing unit tests pass locally with my changes (existing tests pass)
Summary by Sourcery
Implement a label resolver in the harvester import stage to resolve URIs and find translations using the SKOS prefLabel, RDFS label, and Schema.org name properties. Prevent excessive database queries by querying only unresolved labels and filtering out duplicates. Add unit tests to verify the new functionality.
New Features:
Enhancements:
Tests: