GenomicDataInfrastructure / gdi-userportal-ckanext-fairdatapoint

Resolve labels during harvesting #83

Open Markus92 opened 1 week ago

Markus92 commented 1 week ago

🚀 Pull Request Checklist

This PR implements a label resolver in the harvester import stage.

The scope is deliberately limited for now: the resolver tries to dereference every URI it finds in the six fields that the GDI user portal extension plugin translates (the fields are hardcoded for now, as in the other plugin). If the request succeeds and returns something RDFLib can parse (for example, the URI exists and supports content negotiation), the resolver searches the resulting graph for translations. Three properties are currently understood: SKOS prefLabel, RDFS label (rdfs:label), and Schema.org name.
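
As a rough illustration of the property lookup described above, here is a minimal sketch that works on plain tuples rather than an RDFLib graph, so it runs without any dependencies. The names `LabelTriple`, `labels_for`, and the output keys `term`, `language`, and `label` are hypothetical and not taken from this PR; the priority order (SKOS first, then RDFS, then Schema.org) and the `https://schema.org/` namespace variant are also assumptions.

```python
from typing import NamedTuple, Optional, Sequence

# Predicate IRIs for the three label properties mentioned in the PR.
SKOS_PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
# Assumption: some sources publish http://schema.org/name instead.
SCHEMA_NAME = "https://schema.org/name"

# Assumed priority order; the PR does not state one.
LABEL_PREDICATES = (SKOS_PREF_LABEL, RDFS_LABEL, SCHEMA_NAME)


class LabelTriple(NamedTuple):
    """Hypothetical stand-in for one (subject, predicate, literal) triple."""
    subject: str
    predicate: str
    text: str
    language: Optional[str]


def labels_for(
    subject: str,
    triples: Sequence[LabelTriple],
    languages: Sequence[str] = ("en", "nl"),
) -> list[dict[str, str]]:
    """Pick one label per requested language, trying predicates in order."""
    found: dict[str, str] = {}
    for predicate in LABEL_PREDICATES:
        for triple in triples:
            if triple.subject != subject or triple.predicate != predicate:
                continue
            # Keep the first hit per language; later predicates do not override it.
            if triple.language in languages and triple.language not in found:
                found[triple.language] = triple.text
    return [
        {"term": subject, "language": lang, "label": found[lang]}
        for lang in languages
        if lang in found
    ]
```

With real data the triples would come from the parsed graph; languages other than the requested ones are simply ignored.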

There is some logic to avoid hammering the database: only unresolved labels are queried, and duplicates within the same dataset are filtered out. However, if multiple datasets refer to the same label, the external URI is still fetched again for each of them, on every single (re-)harvest.
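
The filtering described above, plus one possible way to address the repeated fetches across datasets, could look roughly like the sketch below. Both `terms_to_resolve` and `CachingResolver` are hypothetical names; the in-memory cache is not part of this PR and is shown only as one option, assuming a single resolver instance lives for the whole harvest run.

```python
from typing import Callable, Iterable


def terms_to_resolve(terms: Iterable[str], already_resolved: set[str]) -> list[str]:
    """Drop terms that already have labels and duplicates, keeping first-seen order."""
    seen: set[str] = set()
    pending: list[str] = []
    for term in terms:
        if term in already_resolved or term in seen:
            continue
        seen.add(term)
        pending.append(term)
    return pending


class CachingResolver:
    """Hypothetical wrapper that fetches each URI at most once per instance."""

    def __init__(self, fetch: Callable[[str], list[dict[str, str]]]) -> None:
        self._fetch = fetch  # function that actually dereferences the URI
        self._cache: dict[str, list[dict[str, str]]] = {}

    def resolve(self, uri: str) -> list[dict[str, str]]:
        # Only hit the external URI when this instance has not seen it yet.
        if uri not in self._cache:
            self._cache[uri] = self._fetch(uri)
        return self._cache[uri]
```

A cache like this only helps within one run; avoiding refetches across harvests would need the results persisted somewhere, which is outside what this sketch covers.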

Summary by Sourcery

Implement a label resolver in the harvester import stage that dereferences URIs and extracts translations from specific label properties. Limit database queries by looking up only unresolved labels and filtering out duplicates. Add unit tests to verify the new functionality.

sourcery-ai[bot] commented 1 week ago

Reviewer's Guide by Sourcery

This PR implements a label resolver in the harvester import stage to dereference URIs and find translations during data harvesting. The implementation resolves labels for specific fields using the SKOS prefLabel, RDFS label (rdfs:label), and Schema.org name properties. To prevent excessive database queries, the code filters out duplicates and queries only unresolved labels.

Sequence diagram for label resolution in harvester import stage

sequenceDiagram
    participant Harvester
    participant LabelResolver
    participant ExternalURI
    participant Database

    Harvester->>LabelResolver: Request label resolution for URIs
    LabelResolver->>ExternalURI: Fetch RDF data for each URI
    ExternalURI-->>LabelResolver: Return RDF data
    LabelResolver->>LabelResolver: Extract labels (SKOS prefLabel, RDFS label, Schema.org name)
    LabelResolver->>Database: Check for unresolved labels
    Database-->>LabelResolver: Return unresolved labels
    LabelResolver->>Database: Update database with resolved labels
    LabelResolver-->>Harvester: Return resolved labels

Class diagram for label resolver implementation

classDiagram
    class resolvable_label_resolver {
        +Graph label_graph
        +literal_dict_from_graph(subject: str | URIRef) dict
        +load_graph(uri: str | URIRef, empty_graph: bool = False) Graph
        +load_and_translate_uri(subject_uri: str | URIRef) list[dict[str, str]]
    }
    class resolve_labels {
        +resolve_labels(package_dict: dict) int
    }
    class get_list_unresolved_terms {
        +get_list_unresolved_terms(terms: list[str], languages=RESOLVE_LANGUAGES) list[str]
    }
    resolvable_label_resolver --> resolve_labels : uses
    resolvable_label_resolver --> get_list_unresolved_terms : uses

File-Level Changes

| Change | Details | Files |
| --- | --- | --- |
| Implemented a new label resolver system | • Created a new resolver.py module with label resolution functionality<br>• Added support for SKOS prefLabel, RDFS label, and Schema.org name properties<br>• Implemented caching to prevent redundant queries for already resolved labels<br>• Added support for multiple languages (currently hardcoded to 'en' and 'nl') | ckanext/fairdatapoint/resolver.py |
| Modified harvester to integrate label resolution | • Added label resolution call in the import stage<br>• Added configuration options for controlling label resolution<br>• Updated harvester to handle resolved labels during package updates | ckanext/fairdatapoint/harvesters/civity_harvester.py<br>ckanext/fairdatapoint/harvesters/fair_data_point_civity_harvester.py |
| Added comprehensive test coverage | • Created test cases for the label resolver functionality<br>• Added tests for URI validation and processing<br>• Added tests for translation handling and database updates | ckanext/fairdatapoint/tests/test_resolver.py<br>ckanext/fairdatapoint/tests/test_data/fdp_profile.ttl<br>ckanext/fairdatapoint/tests/test_data/wikidata_data_catalog_entry.ttl |
| Code quality improvements | • Applied PEP8 formatting across multiple files<br>• Reorganized imports for better readability<br>• Updated docstrings and comments | ckanext/fairdatapoint/profiles.py<br>ckanext/fairdatapoint/plugin.py<br>ckanext/fairdatapoint/processors.py<br>setup.py |

Markus92 commented 5 days ago

@sourcery-ai review