RDFLib / prez

Prez is a data-configurable Linked Data API framework that delivers profiles of Knowledge Graph data according to the Content Negotiation by Profile standard.
BSD 3-Clause "New" or "Revised" License

Make vowel removal optional #287

Open KoalaGeo opened 1 week ago

KoalaGeo commented 1 week ago

Would it be possible to make the vowel removal in the prefix generation optional, using a remove_vowels variable (default = true) which should be set in the config?

I believe this could be achieved like this:

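from urllib.parse import urlparse

# valid_prefix, prefix_registered and prefix_graph are assumed to be defined
# in the surrounding module.


def generate_new_prefix(uri, remove_vowels=True):
    """
    Generates a new prefix for a uri and binds it in prefix_graph.
    If remove_vowels is False, vowels are kept in the generated prefix.
    """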
    parsed_url = urlparse(uri)
    if bool(parsed_url.fragment):
        ns = f"{parsed_url.scheme}://{parsed_url.netloc}{parsed_url.path}#"
    else:
        ns = f'{parsed_url.scheme}://{parsed_url.netloc}{parsed_url.path.rsplit("/", 1)[0]}/'

    split_prefix_path = ns[:-1].rsplit("/", 1)
    if len(split_prefix_path) > 1:
        to_generate_prefix_from = split_prefix_path[-1].lower()
        # attempt to just use the last part of the path prior to the fragment or "identifier"
        if len(to_generate_prefix_from) <= 6:
            proposed_prefix = to_generate_prefix_from
            if not prefix_registered(proposed_prefix):
                prefix_graph.bind(proposed_prefix, ns)
                return

        if remove_vowels:
            # remove vowels (and punctuation) to reduce length
            proposed_prefix = "".join(
                c for c in to_generate_prefix_from if c not in "aeiou!@#$%^&*()_+-=,."
            )
        else:
            # use the original string without removing vowels
            proposed_prefix = to_generate_prefix_from

        if not valid_prefix(proposed_prefix):
            # if we still can't get a nice prefix, use an ugly but valid one using a hash of the IRI
            proposed_prefix = f"ns{hash(to_generate_prefix_from)}"
        if not prefix_registered(proposed_prefix):
            prefix_graph.bind(proposed_prefix, ns)
            return
    else:
        raise ValueError("Couldn't generate a prefix for the URI")

recalcitrantsupplant commented 1 week ago

Hi Edd, I'm happy to drop the vowel removal altogether (so just strip the characters !@#$%^&*()_+-=,.) rather than add another configuration option. Would that suit? Thoughts, @lalewis1?
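
The punctuation-only variant suggested in this comment could look roughly like the sketch below. It is a standalone illustration, not code from Prez or from any PR: candidate_prefix and PUNCTUATION_TO_STRIP are hypothetical names, and the validity and registration checks from the snippet above are left out.

```python
# Characters stripped from the candidate prefix; this is the set quoted in the
# comment above. Vowels are deliberately kept.
PUNCTUATION_TO_STRIP = "!@#$%^&*()_+-=,."


def candidate_prefix(path_segment: str) -> str:
    """Lower-case a URI path segment and drop punctuation, keeping vowels."""
    return "".join(
        c for c in path_segment.lower() if c not in PUNCTUATION_TO_STRIP
    )


print(candidate_prefix("NamedRockUnit"))      # namedrockunit
print(candidate_prefix("geo-features_v1.0"))  # geofeaturesv10
```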

My advice is always to specify the prefixes using the vann namespace, otherwise the prefixes could change when adding further data and restarting Prez.
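
As an illustration of declaring a prefix explicitly, the sketch below builds a Turtle snippet using the VANN terms vann:preferredNamespacePrefix and vann:preferredNamespaceUri. The helper name vann_declaration, the pairing of the lxcn prefix with this namespace URI, and the exact shape of the declaration are assumptions for the example; check the Prez documentation for the form it expects and where such a file should be loaded.

```python
def vann_declaration(prefix: str, namespace_uri: str) -> str:
    """Return a Turtle snippet declaring a preferred prefix for a namespace."""
    return (
        "@prefix vann: <http://purl.org/vocab/vann/> .\n"
        "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .\n"
        "\n"
        f"<{namespace_uri}>\n"
        f'    vann:preferredNamespacePrefix "{prefix}" ;\n'
        f'    vann:preferredNamespaceUri "{namespace_uri}"^^xsd:anyURI .\n'
    )


# Assumed example values: the prefix and namespace are illustrative only.
print(vann_declaration("lxcn", "http://data.bgs.ac.uk/id/Lexicon/"))
```

Because the prefix is stated in the data rather than derived from the URI, it would not depend on the order in which namespaces are encountered.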

KoalaGeo commented 1 week ago

Suits me!

Mainly we like our URL pattern and the prez:link doesn't really match!

  1. https://data.bgs.ac.uk/id/Lexicon/NamedRockUnit/S221 (this works in Prez, but we were surprised; we thought we had to use the prez:link URLs. Not sure if it's intentional behaviour?)
  2. https://data.bgs.ac.uk/object?uri=http://data.bgs.ac.uk/id/Lexicon/NamedRockUnit/S221
  3. https://data.bgs.ac.uk/v/vocab/lxcn:NamedRockUnit/NamedRockUnit:S221

Maybe this is fixed in Prez v4, and the CURIE can be Lexicon/NamedRockUnit?

recalcitrantsupplant commented 1 week ago

> https://data.bgs.ac.uk/id/Lexicon/NamedRockUnit/S221 (this works in prez, but we were surprised, thought we had to use the prez:link url's, not sure if its intentional behaviour?)

Is this perhaps a redirect on your side outside of Prez?

The v4 endpoints are configurable, so you can have something like:

/vocab/{vocabId}/concept/{conceptId}
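
To make the template concrete, here is a small sketch of how a path template of this shape maps URL segments to named identifiers. It uses only the standard library and is not how Prez implements routing; match_template and the example path values are made up for illustration.

```python
import re


def match_template(template: str, path: str):
    """Match a path against a '{name}'-style template.

    Returns the captured segments as a dict, or None if the path does not fit.
    """
    pattern_parts = []
    # re.split with a capture group alternates literal text and placeholder names
    for i, part in enumerate(re.split(r"\{(\w+)\}", template)):
        if i % 2:
            # placeholder: capture one path segment (no slashes)
            pattern_parts.append(f"(?P<{part}>[^/]+)")
        else:
            # literal text between placeholders
            pattern_parts.append(re.escape(part))
    match = re.fullmatch("".join(pattern_parts), path)
    return match.groupdict() if match else None


template = "/vocab/{vocabId}/concept/{conceptId}"
print(match_template(template, "/vocab/Lexicon/concept/S221"))
# {'vocabId': 'Lexicon', 'conceptId': 'S221'}
print(match_template(template, "/vocab/Lexicon"))
# None
```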

Is this the data here? https://github.com/BritishGeologicalSurvey/vocabularies

Happy to have a quick go running v4 with it

lalewis1 commented 1 week ago

I think it's OK to keep the vowels. I've just drafted a PR to do this (#289). It also extends the set of punctuation characters to remove.

KoalaGeo commented 1 week ago

> > https://data.bgs.ac.uk/id/Lexicon/NamedRockUnit/S221 (this works in prez, but we were surprised, thought we had to use the prez:link url's, not sure if its intentional behaviour?)
>
> Is this perhaps a redirect on your side outside of Prez?
>
> The v4 endpoints are configurable, so you can have something like:
>
> /vocab/{vocabId}/concept/{conceptId}
>
> Is this the data here? https://github.com/BritishGeologicalSurvey/vocabularies
>
> Happy to have a quick go running v4 with it

That'd be great if you could check, thank you, and yes, that's our data to load.

We have a number of redirects, but that's not one of them.