diachron / quality

Dataset Quality Assessment (part of WP5 of the Diachron EU FP7 project)
MIT License

Check literals beyond XSD built-ins for well-formedness #17

Open clange opened 10 years ago

clange commented 10 years ago

In https://jena.apache.org/documentation/notes/typed-literals.html (@nfriesen, thanks for pointing out this helpful guide!), I think the following sections will help us to get beyond built-in XSD data types such as numbers or dates:

  1. User defined XSD data types (i.e. those that you can conveniently define in the XML Schema language, and load from an XML Schema document)
  2. User defined non-XSD data types (entirely custom data types)

I'm initially assigning this issue to you. Later on, you may want to split it into more specific per-datatype issues assigned to Ali.

Once we know how to handle built-in XSD data types, such as numbers or dates, we are planning to proceed to things like percentages, ISBNs, email addresses, etc. These can be defined in XML Schema as restrictions of base types, such as numbers within a range (integer percentage = integer between 0 and 100), or strings that match regular expressions (an easy way to handle email addresses or credit card numbers or to approximate ISBNs).
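To make the restriction idea concrete, here is a rough sketch in plain Python rather than XML Schema or Jena. The function names and the email pattern are hypothetical and deliberately loose; they only illustrate a range restriction on an integer and a pattern restriction on a string, not what the project will actually ship.

```python
import re

# Assumption: a deliberately loose "something@something.tld" pattern,
# not a full RFC 5322 address grammar.
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

# Lexical form of an integer: optional sign followed by ASCII digits.
INTEGER_PATTERN = re.compile(r"[+-]?[0-9]+")


def is_integer_percentage(lexical: str) -> bool:
    """Mimic an integer restricted to the inclusive range 0..100."""
    if not INTEGER_PATTERN.fullmatch(lexical):
        return False
    return 0 <= int(lexical) <= 100


def is_email_like(lexical: str) -> bool:
    """Mimic a string restricted by a regular-expression pattern."""
    # fullmatch is used because XSD pattern facets must match the whole value.
    return EMAIL_PATTERN.fullmatch(lexical) is not None
```

For example, `is_integer_percentage("42")` is accepted while `is_integer_percentage("101")` and `is_integer_percentage("4.2")` are rejected; `is_email_like("a@example.org")` passes the pattern, which says nothing about whether the mailbox exists.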

A more thorough check of things like ISBNs, or possibly email addresses, requires further work. The last digit of an ISBN is a check digit, which needs to be computed from the other digits. Similarly, if we wanted to validate email addresses by checking whether the server responds to pings, that would also require a custom implementation.
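As an illustration of the kind of custom check meant here, the following Python sketch verifies the ISBN check digit for both ISBN-10 and ISBN-13. The function names are hypothetical, and this is only a sketch of the checksum arithmetic, not a proposal for how it would be wired into a Jena datatype.

```python
ASCII_DIGITS = "0123456789"


def _all_digits(chars: str) -> bool:
    """True if every character is an ASCII digit (and there is at least one)."""
    return len(chars) > 0 and all(c in ASCII_DIGITS for c in chars)


def _isbn10_ok(chars: str) -> bool:
    body, check = chars[:9], chars[9]
    if not _all_digits(body) or not (check in ASCII_DIGITS or check == "X"):
        return False
    # Weights run 10, 9, ..., 2 over the first nine digits; the check digit
    # has weight 1, and "X" stands for the value 10.
    total = sum((10 - i) * int(c) for i, c in enumerate(body))
    total += 10 if check == "X" else int(check)
    return total % 11 == 0


def _isbn13_ok(chars: str) -> bool:
    if not _all_digits(chars):
        return False
    # Weights alternate 1, 3, 1, 3, ... over all thirteen digits.
    total = sum(int(c) * (3 if i % 2 else 1) for i, c in enumerate(chars))
    return total % 10 == 0


def is_valid_isbn(text: str) -> bool:
    """Check the ISBN-10 or ISBN-13 check digit, ignoring hyphens and spaces."""
    chars = text.replace("-", "").replace(" ", "").upper()
    if len(chars) == 10:
        return _isbn10_ok(chars)
    if len(chars) == 13:
        return _isbn13_ok(chars)
    return False
```

With this sketch, `is_valid_isbn("0-306-40615-2")` and `is_valid_isbn("978-0-306-40615-7")` are accepted, while changing the last digit of either makes the check fail.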

jerdeb commented 10 years ago

Is this issue still pending or can we close it?