Closed Mikanebu closed 6 years ago
http:/datahub.io uri
The difference between the JS and Python versions of tableschema when validating a URI is the library each one uses: the JS version uses "validator": "^9.2.0" from npm, AND the Python version uses the rfc3986 validator.
When I went deeper I realized that there is an ANCIENT & GREAT HOLY WAR called:
Accept or not Accept a URI with fewer or more than two // separators:
https://daniel.haxx.se/blog/2016/05/11/my-url-isnt-your-url/
The WHATWG spec says it has to be one slash and that a parser must accept an indefinite amount of slashes. “http:/example.com” and “http:////////////////////////////////////example.com” are both equally fine. RFC 3986 and many others would disagree.
Also, most browsers accept the http:/datahub.io uri - I have just checked.
So the PyPI implementation of rfc3986 DOES accept this kind of invalid URI. And I doubt that @roll will accept any PR where we implement a URI validator ourselves just to fix this particular edge case.
Also, the URI validating REGEX is so f**king complex:
/
# protocol user host-ip port path path path querystring fragment
^
#protocol
(?:(?<scheme>[a-zA-Z][a-zA-Z\d+-.]*):)?
(?:
(?:
(?:
\/\/
(?:
#userinfo
(?:((?:[a-zA-Z\d\-._~\!$&'()*+,;=%]*)(?::(?:[a-zA-Z\d\-._~\!$&'()*+,;=:%]*))?)@)?
#host-ip
((?:[a-zA-Z\d-.%]+)|(?:\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})|(?:\[(?:[a-fA-F\d.:]+)\]))?
#port
(?::(\d*))?
)
)
#slash-path
(
(?:\/[a-zA-Z\d\-._~\!$&'()*+,;=:@%]*)*
)
)
#slash-path
|(\/(?:(?:[a-zA-Z\d\-._~\!$&'()*+,;=:@%]+(?:\/[a-zA-Z\d\-._~\!$&'()*+,;=:@%]*)*))?)
#path
|([a-zA-Z\d\-._~\!$&'()*+,;=:@%]+(?:\/[a-zA-Z\d\-._~\!$&'()*+,;=:@%]*)*)
)?
#querystring
(?:\?([a-zA-Z\d\-._~\!$&'()*+,;=:@%\/?]*))?
#fragment
(?:\#([a-zA-Z\d\-._~\!$&'()*+,;=:@%\/?]*))?
$
/x
So I'm leaving this URI validator as it is.
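For illustration, here is a minimal sketch of why a syntax-only validator lets this through. Under the RFC 3986 grammar, http:/datahub.io is a scheme followed by an absolute path, with no authority (host) part at all. The sketch uses Python's standard urllib.parse rather than the rfc3986 package, and has_authority is a hypothetical helper, not something either tableschema version provides.

```python
from urllib.parse import urlsplit


def has_authority(uri: str) -> bool:
    """Hypothetical stricter check: require a scheme AND a non-empty host part."""
    parts = urlsplit(uri)
    return bool(parts.scheme) and bool(parts.netloc)


for uri in ("http:/datahub.io", "http://datahub.io"):
    parts = urlsplit(uri)
    # With a single slash the host name ends up in the path, not in netloc.
    print(uri, "scheme:", repr(parts.scheme), "netloc:", repr(parts.netloc),
          "path:", repr(parts.path), "has_authority:", has_authority(uri))
```

A check like this would reject the single-slash form, but it would also reject legitimate authority-less URIs such as mailto: addresses, which is one reason not to bolt it onto a general URI validator.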
tableschema-py validator accepts datetime 2018/01/02T00:00:00 format: Any
The same situation here - the py and js versions of tableschema use different libs to validate datetime.
This 2018/01/02T00:00:00 is a valid datetime for the parser used in Python, but moment in JS does not recognize its format:
>>> from dateutil.parser import parse
>>> parse("2018/01/02T00:00:00")
datetime.datetime(2018, 1, 2, 0, 0)
> moment = require('moment')
> moment("2018/01/02T00:00:00")
Deprecation warning: value provided is not in a recognized RFC2822 or ISO format. moment construction falls back to js Date(), which is not reliable across all browsers and versions.
So my opinion - we should not restrict tableschema-py only because tableschema-js could not recognize some valid dates.
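To make the disagreement concrete, here is a rough sketch of how a strict and a lenient parser can differ on the same value. It uses only the standard library instead of dateutil or moment, and the two format lists are assumptions chosen for illustration, not the rules either library actually applies.

```python
from datetime import datetime

# Assumed format lists, for illustration only.
STRICT_FORMATS = ("%Y-%m-%dT%H:%M:%S",)
LENIENT_FORMATS = STRICT_FORMATS + ("%Y/%m/%dT%H:%M:%S",)


def parse_datetime(value, formats):
    """Return the first successful parse, or None if no format matches."""
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    return None


value = "2018/01/02T00:00:00"
print(parse_datetime(value, STRICT_FORMATS))   # None
print(parse_datetime(value, LENIENT_FORMATS))  # 2018-01-02 00:00:00
```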
FIXED:
The JS version of tableschema gives some validation errors, while the PY version doesn't:
- 00:01 is not Time
- "2018/01/02T00:00:00" is not datetime with format 'Any'
- http:/datahub.io is not correct URI (I agree with the last one, but the developers of the rfc3986 validator lib for Python do not agree)
So, everything is good with our Pipeline.
About 00:01, either:
- 00:01 is a valid time format - then we should fix tableschema-js (and data validate) instead; or
- 00:01 is an invalid time format - then we should validate a dataset prior to pushing, because the tableschema-py authors will not accept such a change anyway.
And I'm going to change the corresponding QA test.
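As a small sketch of how the two readings of 00:01 can arise: ISO 8601 allows a reduced-precision hh:mm time, and Python's standard time.fromisoformat accepts it, while a check that insists on hh:mm:ss rejects it. Both helper names below are hypothetical, and neither is claimed to be what tableschema-py or tableschema-js does internally.

```python
from datetime import datetime, time


def is_time_iso(value: str) -> bool:
    """Lenient reading: anything time.fromisoformat accepts (includes HH:MM)."""
    try:
        time.fromisoformat(value)
        return True
    except ValueError:
        return False


def is_time_hms(value: str) -> bool:
    """Strict reading (assumed): seconds are mandatory, i.e. HH:MM:SS only."""
    try:
        datetime.strptime(value, "%H:%M:%S")
        return True
    except ValueError:
        return False


for value in ("00:01", "00:01:00"):
    print(value, "iso:", is_time_iso(value), "hms:", is_time_hms(value))
```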
Reported here: datahq/datahub-qa#68
We want to improve the showcase page so that it shows error messages as per the readme of the dp. E.g., when pushing the following datapackages, some errors are not caught in the frontend:
https://github.com/frictionlessdata/test-data/tree/master/packages/types-formats-and-constraints
string - The value “https:/domain.com” in column “uri” is not type “string” and format “uri”
datetime - The value “2018/01/02T00:00:00” in column “any” is not type “datetime” and format “any”
time - The value “00:01” in column “any” is not type “time” and format “any”
Acceptance criteria
The dataset page should have error messages on the showcase page as per the readme of the dp:
string - The value “https:/domain.com” in column “uri” is not type “string” and format “uri”
datetime - The value “2018/01/02T00:00:00” in column “any” is not type “datetime” and format “any”
time - The value “00:01” in column “any” is not type “time” and format “any”
Tasks
tableschema-py
tableschema-py
Analysis
- data validate uses tableschema-js and finds all the errors
- tableschema-py does NOT find some of the errors
How to reproduce
- data validate and remember the errors
- data push and see the errors on the dataset page
string:
$ data validate test-data/packages/types-formats-and-constraints/string/
$ data push test-data/packages/types-formats-and-constraints/string/
datetime:
$ data validate test-data/packages/types-formats-and-constraints/datetime/
$ data push test-data/packages/types-formats-and-constraints/datetime/
time:
$ data validate test-data/packages/types-formats-and-constraints/time/
$ data push test-data/packages/types-formats-and-constraints/time/