@danmihaila @alexandru-m-g Validating this dataset is timing out because of the spell-check rule. With nearly 42,000 rows, it finds a large number of spelling outliers in the #group column and spends a long time trying to suggest corrections.
I suggest that we make the spellcheck rule more restrictive, and apply it only to hashtags that we know are likely to have highly repetitive values, e.g. #sector, #subsector, #adm
Refactor the spelling and numeric outlier tests to skip any column where the coefficient of variation is greater than 1.0.
This is a crude way to exclude columns (numeric or text) whose values are too variable for outlier detection to be meaningful. We can tune the threshold in future releases.
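A minimal sketch of what such a filter might look like, assuming the CV is computed over the values themselves for numeric columns and over value-occurrence counts for text columns (the text-column proxy is an assumption, not necessarily what libhxl-python actually does):

```python
import statistics
from collections import Counter

CV_THRESHOLD = 1.0  # threshold proposed in this issue; tunable in future releases

def coefficient_of_variation(numbers):
    """CV = standard deviation / |mean|; treated as infinite when the mean is 0."""
    mean = statistics.mean(numbers)
    if mean == 0:
        return float("inf")
    return statistics.stdev(numbers) / abs(mean)

def skip_outlier_checks(values):
    """Return True if a column is too variable for outlier detection.

    Numeric columns use the CV of the values themselves; for text columns
    this sketch uses the CV of value-occurrence counts as a rough proxy
    (an assumption, not necessarily libhxl-python's actual rule).
    """
    try:
        numbers = [float(v) for v in values]
    except ValueError:
        numbers = list(Counter(values).values())
    if len(numbers) < 2:
        return True  # too little data to judge variability
    return coefficient_of_variation(numbers) > CV_THRESHOLD
```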
Confirmed working in beta by direct POST:
```
$ time curl -X POST --data-urlencode "url=https://data.humdata.org/hxlproxy/data.csv?url=https%3A%2F%2Fapi.acleddata.com%2Facled%2Fread.csv%3Flimit%3D0%26iso%3D586&name=ACLEDHXL&tagger-match-all=on&tagger-02-header=iso&tagger-02-tag=%23country%2Bcode&tagger-03-header=event_id_cnty&tagger-03-tag=%23event%2Bcode&tagger-05-header=event_date&tagger-05-tag=%23date%2Boccurred+&tagger-08-header=event_type&tagger-08-tag=%23event%2Btype&tagger-09-header=actor1&tagger-09-tag=%23group%2Bname%2Bfirst&tagger-10-header=assoc_actor_1&tagger-10-tag=%23group%2Bname%2Bfirst%2Bassoc&tagger-12-header=actor2&tagger-12-tag=%23group%2Bname%2Bsecond&tagger-13-header=assoc_actor_2&tagger-13-tag=%23group%2Bname%2Bsecond%2Bassoc&tagger-16-header=region&tagger-16-tag=%23region%2Bname&tagger-17-header=country&tagger-17-tag=%23country%2Bname&tagger-18-header=admin1&tagger-18-tag=%23adm1%2Bname&tagger-19-header=admin2&tagger-19-tag=%23adm2%2Bname&tagger-20-header=admin3&tagger-20-tag=%23adm3%2Bname&tagger-21-header=location&tagger-21-tag=%23loc%2Bname&tagger-22-header=latitude&tagger-22-tag=%23geo%2Blat&tagger-23-header=longitude&tagger-23-tag=%23geo%2Blon&tagger-25-header=source&tagger-25-tag=%23meta%2Bsource&tagger-27-header=notes&tagger-27-tag=%23description&tagger-28-header=fatalities&tagger-28-tag=%23affected%2Bkilled&header-row=1" https://beta.proxy.hxlstandard.org/actions/validate
{
    "validator": "libhxl-python",
    "timestamp": "2018-05-29T18:35:12.564633",
    "is_valid": true,
    "stats": {
        "info": 0,
        "warning": 0,
        "error": 0,
        "total": 0
    },
    "issues": []
}

real    0m41.674s
user    0m0.041s
sys     0m0.020s
```
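For scripting, the same check could be run with Python's requests library. This is a sketch based only on the endpoint and form parameter shown in the curl command above; the long proxied data URL is abbreviated here with a placeholder:

```python
import requests

# The HXL Proxy data URL from the curl command above (abbreviated here).
DATA_URL = "https://data.humdata.org/hxlproxy/data.csv?url=..."

# Same endpoint the curl command POSTs to.
response = requests.post(
    "https://beta.proxy.hxlstandard.org/actions/validate",
    data={"url": DATA_URL},  # requests form-encodes this, like --data-urlencode
    timeout=120,             # generous client-side timeout; the run above took ~42s
)
report = response.json()
print(report["is_valid"], report["stats"])
```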
Not working in Data Check. Created a Jira issue: https://humanitarian.atlassian.net/browse/HDX-5934
I agree that we need better rules.
Confirmed that it was actually an upstream timeout in gunicorn. Fixed in HXLStandard/hxl-proxy#232
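Gunicorn's default worker timeout is 30 seconds, which the roughly 42-second validation run above would exceed. An illustrative gunicorn.conf.py showing how the limit can be raised (the actual value chosen in hxl-proxy#232 may differ):

```python
# gunicorn.conf.py -- illustrative settings only; see HXLStandard/hxl-proxy#232
# for the actual fix.

# Workers silent for longer than this many seconds are killed and restarted.
# The default of 30s is shorter than the ~42s validation run above.
timeout = 120
```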
Returns a 502 Bad Gateway error: 502 validation link
The data itself is parseable: original data link