HXLStandard / libhxl-python

Python support library for the Humanitarian Exchange Language (HXL) data standard.
The Unlicense

Validation failing for ACLED Pakistan data #161

Closed davidmegginson closed 6 years ago

davidmegginson commented 6 years ago

Returns a 502 gateway error: 502 validation link

Data itself is parseable: original data link

davidmegginson commented 6 years ago

Simpler version also failing: https://beta.proxy.hxlstandard.org/data/validate?url=https%3A%2F%2Fdocs.google.com%2Fspreadsheets%2Fd%2F1NPkAr6ELyV0TC_vnXD63P0Rkk6tkTETeg-HnZLQZhdA%2Fedit%23gid%3D1371949984

davidmegginson commented 6 years ago

@danmihaila @alexandru-m-g Validating this dataset is timing out because of spell check. With nearly 42,000 rows, it's finding a lot of spelling outliers in the #group column, and taking a lot of time to try to find corrections.

I suggest that we make the spellcheck rule more restrictive, and apply it only to hashtags that we know are likely to have highly repetitive values, e.g. #sector, #subsector, #adm+name, #modality, #beneficiary, #org.
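
A minimal sketch of what such an allow-list could look like, for illustration only. The function name `is_spellcheck_candidate` and the reading of `#adm+name` as "any `#admN` column carrying a `+name` attribute" are assumptions, not libhxl-python API:

```python
import re

# Hashtags named in the comment above as likely to hold repetitive values.
REPETITIVE_TAGS = {"#sector", "#subsector", "#modality", "#beneficiary", "#org"}

# Assumption: "#adm+name" is read as #adm, #adm1, #adm2, ... with +name.
ADM_TAG = re.compile(r"^#adm\d*$")


def is_spellcheck_candidate(hashtag_spec):
    """Return True if a column with this hashtag should be spell-checked."""
    parts = hashtag_spec.strip().lower().split("+")
    tag, attributes = parts[0], set(parts[1:])
    if tag in REPETITIVE_TAGS:
        return True
    return bool(ADM_TAG.match(tag)) and "name" in attributes
```

Under this sketch the `#group+name+first` columns that triggered the timeout would be skipped, while `#adm1+name` and `#org` would still be checked.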

davidmegginson commented 6 years ago

Refactor spelling and numeric outlier tests to skip any column where the coefficient of variation > 1.0.

This is a crude way to exclude columns (numerical or text) with too high variability for outlier detection. We can tweak this value in future releases.
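
For reference, the coefficient of variation is the standard deviation divided by the mean. The sketch below shows the threshold check for a numeric column only; the thread does not say how the value is derived for text columns, so that case is left out. The function names and the choice of population standard deviation are assumptions, not the library's actual code:

```python
import statistics

CV_THRESHOLD = 1.0  # value from the comment above; may be tuned later


def coefficient_of_variation(values):
    """Standard deviation divided by the absolute mean."""
    mean = statistics.mean(values)
    if mean == 0:
        # Undefined for a zero mean; treat as "too variable".
        return float("inf")
    # Assumption: population standard deviation.
    return statistics.pstdev(values) / abs(mean)


def should_check_outliers(values, threshold=CV_THRESHOLD):
    """Return False for columns too variable for outlier detection."""
    if len(values) < 2:
        return False
    return coefficient_of_variation(values) <= threshold
```

A tightly clustered column such as `[10, 11, 9, 10, 12]` stays eligible for outlier detection, while a widely spread one such as `[1, 1, 1, 1000]` is skipped.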

davidmegginson commented 6 years ago

Deployed to beta. Beta Proxy test link is now working, and wiki page updated.

Note that response time is still slow, as one would expect for a 42,000-row dataset, but it is no longer exponentially slow.

davidmegginson commented 6 years ago

Confirmed working in beta by direct POST:

$ time curl -X POST --data-urlencode "url=https://data.humdata.org/hxlproxy/data.csv?url=https%3A%2F%2Fapi.acleddata.com%2Facled%2Fread.csv%3Flimit%3D0%26iso%3D586&name=ACLEDHXL&tagger-match-all=on&tagger-02-header=iso&tagger-02-tag=%23country%2Bcode&tagger-03-header=event_id_cnty&tagger-03-tag=%23event%2Bcode&tagger-05-header=event_date&tagger-05-tag=%23date%2Boccurred+&tagger-08-header=event_type&tagger-08-tag=%23event%2Btype&tagger-09-header=actor1&tagger-09-tag=%23group%2Bname%2Bfirst&tagger-10-header=assoc_actor_1&tagger-10-tag=%23group%2Bname%2Bfirst%2Bassoc&tagger-12-header=actor2&tagger-12-tag=%23group%2Bname%2Bsecond&tagger-13-header=assoc_actor_2&tagger-13-tag=%23group%2Bname%2Bsecond%2Bassoc&tagger-16-header=region&tagger-16-tag=%23region%2Bname&tagger-17-header=country&tagger-17-tag=%23country%2Bname&tagger-18-header=admin1&tagger-18-tag=%23adm1%2Bname&tagger-19-header=admin2&tagger-19-tag=%23adm2%2Bname&tagger-20-header=admin3&tagger-20-tag=%23adm3%2Bname&tagger-21-header=location&tagger-21-tag=%23loc%2Bname&tagger-22-header=latitude&tagger-22-tag=%23geo%2Blat&tagger-23-header=longitude&tagger-23-tag=%23geo%2Blon&tagger-25-header=source&tagger-25-tag=%23meta%2Bsource&tagger-27-header=notes&tagger-27-tag=%23description&tagger-28-header=fatalities&tagger-28-tag=%23affected%2Bkilled&header-row=1" https://beta.proxy.hxlstandard.org/actions/validate
{
    "validator": "libhxl-python",
    "timestamp": "2018-05-29T18:35:12.564633",
    "is_valid": true,
    "stats": {
        "info": 0,
        "warning": 0,
        "error": 0,
        "total": 0
    },
    "issues": []
}
real    0m41.674s
user    0m0.041s
sys     0m0.020s

davidmegginson commented 6 years ago

Not working in Data Check. Created a Jira issue: https://humanitarian.atlassian.net/browse/HDX-5934

danmihaila commented 6 years ago

I agree that we need to have better rules.

davidmegginson commented 6 years ago

Confirmed that it was actually a timeout issue upstream, in gunicorn. Fixed in HXLStandard/hxl-proxy#232
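
For context, gunicorn's default worker timeout is 30 seconds, which is shorter than the 41 seconds the request above took. The thread does not say what the linked fix changed, so the following `gunicorn.conf.py` is only a sketch of one way to raise the limit, with illustrative values:

```python
# gunicorn.conf.py (hypothetical example, not the project's actual config)

# Seconds a worker may stay silent before gunicorn kills and restarts it.
# gunicorn's default is 30; the validation request above took about 42.
timeout = 120

# Seconds workers get to finish in-flight requests on restart or shutdown.
graceful_timeout = 30
```

gunicorn would pick this up when started with `gunicorn -c gunicorn.conf.py ...`.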