ckan / datapusher

A standalone web service that pushes data files from a CKAN site's resources into its DataStore
GNU Affero General Public License v3.0

Generic HTTPError in push_to_datastore #69

Closed antitoxic closed 7 years ago

antitoxic commented 9 years ago

At the national Bulgarian data portal we are getting this:

Job "push_to_datastore (trigger: RunTriggerNow, run = True, next run at: None)" raised an exception
Traceback (most recent call last):
  File "/ckan/virtualenv/lib/python2.7/site-packages/apscheduler/scheduler.py", line 512, in _run_job
    retval = job.func(*job.args, **job.kwargs)
  File "/ckan/virtualenv/src/datapusher/datapusher/jobs.py", line 387, in push_to_datastore
    records, api_key, ckan_url)
  File "/ckan/src/datapusher/datapusher/jobs.py", line 203, in send_resource_to_datastore
    check_response(r, url, 'CKAN DataStore')
  File "/ckan/virtualenv/src/datapusher/datapusher/jobs.py", line 137, in check_response
    request_url=request_url, response=response.text)
HTTPError

I can't find any other related issues. Is this a known bug?

mitio commented 9 years ago

Here are two instances of the same error, from the same system, with some more relevant log lines (grouped by the time of the event):

First one:

Fetching from: http://opendata.government.bg/dataset/dcea389f-ebb3-4bcd-aa7a-1987f28437df/resource/4b6937d0-b9fb-41fb-a4dc-2d6afc1121fe/download/2015UTF16LEwithBOM.csv
Deleting "4b6937d0-b9fb-41fb-a4dc-2d6afc1121fe" from datastore.
Determined headers and types: [{'type': u'text', 'id': u'\\u041d\\u0430\\u0446\\u0438\\u043e\\u043d\\u0430\\u043b\\u0435\\u043d \\u0440\\u0435\\u0433\\u0438\\u0441\\u0442\\u044a\\u0440 \\u043f\\u043e \\u0438\\u043d\\u0432\\u0430\\u0437\\u0438\\u0432\\u043d\\u0430 \\u043a\\u0430\\u0440\\u0434\\u0438\\u043e\\u043b\\u043e\\u0433\\u0438\\u044f'}]
Saving chunk 0
Job "push_to_datastore (trigger: RunTriggerNow, run = True, next run at: None)" raised an exception
Traceback (most recent call last):
  File "/ckan/virtualenv/lib/python2.7/site-packages/apscheduler/scheduler.py", line 512, in _run_job
    retval = job.func(*job.args, **job.kwargs)
  File "/ckan/virtualenv/src/datapusher/datapusher/jobs.py", line 387, in push_to_datastore
    records, api_key, ckan_url)
  File "/ckan/virtualenv/src/datapusher/datapusher/jobs.py", line 203, in send_resource_to_datastore
    check_response(r, url, 'CKAN DataStore')
  File "/ckan/virtualenv/src/datapusher/datapusher/jobs.py", line 137, in check_response
    request_url=request_url, response=response.text)
HTTPError

Second one:

Fetching from: http://opendata.government.bg/dataset/73244fd5-3648-4d59-a7a4-52a549aded24/resource/e208accb-ef0e-42d1-b282-02af7dc452da/download/NUTS2013BG.xls
Deleting "e208accb-ef0e-42d1-b282-02af7dc452da" from datastore.
Determined headers and types: [{'type': u'text', 'id': u'NUTS - \\u041a\\u043b\\u0430\\u0441\\u0438\\u0444\\u0438\\u043a\\u0430\\u0446\\u0438\\u044f\\u0442\\u0430 \\u043d\\u0430 \\u0442\\u0435\\u0440\\u0438\\u0442\\u043e\\u0440\\u0438\\u0430\\u043b\\u043d\\u0438\\u0442\\u0435 \\u0435\\u0434\\u0438\\u043d\\u0438\\u0446\\u0438 \\u0437\\u0430 \\u0441\\u0442\\u0430\\u0442\\u0438\\u0441\\u0442\\u0438\\u0447\\u0435\\u0441\\u043a\\u0438 \\u0446\\u0435\\u043b\\u0438 \\u0432 \\u0411\\u044a\\u043b\\u0433\\u0430\\u0440\\u0438\\u044f'}]
Saving chunk 0
Job "push_to_datastore (trigger: RunTriggerNow, run = True, next run at: None)" raised an exception
Traceback (most recent call last):
  File "/ckan/virtualenv/lib/python2.7/site-packages/apscheduler/scheduler.py", line 512, in _run_job
    retval = job.func(*job.args, **job.kwargs)
  File "/ckan/virtualenv/src/datapusher/datapusher/jobs.py", line 387, in push_to_datastore
    records, api_key, ckan_url)
  File "/ckan/virtualenv/src/datapusher/datapusher/jobs.py", line 203, in send_resource_to_datastore
    check_response(r, url, 'CKAN DataStore')
  File "/ckan/virtualenv/src/datapusher/datapusher/jobs.py", line 137, in check_response
    request_url=request_url, response=response.text)
HTTPError

It almost seems like the error is caused by a problem with parsing the uploaded file. Each error usually appears twice, with the second occurrence a few seconds after the first.

ykhadilkar commented 9 years ago

I am getting a similar error:

Error:
  File "/usr/lib/ckan/datapusher/lib/python2.7/site-packages/apscheduler/scheduler.py", line 512, in _run_job
    retval = job.func(*job.args, **job.kwargs)
  File "/usr/lib/ckan/datapusher/src/datapusher/datapusher/jobs.py", line 222, in push_to_datastore
    resource = get_resource(resource_id, ckan_url, api_key)
  File "/usr/lib/ckan/datapusher/src/datapusher/datapusher/jobs.py", line 182, in get_resource
    return r.json()['result']
  File "/usr/lib/ckan/datapusher/lib/python2.7/site-packages/requests/models.py", line 819, in json
    return json.loads(self.text, **kwargs)
  File "/usr/lib/python2.7/json/__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 384, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError('No JSON object could be decoded',)
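A ValueError like the one above is what `json` raises when the response body is not JSON at all, for example an HTML error page returned by a proxy or web server in front of CKAN. The following is a minimal sketch, not datapusher's actual code, of how such a response could be reported more usefully. `NotJSONResponse` and `parse_action_result` are hypothetical names introduced only for this illustration.

```python
import json


class NotJSONResponse(Exception):
    """Hypothetical error type for this sketch; not part of datapusher."""


def parse_action_result(status_code, body_text):
    """Return the 'result' field of a CKAN-style JSON reply.

    Raises NotJSONResponse with the HTTP status and the start of the
    body when the body cannot be decoded as JSON.
    """
    try:
        payload = json.loads(body_text)
    except ValueError:
        # Show only the beginning of the body so logs stay readable.
        snippet = body_text[:200]
        raise NotJSONResponse(
            "expected JSON but got HTTP %s, body starts with %r"
            % (status_code, snippet)
        )
    return payload["result"]
```

With this kind of guard, the log would show the status code and the first characters of the unexpected body, which usually makes it clear whether the request reached CKAN at all.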

CSUGMansoor commented 9 years ago

Has anyone found a resolution to the issue as posted by antitoxic and mitio? The other one seems a bit different. Thanks!

bunnis commented 9 years ago

I'm also having the same error. So far I believe the current requirements install a different APScheduler version than the one needed to avoid this error.

antitoxic commented 9 years ago

@bunnis what do you mean by a "different APScheduler than the one needed" ?

@amercader you seem to be the active committer on this repo. The national Bulgarian portal is currently supported by volunteers; we would appreciate some insight.

bunnis commented 9 years ago

@antitoxic my error goes as follows:

Job "push_to_datastore (trigger: RunTriggerNow, run = True, next run at: None)" raised an exception
Traceback (most recent call last):
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/apscheduler/scheduler.py", line 512, in _run_job
    retval = job.func(*job.args, **job.kwargs)

I just started using CKAN, and so far the documentation is really bad. I found that the recent version of datapusher requires ckanserviceprovider 0.0.5, which I couldn't find, so I'm using 0.0.4. Also, the CKAN requirements ask for SQLAlchemy 0.9.6, but version 0.7 gets installed; I had to correct that manually. Since the error comes from "apscheduler/scheduler.py", and "JOB_CONFIG='/home/bunny/datapusher/deployment/datapusher_settings.py python wsgi.py'" from the documentation doesn't work, I'm assuming this is where the error comes from: the wrong APScheduler version. BTW, to bypass JOB_CONFIG use "python datapusher/main.py deployment/datapusher_settings.py".

amercader commented 9 years ago

@antitoxic @bunnis @mitio et al, there seems to be a bit of confusion about what the actual issue is.

The errors in the first two comments (the ones that raise HTTPError) are likely caused by the DataPusher having trouble accessing or parsing the files.

This file requires a login to access, so datapusher cannot access it.

This file has some filters on the first row, which probably make datapusher choke.

I know it's not ideal, but I suggest looking at the original files for potential issues. Recent CKAN versions have slightly better error messages for this kind of stuff on the "DataStore" tab of the "Manage" section of a resource page. Which CKAN version are you on?

The APScheduler looks like a completely separate issue, so I suggest creating a new issue for it.
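As a rough way to follow the suggestion above and test whether a resource file can be downloaded without a login, one can request the URL anonymously and look at the status code and content type. This is only a heuristic sketch in Python 3 (the thread itself runs Python 2.7); `diagnose` and `fetch_anonymously` are hypothetical helper names, and an HTML response is merely a hint of a login page, not proof.

```python
from urllib.error import HTTPError
from urllib.request import urlopen


def diagnose(status, content_type):
    """Classify an anonymous download attempt from status and content type."""
    if status in (401, 403):
        return "requires authentication"
    if status >= 400:
        return "HTTP error %d" % status
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";")[0].strip().lower()
    if media_type == "text/html":
        return "got an HTML page, possibly a login form"
    return "looks downloadable"


def fetch_anonymously(url, timeout=30):
    """Request the URL with no credentials and classify the outcome."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return diagnose(resp.status, resp.headers.get("Content-Type", ""))
    except HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx responses.
        return diagnose(err.code, err.headers.get("Content-Type", ""))
```

Calling `fetch_anonymously` with a resource's download URL from the machine that runs datapusher shows roughly what datapusher itself would receive.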

bunnis commented 9 years ago

@amercader I'm using the latest stable version, 2.4.1 I believe. I'm following the documentation on the CKAN website on installing from source on Ubuntu 14.04.3. I started with basic CSV data. This is my CSV file (very small and straightforward):

nome,cenas,exemplo
pedro,1,dez
joao,40,um
nuno,90,nove

mitio commented 9 years ago

@amercader We're using CKAN 2.3 (ckan/ckan@f478c92e3d88844d9e417d68b171e03c7e040155) and Datapusher is ckan/ckan-datapusher@f30d02200e379f62c86a6f038ffefa33ec24571e (with minor changes on our end).

CSUGMansoor commented 9 years ago

It appears the issue for us was the version of Flask-Login. We are using 2.4.0 on RHEL7, installed from source, so after doing a requirements check/install we manually installed version 0.2.11 of Flask-Login:

 pip install 'Flask-Login==0.2.11'

Thanks all for reminding me to circle back and post an update.

ermueller commented 8 years ago

For me it shows the same error no matter what file or what type of file I'm trying to upload:

Job "push_to_datastore (trigger: RunTriggerNow, run = True, next run at: None)" raised an exception
Traceback (most recent call last):
  File "/usr/lib/ckan/datapusher/lib/python2.7/site-packages/apscheduler/scheduler.py", line 512, in _run_job
    retval = job.func(*job.args, **job.kwargs)
  File "/usr/lib/ckan/datapusher/src/datapusher/datapusher/jobs.py", line 285, in push_to_datastore
    resource = get_resource(resource_id, ckan_url, api_key)
  File "/usr/lib/ckan/datapusher/src/datapusher/datapusher/jobs.py", line 233, in get_resource
    check_response(r, url, 'CKAN')
  File "/usr/lib/ckan/datapusher/src/datapusher/datapusher/jobs.py", line 137, in check_response
    request_url=request_url, response=response.text)
HTTPError

I'm really at a loss, since this happens even if I install a clean CKAN from package and configure all the necessary datapusher options. Is there anything I could try as a workaround?

The datapusher itself seems to be responsive, since I can call localhost:8800 and it tells me about the help page like it should.

EDIT: I solved the problem for my current setup. I had changed the CKAN port to 80 and made a mistake in the apache2 port configuration. I had used an outdated apache2 ports.conf for reference, but then I looked at what actually comes with the packaged CKAN and noticed my error.

I'm just writing this here in case someone else has this problem. Sorry, I should have checked that more thoroughly.

davidread commented 7 years ago

@antitoxic @mitio From the trace it looks like the problem is the request datapusher makes to ckan to insert the data into the datastore. I should check that your configured ckan.site_url is correct and accessible.
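One hedged way to run the ckan.site_url check suggested above is to call CKAN's `status_show` action from the machine that runs datapusher. The sketch below is Python 3 and uses only the standard library; `status_url` and `site_is_reachable` are hypothetical helper names, and the example URL in the comment is a placeholder, not a value from this thread.

```python
import json
from urllib.request import urlopen


def status_url(site_url):
    """Build the status_show action URL for a configured ckan.site_url."""
    return site_url.rstrip("/") + "/api/3/action/status_show"


def site_is_reachable(site_url, timeout=10):
    """Return True if CKAN answers status_show with success set to true.

    Network or HTTP errors propagate as exceptions from urlopen, which is
    itself useful information about why datapusher cannot reach CKAN.
    """
    with urlopen(status_url(site_url), timeout=timeout) as resp:
        payload = json.loads(resp.read().decode("utf-8"))
    return bool(payload.get("success"))


# Example (placeholder URL, run on the datapusher host):
# print(site_is_reachable("http://localhost:5000"))
```

If this fails on the datapusher host but the site works in a browser, the configured URL or port is probably not the one CKAN is actually listening on, as in the apache2 ports case described earlier in the thread.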

Further debugging information is available with this pull request: https://github.com/ckan/datapusher/pull/121

I'm closing this issue since it is so old and long-winded, but feel free to create a new issue with the fuller debugging information.

TomoyaJ commented 7 years ago

Hey guys, I'm getting a similar issue and have created a new one; it's waiting for an answer now. The URL is:

https://github.com/TomoyaJ/Python2/issues/1

kdwarn commented 3 years ago

Adding my experience as it may be helpful to others: I was getting a similar error ("Internal Server Error" in CKAN and "raise HTTPError(datapusher.jobs.HTTPError: " in the logs) on some resources but not others, even ones of the same file type (xls). One file that was not working had filters on it, so I removed them, saved, and tried again. Same error. I tried converting to xlsx: same error. Then, when I converted to csv, I got a more useful error in the CKAN interface: "ReadError('Error reading CSV: %r', err) ReadError('Error reading CSV: %r', Error('line contains NUL'))". I still can't get it into the datastore, but at least I'm getting a reason now.
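A "line contains NUL" error from Python's csv reader means the file contains zero bytes. One common cause, though not confirmed for the files in this thread, is a CSV saved as UTF-16, where every ASCII character carries a zero byte. The sketch below (Python 3, standard library only) checks raw bytes for a byte order mark and for NUL bytes; `sniff` is a hypothetical helper name.

```python
import codecs

# UTF-32 must be tested before UTF-16 because the UTF-32 LE BOM
# begins with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff(data):
    """Return (encoding guessed from the BOM or None, whether NUL bytes occur)."""
    has_nul = b"\x00" in data
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, has_nul
    return None, has_nul
```

If `sniff` reports a UTF-16 or UTF-32 encoding, decoding the bytes with that codec and saving the text again as UTF-8 removes the zero bytes. If it reports NUL bytes without a BOM, the file may be binary or damaged and needs a closer look.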