frictionlessdata / forum

🗣 Frictionless Data Forum esp for "How do I" type questions
https://frictionlessdata.io/

[bug] Support HTTP headers for data loading? #40

Closed. roll closed this issue 4 years ago

roll commented 6 years ago

From @BobHarper1

I get an error whenever I try this with a valid url (that works through my browser), even with the example given on the blog.

urllib2.HTTPError: HTTP Error 404: Not Found

Having had a bit of a search, it looks like I would need to set the User-Agent header on the HTTP request (urllib2.Request). Any idea if I could do that in datapackage.push_datapackage?

A couple more comments about what's happening (though my ability to test different datapackage urls is limited by not knowing about more).

  1. It doesn't seem to be affected by the url scheme, I've tried http:// or https:// and both have brought about the same error
  2. If I make a basic urllib2 request on the url as tested with datapackage.push_datapackage() I'm able to get the response back no problem
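For readers landing here with the same 404: a minimal sketch of setting the User-Agent header on a plain request, using Python 3's urllib.request (the successor to urllib2). The helper names, the example URL and the user-agent string are made up for illustration, and this only covers fetching the descriptor yourself; it does not show any datapackage API.

```python
from urllib.request import Request, urlopen


def build_request(url, user_agent):
    """Build a request that carries an explicit User-Agent header.

    Hypothetical helper: not part of datapackage or any Frictionless library.
    """
    return Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent="my-client/1.0"):
    """Download the resource at `url` and return the raw bytes."""
    with urlopen(build_request(url, user_agent)) as response:
        return response.read()
```

Calling fetch("https://example.org/datapackage.json") would then send the custom User-Agent, which some servers require before they answer with anything other than an error.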

didiez commented 4 years ago

Today we had a conversation about this, which raised some thoughts and ideas that could be of interest. These are the messages, copy-pasted from Discord:

@didiez 9:02

hi there! I'm trying to validate a datapackage with remote URIs with goodtables-py, but I got an error because these URIs are protected. Is there a canonical/recommended way to provide an Authorization header to be used by goodtables-py & datapackage-py when retrieving the remote schema descriptor? Thanks!

@jen-thomas 11:42

@didiez I don't have a solution but that would be super useful (I'm thinking of a restricted dataset in Zenodo, for context :slightly_smiling_face: )

@cpina 11:43

Some repositories might use some "fancy" authentication methods besides HTTP Auth :frowning: , I suspect that it would not work for Zenodo :frowning: but I haven't checked

@didiez 11:52

yeap, I was thinking about a way to provide some sort of plugin or extension for retrieving external resources, which would allow adding headers or whatever is required to access the resource, instead of a plain requests.get(..) call

@cpina 11:53

For Zenodo for example the Zenodo API would be a way to go I think. @didiez : do you have any specific repository in mind?

@didiez 11:53

this could apply to every attribute that points to a remote resource in a datapackage, tableschema, etc.

not any specific repository, just trying to retrieve a tableschema descriptor from an oauth2 protected URL

@cpina 11:55

I'm just a user here but I might add a comment in that issue/40 about a plugin system for fetching files from an API if they are restricted... and I think that some repositories might be quite unfriendly for this (I was looking at Mendeley Data a few weeks ago and I don't think that it had download buttons for all the files)

I still need to write up an issue about path and the problems it has with things like non-published or restricted data. I'm still thinking about local_path and remote_path for something like this. oauth2 protected might be "workable" via a plugin (to refresh the access token if needed during the download...)

@didiez 13:22

I'm not sure splitting path into local_path and remote_path is the way to go: it adds complexity and I don't see how it solves the problem. I would try to stick to the current attributes, but allow the caller to provide a function acting as a proxy/wrapper to which the "retrieving task" is delegated. If no "proxy/delegate" is provided, the behaviour would be as it is now.
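To make the proxy/delegate idea a bit more concrete, here is a rough sketch of what such a hook could look like. Everything in it (load_descriptor, default_fetcher, make_bearer_fetcher, the fetcher parameter) is a hypothetical name invented for illustration; none of it exists in datapackage-py or goodtables-py.

```python
import json
from typing import Callable, Optional
from urllib.request import Request, urlopen

# A fetcher takes a URL and returns the raw response body.
Fetcher = Callable[[str], bytes]


def default_fetcher(url: str) -> bytes:
    """Plain, unauthenticated download: the 'behaviour as it is now' case."""
    with urlopen(url) as response:
        return response.read()


def make_bearer_fetcher(token: str) -> Fetcher:
    """Return a fetcher that adds an Authorization header to every request."""
    def fetch(url: str) -> bytes:
        request = Request(url, headers={"Authorization": f"Bearer {token}"})
        with urlopen(request) as response:
            return response.read()
    return fetch


def load_descriptor(url: str, fetcher: Optional[Fetcher] = None) -> dict:
    """Load a JSON descriptor, delegating retrieval to `fetcher` if given."""
    fetch = fetcher or default_fetcher
    return json.loads(fetch(url).decode("utf-8"))
```

A caller with an OAuth2 token would pass fetcher=make_bearer_fetcher(token); everyone else would call load_descriptor(url) and get the default download. Token refresh could live inside the fetcher as well, since the loader never needs to know how the bytes were obtained.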

I worked around this problem by monkey-patching the requests.get function to add the Authorization header when needed, but it feels a bit hacky
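The exact patch was not posted, so the following is only a guess at what such a monkey-patch could look like. patch_get, the token and the host set are hypothetical; the helper takes the module to patch as an argument so it can be tried against a stand-in object as well as against requests itself.

```python
import functools
from urllib.parse import urlparse


def patch_get(http_module, token, protected_hosts):
    """Replace `http_module.get` with a wrapper that adds a Bearer token.

    Only URLs whose host is in `protected_hosts` get the header.
    Returns the original function so the patch can be undone.
    """
    original_get = http_module.get

    @functools.wraps(original_get)
    def get_with_auth(url, *args, **kwargs):
        if urlparse(url).hostname in protected_hosts:
            # Copy any caller-supplied headers rather than mutating them.
            headers = dict(kwargs.get("headers") or {})
            headers.setdefault("Authorization", f"Bearer {token}")
            kwargs["headers"] = headers
        return original_get(url, *args, **kwargs)

    http_module.get = get_with_auth
    return original_get
```

With requests installed, this would be applied as original = patch_get(requests, token, {"data.example.org"}) and reverted with requests.get = original. It only affects code that looks up requests.get at call time; code that uses its own requests.Session would most likely bypass the patch, which is part of why this feels hacky.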

@cpina 14:08

:thumbsup: (local_path and remote_path wouldn't solve the problem but I think that it might help in some cases. I'm still thinking of this though and I haven't opened any ticket :slightly_smiling_face: )

roll commented 4 years ago

Hi @didiez,

It's not possible with datapackage, for now. But you can provide a requests session when validating a single file - https://github.com/frictionlessdata/tabulator-py#httphttpsftpftps

report = goodtables.validate(source, http_session=...)
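Spelled out a little further, the call above might be used like this. The http_session option is the one mentioned in this reply; the helper names, the Bearer scheme and the token are assumptions for illustration, and the third-party imports are deferred so the sketch can be read (and the header helper tried) without goodtables installed.

```python
def bearer_headers(token):
    """Build the Authorization header for a Bearer token (assumed auth scheme)."""
    return {"Authorization": f"Bearer {token}"}


def validate_protected(source, token):
    """Validate a single remote file, sending the token with each HTTP request.

    Hypothetical wrapper; requires the third-party `requests` and
    `goodtables` packages to be installed.
    """
    import requests
    import goodtables

    session = requests.Session()
    # Headers set on the session are sent with every request made through it.
    session.headers.update(bearer_headers(token))
    return goodtables.validate(source, http_session=session)
```

As noted in the reply, this covers validating a single file; it does not help when the protected URL is a schema or resource referenced from inside a datapackage descriptor.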

rufuspollock commented 4 years ago

FIXED. Looks like this is resolved. Please ping to reopen 😄