Make use of content-type to detect format

pierredittgen commented 6 years ago

Currently, when using tabulator with URL sources, format is infered from:

suffix (e.g. https://example.com/files/foobar.csv -> csv)
format URL parameter (e.g. https://example.com/foobar?format=xls&param=baz -> xls)

One step beyond: when an URL has no format suffix, neither format url parameter, tabulator can request HEAD information and infer format from Content-type information. E.g content type 'text/csv' means csv format.

Content-type to format is defined as a dict (CONTENT_TYPE_FORMAT) that can be extended as needed to support more mime-types.

akariv commented 6 years ago

@pierredittgen looks good, I added some comments. Can you please make sure that tests pass after this change?

roll commented 6 years ago

@akariv @pierredittgen I have some input on this one.

I think we had this implemented already at some point in time (should be possible to find the implementation in the history). But I disabled it because it was really inaccurate (the main problem was with Github Mime-Types as I can remember).

Probably we need a flag to make this feature optional (disabled by default). Or something like this. If my remembering is correct.

akariv commented 6 years ago

I think that Github serves all raw files as text to avoid abuse by 3rd parties (as well as using them as a hosting service for images and videos). 'Regular' files from the internet should have more accurate MIME types I think.

On Mon, Oct 15, 2018 at 11:19 AM roll notifications@github.com wrote:

@akariv https://github.com/akariv @pierredittgen https://github.com/pierredittgen I have some input on this one.

I think we had this implemented already at some point in time (should be possible to find the implementation in the history). But I disabled it because it was really inaccurate (the main problem was with Github Mime-Types as I can remember).

Probably we need a flag like to make this feature optional (disabled by default). Or something like this.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/frictionlessdata/tabulator-py/pull/252#issuecomment-429750971, or mute the thread https://github.com/notifications/unsubscribe-auth/AAQMdUzkkp1LLPPU3XKQeutV5ql0hwL0ks5ulEUngaJpZM4XVSlE .

akariv commented 6 years ago

But controlling this behaviour using a flag is not a bad idea.

On Mon, Oct 15, 2018 at 11:26 AM Adam Kariv adam.kariv@gmail.com wrote:

I think that Github serves all raw files as text to avoid abuse by 3rd parties (as well as using them as a hosting service for images and videos). 'Regular' files from the internet should have more accurate MIME types I think.

On Mon, Oct 15, 2018 at 11:19 AM roll notifications@github.com wrote:

@akariv https://github.com/akariv @pierredittgen https://github.com/pierredittgen I have some input on this one.

I think we had this implemented already at some point in time (should be possible to find the implementation in the history). But I disabled it because it was really inaccurate (the main problem was with Github Mime-Types as I can remember).

Probably we need a flag like to make this feature optional (disabled by default). Or something like this.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/frictionlessdata/tabulator-py/pull/252#issuecomment-429750971, or mute the thread https://github.com/notifications/unsubscribe-auth/AAQMdUzkkp1LLPPU3XKQeutV5ql0hwL0ks5ulEUngaJpZM4XVSlE .

cbenz commented 6 years ago

Has this pull-request been reviewed and accepted? What could we do @pierredittgen and I to move on?

The parameter use_http_content_type=False was added in latest commits.

roll commented 5 years ago

@cbenz @pierredittgen Sorry I've got out of the loop a little bit.

Happy to review but not sure what's status of the PR (the flag, tests).

johanricher commented 5 years ago

What is missing to merge this PR?

roll commented 5 years ago

Sorry for the really late reply. I'm probably slightly out of sync and can be wrong.

But I have a few questions:

so we add an additional http call to the helper function which TBH wasn't intended to be not pure. I was wondering we need to check the http header inside the Remote Loader during the same (as data collection) call. But I guess it requires some design changes (I have to check)
not sure how as a user I can set the use_http_content_type flag. Will not it be always False?
we need tests, readme's API update

pierredittgen commented 5 years ago

not sure how as a user I can set the use_http_content_type flag. Will not it be always False?

In fact, in our code, we directly call the detect_scheme_and_format function to try to get the format of the given resource. Then, we can pass this info to the "inspect" method of goodtables.inspector.

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

johanricher commented 5 years ago

@roll It seems a lot of work & progress was made here but not totally finished. What's left to do in order to merge this PR? On our side, @pierredittgen can still be available to help close this. :)

roll commented 5 years ago

@johanricher

The main problem was that it'd been intended to update the helpers.detect_scheme_and_format function which is:

not a part of public API
wasn't created to contain any IO calls

We need to add a publically available option to the remote loader:

https://github.com/frictionlessdata/tabulator-py#httphttpsftpftps

And probably copy this code to something like helpers.get_format_by_content_type and use it inside the Stream.open (we can't use it inside the remote loader because a loader can't update a file format...). TBH not sure exactly but something like this.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

frictionlessdata / tabulator-py

Make use of content-type to detect format #252