Closed pierredittgen closed 4 years ago
@pierredittgen looks good, I added some comments. Can you please make sure that tests pass after this change?
@akariv @pierredittgen I have some input on this one.
I think we had this implemented already at some point in time (should be possible to find the implementation in the history). But I disabled it because it was really inaccurate (the main problem was with Github Mime-Types as I can remember).
Probably we need a flag to make this feature optional (disabled by default). Or something like this. If my remembering is correct.
I think that Github serves all raw files as text to avoid abuse by 3rd parties (as well as using them as a hosting service for images and videos). 'Regular' files from the internet should have more accurate MIME types I think.
On Mon, Oct 15, 2018 at 11:19 AM roll notifications@github.com wrote:
@akariv https://github.com/akariv @pierredittgen https://github.com/pierredittgen I have some input on this one.
I think we had this implemented already at some point in time (should be possible to find the implementation in the history). But I disabled it because it was really inaccurate (the main problem was with Github Mime-Types as I can remember).
Probably we need a flag like to make this feature optional (disabled by default). Or something like this.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/frictionlessdata/tabulator-py/pull/252#issuecomment-429750971, or mute the thread https://github.com/notifications/unsubscribe-auth/AAQMdUzkkp1LLPPU3XKQeutV5ql0hwL0ks5ulEUngaJpZM4XVSlE .
But controlling this behaviour using a flag is not a bad idea.
On Mon, Oct 15, 2018 at 11:26 AM Adam Kariv adam.kariv@gmail.com wrote:
I think that Github serves all raw files as text to avoid abuse by 3rd parties (as well as using them as a hosting service for images and videos). 'Regular' files from the internet should have more accurate MIME types I think.
On Mon, Oct 15, 2018 at 11:19 AM roll notifications@github.com wrote:
@akariv https://github.com/akariv @pierredittgen https://github.com/pierredittgen I have some input on this one.
I think we had this implemented already at some point in time (should be possible to find the implementation in the history). But I disabled it because it was really inaccurate (the main problem was with Github Mime-Types as I can remember).
Probably we need a flag like to make this feature optional (disabled by default). Or something like this.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/frictionlessdata/tabulator-py/pull/252#issuecomment-429750971, or mute the thread https://github.com/notifications/unsubscribe-auth/AAQMdUzkkp1LLPPU3XKQeutV5ql0hwL0ks5ulEUngaJpZM4XVSlE .
Has this pull-request been reviewed and accepted? What could we do @pierredittgen and I to move on?
The parameter use_http_content_type=False
was added in latest commits.
@cbenz @pierredittgen Sorry I've got out of the loop a little bit.
Happy to review but not sure what's status of the PR (the flag, tests).
What is missing to merge this PR?
Sorry for the really late reply. I'm probably slightly out of sync and can be wrong.
But I have a few questions:
use_http_content_type
flag. Will not it be always False?not sure how as a user I can set the use_http_content_type flag. Will not it be always False?
In fact, in our code, we directly call the detect_scheme_and_format
function to try to get the format of the given resource. Then, we can pass this info to the "inspect" method of goodtables.inspector.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@roll It seems a lot of work & progress was made here but not totally finished. What's left to do in order to merge this PR? On our side, @pierredittgen can still be available to help close this. :)
@johanricher
The main problem was that it'd been intended to update the helpers.detect_scheme_and_format
function which is:
We need to add a publically available option to the remote
loader:
And probably copy this code to something like helpers.get_format_by_content_type
and use it inside the Stream.open
(we can't use it inside the remote
loader because a loader can't update a file format...). TBH not sure exactly but something like this.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Currently, when using tabulator with URL sources, format is infered from:
format
URL parameter (e.g. https://example.com/foobar?format=xls¶m=baz -> xls)One step beyond: when an URL has no format suffix, neither format url parameter, tabulator can request HEAD information and infer format from Content-type information. E.g content type 'text/csv' means csv format.
Content-type
toformat
is defined as a dict (CONTENT_TYPE_FORMAT
) that can be extended as needed to support more mime-types.