frictionlessdata / tabulator-py

Python library for reading and writing tabular data via streams.
https://frictionlessdata.io
MIT License

"Source has been detected as HTML" exception when source is CSV that contains HTML #101

Closed AleksiKnuutila closed 7 years ago

AleksiKnuutila commented 8 years ago

The detect_html method may be too sensitive and flag valid CSV files as HTML. For instance, take the following simple CSV file:

col1,col2
val1,"<html>"

When I run the following code:

from tabulator import Stream

with Stream('test.csv', headers=1) as stream:
    print(stream.headers)

I get the exception:

Traceback (most recent call last):
  File "test.py", line 3, in <module>
    with Stream('test.csv', headers=1) as stream:
  File "/usr/local/lib/python2.7/site-packages/tabulator/stream.py", line 133, in __enter__
    self.open()
  File "/usr/local/lib/python2.7/site-packages/tabulator/stream.py", line 159, in open
    self.__detect_html()
  File "/usr/local/lib/python2.7/site-packages/tabulator/stream.py", line 291, in __detect_html
    raise exceptions.TabulatorException(msg)
tabulator.exceptions.TabulatorException: Source has been detected as HTML (not supported)

roll commented 8 years ago

Thanks! We'll find a better approach to detect html.

roll commented 8 years ago

This issue is eligible for https://hacktoberfest.digitalocean.com/ (possible participant is @sirex)

Overview

All HTML detection logic is encapsulated in the helpers.detect_html function.

For now it's pretty naive: https://github.com/frictionlessdata/tabulator-py/blob/master/tabulator/helpers.py#L91-L96

I suppose we could make it smarter, and even remove the BeautifulSoup dependency, by looking for the patterns that mark the beginning of an HTML document. We don't need 100% detection at the cost of many false positives, just a reasonable percentage of detected HTML documents (it's a common error when, e.g., a user opens a CSV file's GitHub page instead of the raw GitHub file).
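To make the idea concrete, here is a minimal sketch of what pattern-based detection could look like. It is only an illustration, not the library's actual implementation: the function signature (taking the first chunk of the source as text) and the regular expression are assumptions. It treats a source as HTML only when the text *starts* with a doctype or an `<html` tag, so a CSV cell that merely contains `<html>` is no longer flagged.

```python
import re

# Assumed pattern (not from tabulator): optional leading whitespace,
# then either "<!doctype html" or "<html" as a whole tag name.
_HTML_START = re.compile(r'^\s*<(!doctype\s+html|html)\b', re.IGNORECASE)


def detect_html(text):
    """Return True if `text` looks like the beginning of an HTML document.

    `text` is assumed to be the first chunk of the source, already decoded.
    """
    return bool(_HTML_START.match(text))


# The CSV from this issue is not flagged, because "<html>" is not at the start.
print(detect_html('col1,col2\nval1,"<html>"\n'))   # False
# A typical HTML page is still detected.
print(detect_html('<!DOCTYPE html>\n<html>\n'))    # True
```

This deliberately misses pages that begin with a comment, an XML declaration, or a byte order mark, which matches the goal of catching a reasonable share of HTML sources rather than all of them.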

Tasks