Closed AleksiKnuutila closed 7 years ago
Thanks! We'll find a better approach to detect html.
This issue is eligible for https://hacktoberfest.digitalocean.com/ (possible participant is @sirex)
All HTML detection logic is encapsulated in the helpers.detect_html function.
For now it's pretty naive: https://github.com/frictionlessdata/tabulator-py/blob/master/tabulator/helpers.py#L91-L96
I suppose we could make it smarter, and even remove the BeautifulSoup dependency, by looking for patterns that mark the beginning of an HTML document. We don't need 100% detection at the cost of many false positives, just a reasonable percentage of detected HTML files (it's a common error when, e.g., a user opens the GitHub page for a CSV file instead of the raw file).
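To illustrate the idea of checking for document-beginning patterns, here is a minimal sketch. It is not the library's implementation: the function name `looks_like_html`, the `sample_size` parameter, and the pattern list are all assumptions chosen for illustration. It only inspects the start of the text, so HTML-like fragments inside a CSV cell should not trigger it.

```python
import re

# Hypothetical pattern list (an assumption, not tabulator's code):
# markers that commonly open an HTML document.
_HTML_START = re.compile(
    r'^\s*(?:<!doctype\s+html|<html[\s>]|<head[\s>]|<body[\s>])',
    re.IGNORECASE,
)


def looks_like_html(text, sample_size=1024):
    """Return True if the beginning of `text` resembles an HTML document."""
    # Only look at the first characters, and drop a leading byte order mark.
    sample = text[:sample_size].lstrip('\ufeff')
    return _HTML_START.match(sample) is not None
```

Because the pattern is anchored to the beginning of the input, a CSV file that merely contains tags in some cell (for example `1,<b>bold</b>`) would not be flagged, while a page saved from a browser would be. The trade-off is that HTML starting with something unusual, such as a long comment, would be missed, which seems acceptable given the goal stated above.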
The detect_html method might be too sensitive, and flag valid CSV files as HTML. For instance with the following simple CSV file:
When I run the following code:
I get the exception: