Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
8.41k stars 691 forks

bug(auto): file_and_type_from_url() does not recognize valid `text/html; charset=utf8` Content_Type header #2636

Open ururk opened 6 months ago

ururk commented 6 months ago

Describe the bug

I came across a webpage which is being detected as a CSV file. It should be detected as HTML. The page returns its content type as:

Content-Type: text/html; charset=utf-8

To Reproduce

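from unstructured.partition.auto import file_and_type_from_url

# Fetch the page and let unstructured infer its file type.
file, filetype = file_and_type_from_url(
    url="https://sites.google.com/umich.edu/mm-post-award-manual/project-management/cost-sharing-commitments-internal-agreements?authuser=0",
    headers={'User-Agent': 'Mozilla/5.0'},
)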

Expected behavior

I expect the page to be detected as HTML, but it is detected as CSV.

Additional context

When the page is read in, only a portion of it is read, and because of the way Google Sites formats its HTML, the lines end up split like this:

[
'<!DOCTYPE html><html lang="en-US" itemscope itemtype="http://schema.org/WebPage"><head><meta charset="utf-8"><script nonce="7aEdh0ByXXm4OdD9gpzmIA">var DOCS_timing={}; DOCS_timing[\'sl\']=new Date().getTime();</script><script nonce="7aEdh0ByXXm4OdD9gpzmIA">function _DumpException(e) {throw e;}</script><script nonce="7aEdh0ByXXm4OdD9gpzmIA">_docs_flag_initialData={"atari-emtpr":false,"atari-ebidm":true,"atari-ebids":true,"atari-edtm":true,"atari-eibrm":false,"atari-ectm":false,"atari-ects":false,"docs-text-elei":false,"docs-text-usc":true,"atari-bae":false,"docs-text-eessmkc":true,"docs-text-emtps":false,"docs-text-etsrdpn":false,"docs-text-etsrds":false,"docs-text-erdfs":false,"docs-text-encps":false,"docs-text-endes":false,"docs-text-escpv":true,"docs-text-ecfs":false,"docs-text-ecis":false,"docs-text-eessips":true,"docs-text-eectfs":false,"docs-text-edctzs":true,"docs-text-eetxpc":false,"docs-text-eetxp":false,"docs-text-lns":true,"docs-text-ertkmcp":true,"docs-text-ettctvs":false,"docs-text-ettts":false,"docs-text-issermps":false,"docs-text-emscts":false,"docs-text-ecgvd":false,"docs-text-esbbs":false,"docs-text-etccdts":false,"docs-text-etcchrs":false,"docs-text-etctrs":false,"docs-text-etctids":false,"docs-text-eltbbs":false,"docs-etshc":false,"docs-text-tbcb":2.0E7,"docs-efsmsdl":false,"docs-text-etb":false,"docs-text-esbefr":false,"docs-text-etof":false,"docs-text-ipi":false,"docs-text-ehlb":false,"docs-text-epa":true,"docs-text-ecls":true,"docs-text-dwit":false,"docs-text-elawp":false,"docs-eec":false,"docs-ecot":"","docs-text-enbcr":false,"docs-text-svofc":false,"docs-sup":"","umss":false,"docs-eldi":false,"docs-dli":false,"docs-liap":"/logImpressions","ilcm":{"eui":"AHKXmL0GP6UnOh4ObcyGLZyOq1-lslCu_VFbUQCm1RjpF5JAQeQnIevQskC6-rmVr_Xx1pjbMRTK","je":1,"sstu":1710189581077655,"si":"CJqZ6tOI7YQDFf4PbwYdF_IJYA","gsc":null,"ei":[5703839,5704621,5706832,5706836,5707711,5735806,5737800,5738529,5740814,5743124,5746992,5747261,5748029,5752694,5753329,5754229,575509
6,5758823,5760348,5762259,5764268,5765551,5766777,5770435,5773678,5774347,5774852,5776517,5777194,5783801,5784947,5784967,5791299,5791782,5792684,5796151,5796473,5797291,14101306,14101502,14101510,14101534,49372443,49375322,49451559,49453045,49472071,49512373,49622831,49623181,49644023,49769345,49822929,49823172,49824163,49833470,49842863,49924714,50082748,50166959,50221728,50266230,50273536,50335897,50360148,50390165,50492350,50515335,50520321,50529111,50533184,50580252,50606355,70979410,71008281,71035308,71038263,71079946,71085249,71123572,71152133,71178680,71185178,71197834,71230233,71238954,71260350,71289154,71301338,71330601,71346960,71407393,71471882,71478208,71483995,71528605,71530091,71531305,71533377,71573878,71600925,71624114,71625588,71632274,71659821,71671626,71689868,71733783,71881299,71924359,71960548,94339809,94353376,94373966,94492857],"crc":0,"cvi":[]},"docs-ccdil":false,"docs-eil":true,"info_params":{},"buildLabel":"editors.sites-viewer-frontend_20240227.02_p0","docs-show_debug_info":false,"atari-jefp":"/_/view/jserror","docs-jern":"view","atari-rhpp":"/_/view","docs-ecuach":false,"docs-cclt":2033,"docs-ecci":true,"docs-esi":false,"docs-efypr":true,"docs-eyprp":false,"docs-eytpgcv":0}; _docs_flag_cek= null ; if (window[\'DOCS_timing\']) {DOCS_timing[\'ifdld\']=new Date().getTime();}</script><meta name="viewport" content="width=device-width, initial-scale=1"><meta http-equiv="X-UA-Compatible" content="IE=edge"><meta name="referrer" content="origin"><link rel="icon" href="https://lh3.googleusercontent.com/D7Lls9cfTmXrQ3tPDeQx-niO5hKS3yXYMB2K8ttobrQ9pg0as-PMZc9KGFojk9fZoiboMQUBBzIvU_fpK5hwznF5jlSRvZxxdWqiJKIHo7NR1SM"><meta property="og:title" content="Cost Sharing, Commitments &amp; Internal Agreements"><meta property="og:type" content="website"><meta property="og:url" content="https://sites.google.com/umich.edu/mm-post-award-manual/project-management/cost-sharing-commitments-internal-agreements"><meta property="og:description" content="', 
'Cost Sharing, Commitments &amp; Internal Agreements"><meta itemprop="name" content="Cost Sharing, Commitments &amp;'
]
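
To illustrate why fragments like these can be mistaken for CSV, here is a minimal sketch of a naive comma-counting heuristic. It is an assumption made for illustration, not necessarily how unstructured decides; `naive_looks_like_csv` and the shortened sample strings are hypothetical.

```python
def naive_looks_like_csv(text: str) -> bool:
    """Guess CSV when every line has the same non-zero number of commas."""
    lines = text.strip().splitlines()
    if len(lines) < 2:
        return False
    comma_counts = [line.count(",") for line in lines]
    return comma_counts[0] > 0 and all(c == comma_counts[0] for c in comma_counts)


# Shortened, made-up stand-in for the two HTML fragments above.
with_comma = (
    '<!DOCTYPE html><meta property="og:title" content="Cost Sharing, Commitments">\n'
    '<meta itemprop="name" content="Cost Sharing, Commitments">'
)
without_comma = with_comma.replace(",", "")

print(naive_looks_like_csv(with_comma))     # True
print(naive_looks_like_csv(without_comma))  # False
```

With a heuristic of this kind, a single comma in the page title is enough to make two HTML lines look like CSV rows.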

scanny commented 6 months ago

How were you calling unstructured when you saw this problem?

When partitioning, you can pass the content_type parameter to specify the content type explicitly when you know it and auto-detection gets it wrong.

ururk commented 6 months ago

I'm using the UnstructuredURLLoader:

loader = UnstructuredURLLoader(urls=["https://sites.google.com/umich.edu/mm-post-award-manual/project-management/cost-sharing-commitments-internal-agreements?authuser=0"], continue_on_failure=False, headers={'User-Agent': 'Mozilla/5.0'})

Internally this calls (pseudo-code):

from unstructured.partition.auto import partition

elements = partition(
    url=url, headers=self.headers, **self.unstructured_kwargs
)

I suppose this "bug report" might belong in the UnstructuredURLLoader library instead (since partition takes a content_type param). The bigger problem, though, is that the document starts with a <!DOCTYPE html> declaration, which could be taken into account when deciding whether it is a CSV.
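
A minimal sketch of the kind of check meant here, assuming the detector has the first bytes of the response available. `looks_like_html` is a hypothetical helper, not a function in unstructured:

```python
import re

# Matches a leading <!DOCTYPE html ...> or <html ...> tag, ignoring case
# and any leading whitespace.
_HTML_START = re.compile(rb"^\s*(?:<!doctype\s+html|<html[\s>])", re.IGNORECASE)


def looks_like_html(head: bytes) -> bool:
    """Return True when the start of a payload looks like an HTML document."""
    # Drop a UTF-8 byte-order mark if one is present.
    if head.startswith(b"\xef\xbb\xbf"):
        head = head[3:]
    return _HTML_START.match(head) is not None


print(looks_like_html(b'<!DOCTYPE html><html lang="en-US">'))  # True
print(looks_like_html(b"name,amount\nrent,1200\n"))            # False
```

Running a check like this before any CSV heuristic would let an explicit doctype win over comma counting.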

ururk commented 6 months ago

The other thing I should point out: charset is a valid directive in the Content-Type header, not a non-standard one:

https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Type
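
As a sketch of how such a header can be split into its media type and charset, Python's standard library can do the parsing. `split_content_type` is a hypothetical helper name, and this is not necessarily how unstructured handles the header:

```python
from email.message import Message
from typing import Optional, Tuple


def split_content_type(header_value: str) -> Tuple[str, Optional[str]]:
    """Return (media type, charset) parsed from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_type() ignores parameters such as charset.
    return msg.get_content_type(), msg.get_content_charset()


print(split_content_type("text/html; charset=utf-8"))  # ('text/html', 'utf-8')
print(split_content_type("text/csv"))                  # ('text/csv', None)
```

Comparing only the media type part against the known types would make the charset directive irrelevant to file type detection.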

ururk commented 6 months ago

FYI: the comma may get removed from the page title, which might make the issue go away for this URL. See Additional context above for the text that was being used to detect CSV.

scanny commented 6 months ago

Yeah, I was just typing that :) This is a case we should handle.

You can try the content_type argument as a workaround for now, but I'll see about getting this fixed up.

I believe you can add {"content_type": "text/html"} as the unstructured_kwargs argument to the UnstructuredURLLoader call to pass it along to unstructured.
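
A sketch of that workaround, assuming the loader forwards extra keyword arguments to partition() as the pseudo-code earlier in the thread suggests. The import path and both helper names (`build_loader_kwargs`, `load_as_html`) are assumptions, so check them against the langchain version in use:

```python
def build_loader_kwargs(headers: dict) -> dict:
    """Keyword arguments for UnstructuredURLLoader with HTML detection forced."""
    return {
        "continue_on_failure": False,
        "headers": headers,
        # Assumed to be forwarded to partition() along with other extra kwargs.
        "content_type": "text/html",
    }


def load_as_html(urls: list, headers: dict):
    """Load the given URLs, telling unstructured to treat them as HTML."""
    # Imported lazily so the helper above works without langchain installed.
    from langchain_community.document_loaders import UnstructuredURLLoader

    loader = UnstructuredURLLoader(urls=urls, **build_loader_kwargs(headers))
    return loader.load()
```

Note that this forces text/html for every URL handed to the loader, so it only fits batches known to contain HTML pages.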

ururk commented 6 months ago

> Yeah, I was just typing that :) This is a case we should handle.

Thanks!

> You can try the content_type argument as a workaround for now, but I'll see about getting this fixed up.
>
> I believe you can add {"content_type": "text/html"} as the unstructured_kwargs argument to the UnstructuredURLLoader call to pass it along to unstructured.

Ah, yes, I believe that would work. Since this is part of a bigger, automated project, I'm not able to make a one-off tweak like that (but it's something to consider).