eyurtsev / kor

LLM(😽)
https://eyurtsev.github.io/kor/
MIT License
1.6k stars 88 forks

SyntaxError: 'await' outside function #205

Closed natea closed 1 year ago

natea commented 1 year ago

When I try to run the example documented here: https://eyurtsev.github.io/kor/document_extraction.html

I get this error when running the extract_from_documents function:

    document_extraction_results = await extract_from_documents(
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
SyntaxError: 'await' outside function

eyurtsev commented 1 year ago

What environment are you running the code under?

natea commented 1 year ago

After running `pip install kor`, it still complained that the following packages were missing: selenium, unstructured, and markdownify.

So I made a requirements.txt file with the following contents and installed it using `pip install -r requirements.txt`:

kor
selenium
unstructured
markdownify

Here's the result of that install:

(kor) nateaune@Nates-MBP kor % pip install -r requirements.txt
Collecting kor (from -r requirements.txt (line 1))
  Obtaining dependency information for kor from https://files.pythonhosted.org/packages/ab/91/1b349269b587594461361c60acd62a90bd101ae7aaea709746603ee06326/kor-0.13.0-py3-none-any.whl.metadata
  Using cached kor-0.13.0-py3-none-any.whl.metadata (6.2 kB)
Collecting selenium (from -r requirements.txt (line 2))
  Obtaining dependency information for selenium from https://files.pythonhosted.org/packages/10/56/8288d1813a68c1e0638515dbb777fce6d87d0d240e683216f956145310e6/selenium-4.11.2-py3-none-any.whl.metadata
  Using cached selenium-4.11.2-py3-none-any.whl.metadata (7.0 kB)
Collecting unstructured (from -r requirements.txt (line 3))
  Obtaining dependency information for unstructured from https://files.pythonhosted.org/packages/a7/98/5ccd2b4003c6a38303832c6170bee1c3821202771121abe7af81c2adbe05/unstructured-0.9.2-py3-none-any.whl.metadata
  Downloading unstructured-0.9.2-py3-none-any.whl.metadata (23 kB)
Collecting markdownify (from -r requirements.txt (line 4))
  Using cached markdownify-0.11.6-py3-none-any.whl (16 kB)
Collecting langchain>=0.0.205 (from kor->-r requirements.txt (line 1))
  Obtaining dependency information for langchain>=0.0.205 from https://files.pythonhosted.org/packages/3d/3b/e1b71f46dd68182f781483ec6ec13db1afc359f93cac19dd0accbad536c1/langchain-0.0.262-py3-none-any.whl.metadata
  Downloading langchain-0.0.262-py3-none-any.whl.metadata (15 kB)
Collecting openai<0.28,>=0.27 (from kor->-r requirements.txt (line 1))
  Obtaining dependency information for openai<0.28,>=0.27 from https://files.pythonhosted.org/packages/67/78/7588a047e458cb8075a4089d721d7af5e143ff85a2388d4a28c530be0494/openai-0.27.8-py3-none-any.whl.metadata
  Using cached openai-0.27.8-py3-none-any.whl.metadata (13 kB)
Collecting pandas<2.0.0,>=1.5.3 (from kor->-r requirements.txt (line 1))
  Using cached pandas-1.5.3-cp310-cp310-macosx_10_9_x86_64.whl (12.0 MB)
Collecting urllib3[socks]<3,>=1.26 (from selenium->-r requirements.txt (line 2))
  Obtaining dependency information for urllib3[socks]<3,>=1.26 from https://files.pythonhosted.org/packages/9b/81/62fd61001fa4b9d0df6e31d47ff49cfa9de4af03adecf339c7bc30656b37/urllib3-2.0.4-py3-none-any.whl.metadata
  Downloading urllib3-2.0.4-py3-none-any.whl.metadata (6.6 kB)
Collecting trio~=0.17 (from selenium->-r requirements.txt (line 2))
  Obtaining dependency information for trio~=0.17 from https://files.pythonhosted.org/packages/a3/dd/b61fa61b186d3267ef3903048fbee29132963ae762fb70b08d4a3cd6f7aa/trio-0.22.2-py3-none-any.whl.metadata
  Using cached trio-0.22.2-py3-none-any.whl.metadata (4.7 kB)
Collecting trio-websocket~=0.9 (from selenium->-r requirements.txt (line 2))
  Obtaining dependency information for trio-websocket~=0.9 from https://files.pythonhosted.org/packages/a5/a6/06e2373f95c12e9e8f6b910a76c86e375348ead77ab476230640666310fb/trio_websocket-0.10.3-py3-none-any.whl.metadata
  Using cached trio_websocket-0.10.3-py3-none-any.whl.metadata (4.6 kB)
Collecting certifi>=2021.10.8 (from selenium->-r requirements.txt (line 2))
  Obtaining dependency information for certifi>=2021.10.8 from https://files.pythonhosted.org/packages/4c/dd/2234eab22353ffc7d94e8d13177aaa050113286e93e7b40eae01fbf7c3d9/certifi-2023.7.22-py3-none-any.whl.metadata
  Downloading certifi-2023.7.22-py3-none-any.whl.metadata (2.2 kB)
Collecting chardet (from unstructured->-r requirements.txt (line 3))
  Obtaining dependency information for chardet from https://files.pythonhosted.org/packages/38/6f/f5fbc992a329ee4e0f288c1fe0e2ad9485ed064cac731ed2fe47dcc38cbf/chardet-5.2.0-py3-none-any.whl.metadata
  Using cached chardet-5.2.0-py3-none-any.whl.metadata (3.4 kB)
Collecting filetype (from unstructured->-r requirements.txt (line 3))
  Using cached filetype-1.2.0-py2.py3-none-any.whl (19 kB)
Collecting python-magic (from unstructured->-r requirements.txt (line 3))
  Using cached python_magic-0.4.27-py2.py3-none-any.whl (13 kB)
Collecting lxml (from unstructured->-r requirements.txt (line 3))
  Obtaining dependency information for lxml from https://files.pythonhosted.org/packages/78/8d/96b95d704fab4a95651ceeb6022855ae5a3c631f86c6647749a2e868af92/lxml-4.9.3-cp310-cp310-macosx_11_0_x86_64.whl.metadata
  Using cached lxml-4.9.3-cp310-cp310-macosx_11_0_x86_64.whl.metadata (3.8 kB)
Collecting nltk (from unstructured->-r requirements.txt (line 3))
  Using cached nltk-3.8.1-py3-none-any.whl (1.5 MB)
Collecting tabulate (from unstructured->-r requirements.txt (line 3))
  Using cached tabulate-0.9.0-py3-none-any.whl (35 kB)
Collecting requests (from unstructured->-r requirements.txt (line 3))
  Obtaining dependency information for requests from https://files.pythonhosted.org/packages/70/8e/0e2d847013cb52cd35b38c009bb167a1a26b2ce6cd6965bf26b47bc0bf44/requests-2.31.0-py3-none-any.whl.metadata
  Downloading requests-2.31.0-py3-none-any.whl.metadata (4.6 kB)
Collecting beautifulsoup4<5,>=4.9 (from markdownify->-r requirements.txt (line 4))
  Using cached beautifulsoup4-4.12.2-py3-none-any.whl (142 kB)
Collecting six<2,>=1.15 (from markdownify->-r requirements.txt (line 4))
  Using cached six-1.16.0-py2.py3-none-any.whl (11 kB)
Collecting soupsieve>1.2 (from beautifulsoup4<5,>=4.9->markdownify->-r requirements.txt (line 4))
  Using cached soupsieve-2.4.1-py3-none-any.whl (36 kB)
Collecting PyYAML>=5.3 (from langchain>=0.0.205->kor->-r requirements.txt (line 1))
  Obtaining dependency information for PyYAML>=5.3 from https://files.pythonhosted.org/packages/96/06/4beb652c0fe16834032e54f0956443d4cc797fe645527acee59e7deaa0a2/PyYAML-6.0.1-cp310-cp310-macosx_10_9_x86_64.whl.metadata
  Using cached PyYAML-6.0.1-cp310-cp310-macosx_10_9_x86_64.whl.metadata (2.1 kB)
Collecting SQLAlchemy<3,>=1.4 (from langchain>=0.0.205->kor->-r requirements.txt (line 1))
  Obtaining dependency information for SQLAlchemy<3,>=1.4 from https://files.pythonhosted.org/packages/ae/42/101761a65b8d83efa5d87cbb61144dae557ed60087daeae89e965449963f/SQLAlchemy-2.0.19-cp310-cp310-macosx_10_9_x86_64.whl.metadata
  Using cached SQLAlchemy-2.0.19-cp310-cp310-macosx_10_9_x86_64.whl.metadata (9.4 kB)
Collecting aiohttp<4.0.0,>=3.8.3 (from langchain>=0.0.205->kor->-r requirements.txt (line 1))
  Obtaining dependency information for aiohttp<4.0.0,>=3.8.3 from https://files.pythonhosted.org/packages/f3/56/a5a062bc98e8d5848f7790963771f8354f488726a59fd650742ca7391171/aiohttp-3.8.5-cp310-cp310-macosx_10_9_x86_64.whl.metadata
  Downloading aiohttp-3.8.5-cp310-cp310-macosx_10_9_x86_64.whl.metadata (7.7 kB)
Collecting async-timeout<5.0.0,>=4.0.0 (from langchain>=0.0.205->kor->-r requirements.txt (line 1))
  Obtaining dependency information for async-timeout<5.0.0,>=4.0.0 from https://files.pythonhosted.org/packages/a7/fa/e01228c2938de91d47b307831c62ab9e4001e747789d0b05baf779a6488c/async_timeout-4.0.3-py3-none-any.whl.metadata
  Downloading async_timeout-4.0.3-py3-none-any.whl.metadata (4.2 kB)
Collecting dataclasses-json<0.6.0,>=0.5.7 (from langchain>=0.0.205->kor->-r requirements.txt (line 1))
  Obtaining dependency information for dataclasses-json<0.6.0,>=0.5.7 from https://files.pythonhosted.org/packages/97/5f/e7cc90f36152810cab08b6c9c1125e8bcb9d76f8b3018d101b5f877b386c/dataclasses_json-0.5.14-py3-none-any.whl.metadata
  Downloading dataclasses_json-0.5.14-py3-none-any.whl.metadata (22 kB)
Collecting langsmith<0.1.0,>=0.0.11 (from langchain>=0.0.205->kor->-r requirements.txt (line 1))
  Obtaining dependency information for langsmith<0.1.0,>=0.0.11 from https://files.pythonhosted.org/packages/a9/37/c07b98cdbf680714bf7fc7fa653cb722eff56a20df4232adc973fa98da30/langsmith-0.0.21-py3-none-any.whl.metadata
  Downloading langsmith-0.0.21-py3-none-any.whl.metadata (10 kB)
Collecting numexpr<3.0.0,>=2.8.4 (from langchain>=0.0.205->kor->-r requirements.txt (line 1))
  Obtaining dependency information for numexpr<3.0.0,>=2.8.4 from https://files.pythonhosted.org/packages/88/3c/8af55554773ff8d5ed344050fb09788966c9a5b63e9d8de28b60f5a04fa8/numexpr-2.8.5-cp310-cp310-macosx_10_9_x86_64.whl.metadata
  Using cached numexpr-2.8.5-cp310-cp310-macosx_10_9_x86_64.whl.metadata (8.0 kB)
Collecting numpy<2,>=1 (from langchain>=0.0.205->kor->-r requirements.txt (line 1))
  Obtaining dependency information for numpy<2,>=1 from https://files.pythonhosted.org/packages/d5/50/8aedb5ff1460e7c8527af15c6326115009e7c270ec705487155b779ebabb/numpy-1.25.2-cp310-cp310-macosx_10_9_x86_64.whl.metadata
  Downloading numpy-1.25.2-cp310-cp310-macosx_10_9_x86_64.whl.metadata (5.6 kB)
Collecting openapi-schema-pydantic<2.0,>=1.2 (from langchain>=0.0.205->kor->-r requirements.txt (line 1))
  Using cached openapi_schema_pydantic-1.2.4-py3-none-any.whl (90 kB)
Collecting pydantic<2,>=1 (from langchain>=0.0.205->kor->-r requirements.txt (line 1))
  Obtaining dependency information for pydantic<2,>=1 from https://files.pythonhosted.org/packages/58/26/ca79779dc217222d308254b4d4312108c4ac334fb63d97596e0ba0982868/pydantic-1.10.12-cp310-cp310-macosx_10_9_x86_64.whl.metadata
  Using cached pydantic-1.10.12-cp310-cp310-macosx_10_9_x86_64.whl.metadata (149 kB)
Collecting tenacity<9.0.0,>=8.1.0 (from langchain>=0.0.205->kor->-r requirements.txt (line 1))
  Using cached tenacity-8.2.2-py3-none-any.whl (24 kB)
Collecting tqdm (from openai<0.28,>=0.27->kor->-r requirements.txt (line 1))
  Obtaining dependency information for tqdm from https://files.pythonhosted.org/packages/00/e5/f12a80907d0884e6dff9c16d0c0114d81b8cd07dc3ae54c5e962cc83037e/tqdm-4.66.1-py3-none-any.whl.metadata
  Downloading tqdm-4.66.1-py3-none-any.whl.metadata (57 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 57.6/57.6 kB 5.3 MB/s eta 0:00:00
Collecting python-dateutil>=2.8.1 (from pandas<2.0.0,>=1.5.3->kor->-r requirements.txt (line 1))
  Using cached python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)
Collecting pytz>=2020.1 (from pandas<2.0.0,>=1.5.3->kor->-r requirements.txt (line 1))
  Using cached pytz-2023.3-py2.py3-none-any.whl (502 kB)
Collecting charset-normalizer<4,>=2 (from requests->unstructured->-r requirements.txt (line 3))
  Obtaining dependency information for charset-normalizer<4,>=2 from https://files.pythonhosted.org/packages/81/a0/96317ce912b512b7998434eae5e24b28bcc5f1680ad85348e31e1ca56332/charset_normalizer-3.2.0-cp310-cp310-macosx_10_9_x86_64.whl.metadata
  Downloading charset_normalizer-3.2.0-cp310-cp310-macosx_10_9_x86_64.whl.metadata (31 kB)
Collecting idna<4,>=2.5 (from requests->unstructured->-r requirements.txt (line 3))
  Using cached idna-3.4-py3-none-any.whl (61 kB)
Collecting attrs>=20.1.0 (from trio~=0.17->selenium->-r requirements.txt (line 2))
  Using cached attrs-23.1.0-py3-none-any.whl (61 kB)
Collecting sortedcontainers (from trio~=0.17->selenium->-r requirements.txt (line 2))
  Using cached sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB)
Collecting outcome (from trio~=0.17->selenium->-r requirements.txt (line 2))
  Using cached outcome-1.2.0-py2.py3-none-any.whl (9.7 kB)
Collecting sniffio (from trio~=0.17->selenium->-r requirements.txt (line 2))
  Using cached sniffio-1.3.0-py3-none-any.whl (10 kB)
Collecting exceptiongroup>=1.0.0rc9 (from trio~=0.17->selenium->-r requirements.txt (line 2))
  Obtaining dependency information for exceptiongroup>=1.0.0rc9 from https://files.pythonhosted.org/packages/fe/17/f43b7c9ccf399d72038042ee72785c305f6c6fdc6231942f8ab99d995742/exceptiongroup-1.1.2-py3-none-any.whl.metadata
  Using cached exceptiongroup-1.1.2-py3-none-any.whl.metadata (6.1 kB)
Collecting wsproto>=0.14 (from trio-websocket~=0.9->selenium->-r requirements.txt (line 2))
  Using cached wsproto-1.2.0-py3-none-any.whl (24 kB)
Collecting pysocks!=1.5.7,<2.0,>=1.5.6 (from urllib3[socks]<3,>=1.26->selenium->-r requirements.txt (line 2))
  Using cached PySocks-1.7.1-py3-none-any.whl (16 kB)
Collecting click (from nltk->unstructured->-r requirements.txt (line 3))
  Obtaining dependency information for click from https://files.pythonhosted.org/packages/1a/70/e63223f8116931d365993d4a6b7ef653a4d920b41d03de7c59499962821f/click-8.1.6-py3-none-any.whl.metadata
  Using cached click-8.1.6-py3-none-any.whl.metadata (3.0 kB)
Collecting joblib (from nltk->unstructured->-r requirements.txt (line 3))
  Obtaining dependency information for joblib from https://files.pythonhosted.org/packages/10/40/d551139c85db202f1f384ba8bcf96aca2f329440a844f924c8a0040b6d02/joblib-1.3.2-py3-none-any.whl.metadata
  Using cached joblib-1.3.2-py3-none-any.whl.metadata (5.4 kB)
Collecting regex>=2021.8.3 (from nltk->unstructured->-r requirements.txt (line 3))
  Obtaining dependency information for regex>=2021.8.3 from https://files.pythonhosted.org/packages/6b/20/8a419181449227182d61908484477d23d01b2b35211a45e838b746da8bb4/regex-2023.8.8-cp310-cp310-macosx_10_9_x86_64.whl.metadata
  Using cached regex-2023.8.8-cp310-cp310-macosx_10_9_x86_64.whl.metadata (40 kB)
Collecting multidict<7.0,>=4.5 (from aiohttp<4.0.0,>=3.8.3->langchain>=0.0.205->kor->-r requirements.txt (line 1))
  Using cached multidict-6.0.4-cp310-cp310-macosx_10_9_x86_64.whl (29 kB)
Collecting yarl<2.0,>=1.0 (from aiohttp<4.0.0,>=3.8.3->langchain>=0.0.205->kor->-r requirements.txt (line 1))
  Using cached yarl-1.9.2-cp310-cp310-macosx_10_9_x86_64.whl (65 kB)
Collecting frozenlist>=1.1.1 (from aiohttp<4.0.0,>=3.8.3->langchain>=0.0.205->kor->-r requirements.txt (line 1))
  Obtaining dependency information for frozenlist>=1.1.1 from https://files.pythonhosted.org/packages/a3/5b/c785feda30d9fda8c1b1a11941e91253f59aeaf13e87ebe908d0f3f6c628/frozenlist-1.4.0-cp310-cp310-macosx_10_9_x86_64.whl.metadata
  Downloading frozenlist-1.4.0-cp310-cp310-macosx_10_9_x86_64.whl.metadata (5.2 kB)
Collecting aiosignal>=1.1.2 (from aiohttp<4.0.0,>=3.8.3->langchain>=0.0.205->kor->-r requirements.txt (line 1))
  Using cached aiosignal-1.3.1-py3-none-any.whl (7.6 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.6.0,>=0.5.7->langchain>=0.0.205->kor->-r requirements.txt (line 1))
  Obtaining dependency information for marshmallow<4.0.0,>=3.18.0 from https://files.pythonhosted.org/packages/ed/3c/cebfdcad015240014ff08b883d1c0c427f2ba45ae8c6572851b6ef136cad/marshmallow-3.20.1-py3-none-any.whl.metadata
  Using cached marshmallow-3.20.1-py3-none-any.whl.metadata (7.8 kB)
Collecting typing-inspect<1,>=0.4.0 (from dataclasses-json<0.6.0,>=0.5.7->langchain>=0.0.205->kor->-r requirements.txt (line 1))
  Obtaining dependency information for typing-inspect<1,>=0.4.0 from https://files.pythonhosted.org/packages/65/f3/107a22063bf27bdccf2024833d3445f4eea42b2e598abfbd46f6a63b6cb0/typing_inspect-0.9.0-py3-none-any.whl.metadata
  Using cached typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB)
Collecting typing-extensions>=4.2.0 (from pydantic<2,>=1->langchain>=0.0.205->kor->-r requirements.txt (line 1))
  Obtaining dependency information for typing-extensions>=4.2.0 from https://files.pythonhosted.org/packages/ec/6b/63cc3df74987c36fe26157ee12e09e8f9db4de771e0f3404263117e75b95/typing_extensions-4.7.1-py3-none-any.whl.metadata
  Using cached typing_extensions-4.7.1-py3-none-any.whl.metadata (3.1 kB)
Collecting greenlet!=0.4.17 (from SQLAlchemy<3,>=1.4->langchain>=0.0.205->kor->-r requirements.txt (line 1))
  Using cached greenlet-2.0.2-cp310-cp310-macosx_11_0_x86_64.whl (242 kB)
Collecting h11<1,>=0.9.0 (from wsproto>=0.14->trio-websocket~=0.9->selenium->-r requirements.txt (line 2))
  Using cached h11-0.14.0-py3-none-any.whl (58 kB)
Collecting packaging>=17.0 (from marshmallow<4.0.0,>=3.18.0->dataclasses-json<0.6.0,>=0.5.7->langchain>=0.0.205->kor->-r requirements.txt (line 1))
  Using cached packaging-23.1-py3-none-any.whl (48 kB)
Collecting mypy-extensions>=0.3.0 (from typing-inspect<1,>=0.4.0->dataclasses-json<0.6.0,>=0.5.7->langchain>=0.0.205->kor->-r requirements.txt (line 1))
  Using cached mypy_extensions-1.0.0-py3-none-any.whl (4.7 kB)
Using cached kor-0.13.0-py3-none-any.whl (29 kB)
Using cached selenium-4.11.2-py3-none-any.whl (7.2 MB)
Downloading unstructured-0.9.2-py3-none-any.whl (1.4 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.4/1.4 MB 57.4 MB/s eta 0:00:00
Using cached certifi-2023.7.22-py3-none-any.whl (158 kB)
Downloading langchain-0.0.262-py3-none-any.whl (1.5 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.5/1.5 MB 63.4 MB/s eta 0:00:00
Using cached openai-0.27.8-py3-none-any.whl (73 kB)
Using cached requests-2.31.0-py3-none-any.whl (62 kB)
Using cached trio-0.22.2-py3-none-any.whl (400 kB)
Using cached trio_websocket-0.10.3-py3-none-any.whl (17 kB)
Using cached chardet-5.2.0-py3-none-any.whl (199 kB)
Using cached lxml-4.9.3-cp310-cp310-macosx_11_0_x86_64.whl (4.8 MB)
Using cached aiohttp-3.8.5-cp310-cp310-macosx_10_9_x86_64.whl (365 kB)
Downloading async_timeout-4.0.3-py3-none-any.whl (5.7 kB)
Using cached charset_normalizer-3.2.0-cp310-cp310-macosx_10_9_x86_64.whl (126 kB)
Downloading dataclasses_json-0.5.14-py3-none-any.whl (26 kB)
Using cached exceptiongroup-1.1.2-py3-none-any.whl (14 kB)
Downloading langsmith-0.0.21-py3-none-any.whl (32 kB)
Using cached numexpr-2.8.5-cp310-cp310-macosx_10_9_x86_64.whl (101 kB)
Downloading numpy-1.25.2-cp310-cp310-macosx_10_9_x86_64.whl (20.8 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20.8/20.8 MB 56.0 MB/s eta 0:00:00
Using cached pydantic-1.10.12-cp310-cp310-macosx_10_9_x86_64.whl (2.9 MB)
Using cached PyYAML-6.0.1-cp310-cp310-macosx_10_9_x86_64.whl (189 kB)
Using cached regex-2023.8.8-cp310-cp310-macosx_10_9_x86_64.whl (294 kB)
Using cached SQLAlchemy-2.0.19-cp310-cp310-macosx_10_9_x86_64.whl (2.0 MB)
Downloading urllib3-2.0.4-py3-none-any.whl (123 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 123.9/123.9 kB 13.2 MB/s eta 0:00:00
Using cached click-8.1.6-py3-none-any.whl (97 kB)
Using cached joblib-1.3.2-py3-none-any.whl (302 kB)
Downloading tqdm-4.66.1-py3-none-any.whl (78 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78.3/78.3 kB 10.7 MB/s eta 0:00:00
Using cached frozenlist-1.4.0-cp310-cp310-macosx_10_9_x86_64.whl (46 kB)
Using cached marshmallow-3.20.1-py3-none-any.whl (49 kB)
Using cached typing_extensions-4.7.1-py3-none-any.whl (33 kB)
Using cached typing_inspect-0.9.0-py3-none-any.whl (8.8 kB)
Installing collected packages: sortedcontainers, pytz, filetype, urllib3, typing-extensions, tqdm, tenacity, tabulate, soupsieve, sniffio, six, regex, PyYAML, python-magic, pysocks, packaging, numpy, mypy-extensions, multidict, lxml, joblib, idna, h11, greenlet, frozenlist, exceptiongroup, click, charset-normalizer, chardet, certifi, attrs, async-timeout, yarl, wsproto, typing-inspect, SQLAlchemy, requests, python-dateutil, pydantic, outcome, numexpr, nltk, marshmallow, beautifulsoup4, aiosignal, unstructured, trio, pandas, openapi-schema-pydantic, markdownify, langsmith, dataclasses-json, aiohttp, trio-websocket, openai, langchain, selenium, kor
Successfully installed PyYAML-6.0.1 SQLAlchemy-2.0.19 aiohttp-3.8.5 aiosignal-1.3.1 async-timeout-4.0.3 attrs-23.1.0 beautifulsoup4-4.12.2 certifi-2023.7.22 chardet-5.2.0 charset-normalizer-3.2.0 click-8.1.6 dataclasses-json-0.5.14 exceptiongroup-1.1.2 filetype-1.2.0 frozenlist-1.4.0 greenlet-2.0.2 h11-0.14.0 idna-3.4 joblib-1.3.2 kor-0.13.0 langchain-0.0.262 langsmith-0.0.21 lxml-4.9.3 markdownify-0.11.6 marshmallow-3.20.1 multidict-6.0.4 mypy-extensions-1.0.0 nltk-3.8.1 numexpr-2.8.5 numpy-1.25.2 openai-0.27.8 openapi-schema-pydantic-1.2.4 outcome-1.2.0 packaging-23.1 pandas-1.5.3 pydantic-1.10.12 pysocks-1.7.1 python-dateutil-2.8.2 python-magic-0.4.27 pytz-2023.3 regex-2023.8.8 requests-2.31.0 selenium-4.11.2 six-1.16.0 sniffio-1.3.0 sortedcontainers-2.4.0 soupsieve-2.4.1 tabulate-0.9.0 tenacity-8.2.2 tqdm-4.66.1 trio-0.22.2 trio-websocket-0.10.3 typing-extensions-4.7.1 typing-inspect-0.9.0 unstructured-0.9.2 urllib3-2.0.4 wsproto-1.2.0 yarl-1.9.2
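
Before rerunning the example, it can help to confirm that the extra packages are importable from the active environment. The following is a minimal sketch using only the standard library; `missing_packages` is a hypothetical helper name, not part of kor.

```python
import importlib.util
from typing import Iterable, List


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Import names for the four packages listed in requirements.txt.
    print(missing_packages(["kor", "selenium", "unstructured", "markdownify"]))
```

An empty list means all four modules were found in the environment that runs the script.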

natea commented 1 year ago

This is the version of Python that I'm using:

(kor) nateaune@Nates-MBP kor % python example3.py
  File "/Users/nateaune/Documents/code/kor/example3.py", line 100
    document_extraction_results = await extract_from_documents(
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
SyntaxError: 'await' outside function
(kor) nateaune@Nates-MBP kor % which python
/Users/nateaune/.pyenv/shims/python
(kor) nateaune@Nates-MBP kor % python
Python 3.10.10 (main, Mar 29 2023, 14:29:38) [Clang 14.0.0 (clang-1400.0.29.202)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>

eyurtsev commented 1 year ago

Ah, this is async code, and the issue is that it's being executed from a sync environment: `await` is only valid inside an `async def` function.

You can use a Jupyter notebook to run the code, since it supports top-level `await` by default.

Or else you can do something like this:

>>> async def f(): print('hello')
... 
>>> import asyncio
>>> asyncio.run(f())
hello

Wrap the `await ...` code in an async function, then use `asyncio.run` to run it.
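
The same pattern in script form might look like the sketch below. `fetch_results` is a made-up stand-in for the awaited call from the docs (for example `extract_from_documents`), so only the structure carries over.

```python
import asyncio


async def fetch_results():
    # Hypothetical stand-in for the awaited call in the documentation example.
    await asyncio.sleep(0)
    return ["result-1", "result-2"]


async def main():
    # `await` is legal here because this line is inside an `async def`.
    return await fetch_results()


if __name__ == "__main__":
    # asyncio.run creates an event loop, runs main() to completion, then closes the loop.
    results = asyncio.run(main())
    print(results)
```

Returning the value from `main()` makes the results available to the synchronous code that follows the `asyncio.run` call.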

natea commented 1 year ago

I was able to avoid the error, but it didn't produce the dataframe results I was expecting. Here is the code I used:

async def extract():
    with get_openai_callback() as cb:
        document_extraction_results = await extract_from_documents(
            chain, split_docs, max_concurrency=5, use_uid=False, return_exceptions=True
        )
        print(f"Total Tokens: {cb.total_tokens}")
        print(f"Prompt Tokens: {cb.prompt_tokens}")
        print(f"Completion Tokens: {cb.completion_tokens}")
        print(f"Successful Requests: {cb.successful_requests}")
        print(f"Total Cost (USD): ${cb.total_cost}")
        return document_extraction_results

document_extraction_results = asyncio.run(extract())

validated_data = list(
    itertools.chain.from_iterable(
        extraction["validated_data"] for extraction in document_extraction_results
    )
)
len(validated_data)

# Extraction is not perfect, but you can use a better LLM or provide more examples!

pd.DataFrame(record.dict() for record in validated_data)

eyurtsev commented 1 year ago

What did you expect? What did you get? Was the issue an exception, or bad results?

natea commented 1 year ago

I guess I had expected that it would show me the same results as in your example: https://eyurtsev.github.io/kor/document_extraction.html

[Screenshot: CleanShot 2023-08-17 at 15 04 27@2x]
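
One possible explanation, offered as a guess rather than a diagnosis: the documentation example runs in a notebook, which echoes the value of the last expression in a cell. When the same code runs as a script (`python example3.py`), bare expressions such as `len(validated_data)` or `pd.DataFrame(...)` are evaluated but never displayed. The sketch below uses made-up records in place of the real extraction results to show the difference.

```python
import itertools

# Made-up stand-in for `document_extraction_results`; real values come from kor.
document_extraction_results = [
    {"validated_data": [{"name": "Show A"}, {"name": "Show B"}]},
    {"validated_data": [{"name": "Show C"}]},
]

# Flatten the per-document lists into one list of records.
validated_data = list(
    itertools.chain.from_iterable(
        extraction["validated_data"] for extraction in document_extraction_results
    )
)

# In a script these must be printed explicitly; a notebook would echo them.
print(len(validated_data))
for record in validated_data:
    print(record)
```

With pandas, the equivalent change would be wrapping the final `pd.DataFrame(...)` expression in `print(...)`.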

natea commented 1 year ago

I'm seeing a lot of these errors, so maybe there's a problem with the way the ChromeDriver is set up?

Error fetching or processing a, exception: Message: invalid argument
  (Session info: headless chrome=115.0.5790.170)
Stacktrace:
0   chromedriver                        0x0000000103086a6c chromedriver + 4303468
1   chromedriver                        0x000000010307f198 chromedriver + 4272536
2   chromedriver                        0x0000000102cb33ec chromedriver + 291820
3   chromedriver                        0x0000000102c9ac44 chromedriver + 191556
4   chromedriver                        0x0000000102c988c8 chromedriver + 182472
5   chromedriver                        0x0000000102c99310 chromedriver + 185104
6   chromedriver                        0x0000000102cb5594 chromedriver + 300436
7   chromedriver                        0x0000000102d29c80 chromedriver + 777344
8   chromedriver                        0x0000000102d29628 chromedriver + 775720
9   chromedriver                        0x0000000102ce4b40 chromedriver + 494400
10  chromedriver                        0x0000000102ce5988 chromedriver + 498056
11  chromedriver                        0x0000000103047924 chromedriver + 4045092
12  chromedriver                        0x000000010304be68 chromedriver + 4062824
13  chromedriver                        0x0000000103052088 chromedriver + 4087944
14  chromedriver                        0x000000010304c96c chromedriver + 4065644
15  chromedriver                        0x0000000103024e64 chromedriver + 3903076
16  chromedriver                        0x000000010306855c chromedriver + 4179292
17  chromedriver                        0x00000001030686b4 chromedriver + 4179636
18  chromedriver                        0x0000000103078978 chromedriver + 4245880
19  libsystem_pthread.dylib             0x00000001980cbfa8 _pthread_start + 148
20  libsystem_pthread.dylib             0x00000001980c6da0 thread_start + 8

Error fetching or processing r, exception: Message: invalid argument
  (Session info: headless chrome=115.0.5790.170)
Stacktrace:
0   chromedriver                        0x0000000103086a6c chromedriver + 4303468
1   chromedriver                        0x000000010307f198 chromedriver + 4272536
2   chromedriver                        0x0000000102cb33ec chromedriver + 291820
3   chromedriver                        0x0000000102c9ac44 chromedriver + 191556
4   chromedriver                        0x0000000102c988c8 chromedriver + 182472
5   chromedriver                        0x0000000102c99310 chromedriver + 185104
6   chromedriver                        0x0000000102cb5594 chromedriver + 300436
7   chromedriver                        0x0000000102d29c80 chromedriver + 777344
8   chromedriver                        0x0000000102d29628 chromedriver + 775720
9   chromedriver                        0x0000000102ce4b40 chromedriver + 494400
10  chromedriver                        0x0000000102ce5988 chromedriver + 498056
11  chromedriver                        0x0000000103047924 chromedriver + 4045092
12  chromedriver                        0x000000010304be68 chromedriver + 4062824
13  chromedriver                        0x0000000103052088 chromedriver + 4087944
14  chromedriver                        0x000000010304c96c chromedriver + 4065644
15  chromedriver                        0x0000000103024e64 chromedriver + 3903076
16  chromedriver                        0x000000010306855c chromedriver + 4179292
17  chromedriver                        0x00000001030686b4 chromedriver + 4179636
18  chromedriver                        0x0000000103078978 chromedriver + 4245880
19  libsystem_pthread.dylib             0x00000001980cbfa8 _pthread_start + 148
20  libsystem_pthread.dylib             0x00000001980c6da0 thread_start + 8

Watch the trailer for Silo

[Silo

 Latest Episode: Jun 30](/tv/silo)

eyurtsev commented 1 year ago

@natea Yeah, this is an issue with the loader (`from langchain.document_loaders import SeleniumURLLoader`). I would look online to see how to resolve this.
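
The one-character names in the log ("Error fetching or processing a", "... processing r") may be a hint. If the loader iterates over its `urls` argument, passing a single string instead of a list would hand it one character at a time, and each character is an invalid URL. This is an assumption about the cause, not something confirmed in this thread. `as_url_list` is a hypothetical helper and the URL is a placeholder.

```python
from typing import List, Union


def as_url_list(urls: Union[str, List[str]]) -> List[str]:
    """Hypothetical helper: wrap a lone URL string in a list."""
    if isinstance(urls, str):
        return [urls]
    return list(urls)


url = "https://example.com/page"  # placeholder URL

# Iterating over a string walks its characters, not whole URLs.
print(list(url)[:5])

# Wrapping it first yields a list containing one complete URL.
print(as_url_list(url))
```

If this is the cause, passing `[url]` rather than `url` to the loader would be the corresponding change.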