abrichr opened this issue 1 year ago
Two issues with the following line (tests/openadapt/test_summary_mixin.py:10):

    REPLAY = DemoReplayStrategy(RECORDING)
@dianzrong can you please modify this to test only the summary mixin and not other functionality?
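For illustration, a minimal sketch of a summary-only test that exercises the same parsing step get_summary wraps, without constructing DemoReplayStrategy at all. The sumy calls mirror openadapt/strategies/mixins/summary.py line 48 as shown in the tracebacks below; the test name and sample text are illustrative assumptions:

    from sumy.nlp.tokenizers import Tokenizer
    from sumy.parsers.plaintext import PlaintextParser

    def test_summary_tokenization():
        # Parse a two-sentence string the same way get_summary does
        # (requires the NLTK punkt data discussed further down).
        text = "This is one sentence. This is another."
        parser = PlaintextParser.from_string(text, Tokenizer("english"))
        assert len(parser.document.sentences) == 2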
Not sure what's going on here, @jesicasusanto can you please take a look?
@abrichr I got something similar on Windows as well; see below:
(openadapt-py3.10) PS P:\OpenAdapt AI - MLDS AI\cloned_repo\OpenAdapt> pytest
================================ test session starts =================================
platform win32 -- Python 3.10.11, pytest-7.1.3, pluggy-1.0.0
rootdir: P:\OpenAdapt AI - MLDS AI\cloned_repo\OpenAdapt
plugins: anyio-3.7.0
collected 23 items / 1 error
=========================================================== ERRORS ============================================================
___________________________________ ERROR collecting tests/openadapt/test_summary_mixin.py ____________________________________
tests\openadapt\test_summary_mixin.py:10: in <module>
REPLAY = DemoReplayStrategy(RECORDING)
openadapt\strategies\demo.py:41: in __init__
self.screenshots = get_screenshots(recording)
openadapt\crud.py:149: in get_screenshots
screenshots[0].prev = screenshots[0]
E IndexError: list index out of range
------------------------------------------------------- Captured stderr -------------------------------------------------------
2023-06-20 13:52:09.040 | INFO | openadapt.strategies.mixins.sam:_initialize_model:58 - downloading checkpoint_url='https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth' to checkpoint_file_path=WindowsPath('checkpoints/sam_vit_h_4b8939.pth')
2023-06-20 13:57:01.191 | INFO | openadapt.strategies.mixins.huggingface:__init__:32 - model_name='gpt2'
Downloading (…)lve/main/config.json: 100%|██████████| 665/665 [00:00<?, ?B/s]
Downloading (…)olve/main/vocab.json: 100%|██████████| 1.04M/1.04M [00:00<00:00, 5.46MB/s]
Downloading (…)olve/main/merges.txt: 100%|██████████| 456k/456k [00:00<00:00, 5.08MB/s]
Downloading (…)/main/tokenizer.json: 100%|██████████| 1.36M/1.36M [00:00<00:00, 7.68MB/s]
Downloading pytorch_model.bin: 100%|██████████| 548M/548M [00:47<00:00, 11.6MB/s]
Downloading (…)neration_config.json: 100%|██████████| 124/124 [00:00<?, ?B/s]
====================================================== warnings summary =======================================================
C:\Users\Krish Patel\AppData\Local\pypoetry\Cache\virtualenvs\openadapt-NIwuSzHt-py3.10\lib\site-packages\fuzzywuzzy\fuzz.py:11
C:\Users\Krish Patel\AppData\Local\pypoetry\Cache\virtualenvs\openadapt-NIwuSzHt-py3.10\lib\site-packages\fuzzywuzzy\fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning
warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')
C:\Users\Krish Patel\AppData\Local\pypoetry\Cache\virtualenvs\openadapt-NIwuSzHt-py3.10\lib\site-packages\onnxruntime\capi\_pybind_state.py:28
C:\Users\Krish Patel\AppData\Local\pypoetry\Cache\virtualenvs\openadapt-NIwuSzHt-py3.10\lib\site-packages\onnxruntime\capi\_pybind_state.py:28: DeprecationWarning: invalid escape sequence '\S'
"(other than %SystemRoot%\System32), "
C:\Users\Krish Patel\AppData\Local\pypoetry\Cache\virtualenvs\openadapt-NIwuSzHt-py3.10\lib\site-packages\pkg_resources\__init__.py:121
C:\Users\Krish Patel\AppData\Local\pypoetry\Cache\virtualenvs\openadapt-NIwuSzHt-py3.10\lib\site-packages\pkg_resources\__init__.py:121: DeprecationWarning: pkg_resources is deprecated as an API
warnings.warn("pkg_resources is deprecated as an API", DeprecationWarning)
C:\Users\Krish Patel\AppData\Local\pypoetry\Cache\virtualenvs\openadapt-NIwuSzHt-py3.10\lib\site-packages\pkg_resources\__init__.py:2870
C:\Users\Krish Patel\AppData\Local\pypoetry\Cache\virtualenvs\openadapt-NIwuSzHt-py3.10\lib\site-packages\pkg_resources\__init__.py:2870: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('mpl_toolkits')`.
Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
declare_namespace(pkg)
C:\Users\Krish Patel\AppData\Local\pypoetry\Cache\virtualenvs\openadapt-NIwuSzHt-py3.10\lib\site-packages\pkg_resources\__init__.py:2870
C:\Users\Krish Patel\AppData\Local\pypoetry\Cache\virtualenvs\openadapt-NIwuSzHt-py3.10\lib\site-packages\pkg_resources\__init__.py:2870
C:\Users\Krish Patel\AppData\Local\pypoetry\Cache\virtualenvs\openadapt-NIwuSzHt-py3.10\lib\site-packages\pkg_resources\__init__.py:2870
C:\Users\Krish Patel\AppData\Local\pypoetry\Cache\virtualenvs\openadapt-NIwuSzHt-py3.10\lib\site-packages\pkg_resources\__init__.py:2870
C:\Users\Krish Patel\AppData\Local\pypoetry\Cache\virtualenvs\openadapt-NIwuSzHt-py3.10\lib\site-packages\pkg_resources\__init__.py:2870: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('sphinxcontrib')`.
Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
declare_namespace(pkg)
C:\Users\Krish Patel\AppData\Local\pypoetry\Cache\virtualenvs\openadapt-NIwuSzHt-py3.10\lib\site-packages\huggingface_hub\file_download.py:133
C:\Users\Krish Patel\AppData\Local\pypoetry\Cache\virtualenvs\openadapt-NIwuSzHt-py3.10\lib\site-packages\huggingface_hub\file_download.py:133: UserWarning: `huggingface_hub` cache-system uses symlinks by default to efficiently store duplicated files but your machine does not support them in C:\Users\Krish Patel\.cache\huggingface\hub. Caching files will still work but in a degraded version that might require more space on your disk. This warning can be disabled by setting the `HF_HUB_DISABLE_SYMLINKS_WARNING` environment variable. For more details, see https://huggingface.co/docs/huggingface_hub/how-to-cache#limitations.
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
warnings.warn(message)
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=================================================== short test summary info ===================================================
ERROR tests/openadapt/test_summary_mixin.py - IndexError: list index out of range
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
========================================== 9 warnings, 1 error in 388.13s (0:06:28) ===========================================
(openadapt-py3.10) PS P:\OpenAdapt AI - MLDS AI\cloned_repo\OpenAdapt>
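The collection error above comes from crud.get_screenshots indexing an empty screenshot list because no recording exists in the database yet. One way to make the test module degrade gracefully is a module-level skip, sketched here (crud.get_latest_recording is an assumed helper; adjust to whatever the crud module actually exposes):

    import pytest
    from openadapt import crud

    # Skip the whole module when nothing has been recorded yet;
    # get_latest_recording is an assumption based on this thread.
    if crud.get_latest_recording() is None:
        pytest.skip(
            "no recording in the database; run record first",
            allow_module_level=True,
        )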
This is what I get when running pytest on my newly cloned repo.
Any thoughts on why 2 tests are failing? Note: I have not run record yet.
(openadapt-py3.10) PS P:\OpenAdapt AI - MLDS AI\cloned_repo\OpenAdapt> pytest
============================================================================================== test session starts ==============================================================================================
platform win32 -- Python 3.10.11, pytest-7.1.3, pluggy-1.0.0
rootdir: P:\OpenAdapt AI - MLDS AI\cloned_repo\OpenAdapt
plugins: anyio-3.7.0
collected 25 items
tests\openadapt\test_crop.py . [ 4%]
tests\openadapt\test_events.py ....... [ 32%]
tests\openadapt\test_scrub.py ............... [ 92%]
tests\openadapt\test_summary.py FF [100%]
=================================================================================================== FAILURES ====================================================================================================
______________________________________________________________________________________________ test_summary_empty _______________________________________________________________________________________________
self = <sumy.nlp.tokenizers.Tokenizer object at 0x0000019241BDC7F0>, language = 'english'
def _get_sentence_tokenizer(self, language):
if language in self.SPECIAL_SENTENCE_TOKENIZERS:
return self.SPECIAL_SENTENCE_TOKENIZERS[language]
try:
path = to_string("tokenizers/punkt/%s.pickle") % to_string(language)
> return nltk.data.load(path)
C:\Users\Krish Patel\AppData\Local\pypoetry\Cache\virtualenvs\openadapt-NIwuSzHt-py3.10\lib\site-packages\sumy\nlp\tokenizers.py:172:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
resource_url = 'nltk:tokenizers/punkt/english.pickle', format = 'pickle', cache = True, verbose = False, logic_parser = None, fstruct_reader = None, encoding = None
def load(
resource_url,
format="auto",
cache=True,
verbose=False,
logic_parser=None,
fstruct_reader=None,
encoding=None,
):
"""
Load a given resource from the NLTK data package. The following
resource formats are currently supported:
- ``pickle``
- ``json``
- ``yaml``
- ``cfg`` (context free grammars)
- ``pcfg`` (probabilistic CFGs)
- ``fcfg`` (feature-based CFGs)
- ``fol`` (formulas of First Order Logic)
- ``logic`` (Logical formulas to be parsed by the given logic_parser)
- ``val`` (valuation of First Order Logic model)
- ``text`` (the file contents as a unicode string)
- ``raw`` (the raw file contents as a byte string)
If no format is specified, ``load()`` will attempt to determine a
format based on the resource name's file extension. If that
fails, ``load()`` will raise a ``ValueError`` exception.
For all text formats (everything except ``pickle``, ``json``, ``yaml`` and ``raw``),
it tries to decode the raw contents using UTF-8, and if that doesn't
work, it tries with ISO-8859-1 (Latin-1), unless the ``encoding``
is specified.
:type resource_url: str
:param resource_url: A URL specifying where the resource should be
loaded from. The default protocol is "nltk:", which searches
for the file in the the NLTK data package.
:type cache: bool
:param cache: If true, add this resource to a cache. If load()
finds a resource in its cache, then it will return it from the
cache rather than loading it.
:type verbose: bool
:param verbose: If true, print a message when loading a resource.
Messages are not displayed when a resource is retrieved from
the cache.
:type logic_parser: LogicParser
:param logic_parser: The parser that will be used to parse logical
expressions.
:type fstruct_reader: FeatStructReader
:param fstruct_reader: The parser that will be used to parse the
feature structure of an fcfg.
:type encoding: str
:param encoding: the encoding of the input; only used for text formats.
"""
resource_url = normalize_resource_url(resource_url)
resource_url = add_py3_data(resource_url)
# Determine the format of the resource.
if format == "auto":
resource_url_parts = resource_url.split(".")
ext = resource_url_parts[-1]
if ext == "gz":
ext = resource_url_parts[-2]
format = AUTO_FORMATS.get(ext)
if format is None:
raise ValueError(
"Could not determine format for %s based "
'on its file\nextension; use the "format" '
"argument to specify the format explicitly." % resource_url
)
if format not in FORMATS:
raise ValueError(f"Unknown format type: {format}!")
# If we've cached the resource, then just return it.
if cache:
resource_val = _resource_cache.get((resource_url, format))
if resource_val is not None:
if verbose:
print(f"<<Using cached copy of {resource_url}>>")
return resource_val
# Let the user know what's going on.
if verbose:
print(f"<<Loading {resource_url}>>")
# Load the resource.
> opened_resource = _open(resource_url)
C:\Users\Krish Patel\AppData\Local\pypoetry\Cache\virtualenvs\openadapt-NIwuSzHt-py3.10\lib\site-packages\nltk\data.py:750:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
resource_url = 'nltk:tokenizers/punkt/english.pickle'
def _open(resource_url):
"""
Helper function that returns an open file object for a resource,
given its resource URL. If the given resource URL uses the "nltk:"
protocol, or uses no protocol, then use ``nltk.data.find`` to find
its path, and open it with the given mode; if the resource URL
uses the 'file' protocol, then open the file with the given mode;
otherwise, delegate to ``urllib2.urlopen``.
:type resource_url: str
:param resource_url: A URL specifying where the resource should be
loaded from. The default protocol is "nltk:", which searches
for the file in the the NLTK data package.
"""
resource_url = normalize_resource_url(resource_url)
protocol, path_ = split_resource_url(resource_url)
if protocol is None or protocol.lower() == "nltk":
> return find(path_, path + [""]).open()
C:\Users\Krish Patel\AppData\Local\pypoetry\Cache\virtualenvs\openadapt-NIwuSzHt-py3.10\lib\site-packages\nltk\data.py:876:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
resource_name = 'tokenizers/punkt/english.pickle'
paths = ['C:\\Users\\Krish Patel/nltk_data', 'C:\\Users\\Krish Patel\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\openadapt-...penadapt-NIwuSzHt-py3.10\\lib\\nltk_data', 'C:\\Users\\Krish Patel\\AppData\\Roaming\\nltk_data', 'C:\\nltk_data', ...]
def find(resource_name, paths=None):
"""
Find the given resource by searching through the directories and
zip files in paths, where a None or empty string specifies an absolute path.
Returns a corresponding path name. If the given resource is not
found, raise a ``LookupError``, whose message gives a pointer to
the installation instructions for the NLTK downloader.
Zip File Handling:
- If ``resource_name`` contains a component with a ``.zip``
extension, then it is assumed to be a zipfile; and the
remaining path components are used to look inside the zipfile.
- If any element of ``nltk.data.path`` has a ``.zip`` extension,
then it is assumed to be a zipfile.
- If a given resource name that does not contain any zipfile
component is not found initially, then ``find()`` will make a
second attempt to find that resource, by replacing each
component *p* in the path with *p.zip/p*. For example, this
allows ``find()`` to map the resource name
``corpora/chat80/cities.pl`` to a zip file path pointer to
``corpora/chat80.zip/chat80/cities.pl``.
- When using ``find()`` to locate a directory contained in a
zipfile, the resource name must end with the forward slash
character. Otherwise, ``find()`` will not locate the
directory.
:type resource_name: str or unicode
:param resource_name: The name of the resource to search for.
Resource names are posix-style relative path names, such as
``corpora/brown``. Directory names will be
automatically converted to a platform-appropriate path separator.
:rtype: str
"""
resource_name = normalize_resource_name(resource_name, True)
# Resolve default paths at runtime in-case the user overrides
# nltk.data.path
if paths is None:
paths = path
# Check if the resource name includes a zipfile name
m = re.match(r"(.*\.zip)/?(.*)$|", resource_name)
zipfile, zipentry = m.groups()
# Check each item in our path
for path_ in paths:
# Is the path item a zipfile?
if path_ and (os.path.isfile(path_) and path_.endswith(".zip")):
try:
return ZipFilePathPointer(path_, resource_name)
except OSError:
# resource not in zipfile
continue
# Is the path item a directory or is resource_name an absolute path?
elif not path_ or os.path.isdir(path_):
if zipfile is None:
p = os.path.join(path_, url2pathname(resource_name))
if os.path.exists(p):
if p.endswith(".gz"):
return GzipFileSystemPathPointer(p)
else:
return FileSystemPathPointer(p)
else:
p = os.path.join(path_, url2pathname(zipfile))
if os.path.exists(p):
try:
return ZipFilePathPointer(p, zipentry)
except OSError:
# resource not in zipfile
continue
# Fallback: if the path doesn't include a zip file, then try
# again, assuming that one of the path components is inside a
# zipfile of the same name.
if zipfile is None:
pieces = resource_name.split("/")
for i in range(len(pieces)):
modified_name = "/".join(pieces[:i] + [pieces[i] + ".zip"] + pieces[i:])
try:
return find(modified_name, paths)
except LookupError:
pass
# Identify the package (i.e. the .zip file) to download.
resource_zipname = resource_name.split("/")[1]
if resource_zipname.endswith(".zip"):
resource_zipname = resource_zipname.rpartition(".")[0]
# Display a friendly error message if the resource wasn't found:
msg = str(
"Resource \33[93m{resource}\033[0m not found.\n"
"Please use the NLTK Downloader to obtain the resource:\n\n"
"\33[31m" # To display red text in terminal.
">>> import nltk\n"
">>> nltk.download('{resource}')\n"
"\033[0m"
).format(resource=resource_zipname)
msg = textwrap_indent(msg)
msg += "\n For more information see: https://www.nltk.org/data.html\n"
msg += "\n Attempted to load \33[93m{resource_name}\033[0m\n".format(
resource_name=resource_name
)
msg += "\n Searched in:" + "".join("\n - %r" % d for d in paths)
sep = "*" * 70
resource_not_found = f"\n{sep}\n{msg}\n{sep}\n"
> raise LookupError(resource_not_found)
E LookupError:
E **********************************************************************
E Resource punkt not found.
E Please use the NLTK Downloader to obtain the resource:
E
E >>> import nltk
E >>> nltk.download('punkt')
E
E For more information see: https://www.nltk.org/data.html
E
E Attempted to load tokenizers/punkt/english.pickle
E
E Searched in:
E - 'C:\\Users\\Krish Patel/nltk_data'
E - 'C:\\Users\\Krish Patel\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\openadapt-NIwuSzHt-py3.10\\nltk_data'
E - 'C:\\Users\\Krish Patel\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\openadapt-NIwuSzHt-py3.10\\share\\nltk_data'
E - 'C:\\Users\\Krish Patel\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\openadapt-NIwuSzHt-py3.10\\lib\\nltk_data'
E - 'C:\\Users\\Krish Patel\\AppData\\Roaming\\nltk_data'
E - 'C:\\nltk_data'
E - 'D:\\nltk_data'
E - 'E:\\nltk_data'
E - ''
E **********************************************************************
C:\Users\Krish Patel\AppData\Local\pypoetry\Cache\virtualenvs\openadapt-NIwuSzHt-py3.10\lib\site-packages\nltk\data.py:583: LookupError
During handling of the above exception, another exception occurred:
def test_summary_empty():
empty_text = ""
> actual = REPLAY.get_summary(empty_text, 1)
tests\openadapt\test_summary.py:28:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
openadapt\strategies\mixins\summary.py:48: in get_summary
parser = PlaintextParser.from_string(text, Tokenizer("english"))
C:\Users\Krish Patel\AppData\Local\pypoetry\Cache\virtualenvs\openadapt-NIwuSzHt-py3.10\lib\site-packages\sumy\nlp\tokenizers.py:160: in __init__
self._sentence_tokenizer = self._get_sentence_tokenizer(tokenizer_language)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <sumy.nlp.tokenizers.Tokenizer object at 0x0000019241BDC7F0>, language = 'english'
def _get_sentence_tokenizer(self, language):
if language in self.SPECIAL_SENTENCE_TOKENIZERS:
return self.SPECIAL_SENTENCE_TOKENIZERS[language]
try:
path = to_string("tokenizers/punkt/%s.pickle") % to_string(language)
return nltk.data.load(path)
except (LookupError, zipfile.BadZipfile) as e:
> raise LookupError(
"NLTK tokenizers are missing or the language is not supported.\n"
"""Download them by following command: python -c "import nltk; nltk.download('punkt')"\n"""
"Original error was:\n" + str(e)
)
E LookupError: NLTK tokenizers are missing or the language is not supported.
E Download them by following command: python -c "import nltk; nltk.download('punkt')"
E Original error was:
E
E **********************************************************************
E Resource punkt not found.
E Please use the NLTK Downloader to obtain the resource:
E
E >>> import nltk
E >>> nltk.download('punkt')
E
E For more information see: https://www.nltk.org/data.html
E
E Attempted to load tokenizers/punkt/english.pickle
E
E Searched in:
E - 'C:\\Users\\Krish Patel/nltk_data'
E - 'C:\\Users\\Krish Patel\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\openadapt-NIwuSzHt-py3.10\\nltk_data'
E - 'C:\\Users\\Krish Patel\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\openadapt-NIwuSzHt-py3.10\\share\\nltk_data'
E - 'C:\\Users\\Krish Patel\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\openadapt-NIwuSzHt-py3.10\\lib\\nltk_data'
E - 'C:\\Users\\Krish Patel\\AppData\\Roaming\\nltk_data'
E - 'C:\\nltk_data'
E - 'D:\\nltk_data'
E - 'E:\\nltk_data'
E - ''
E **********************************************************************
C:\Users\Krish Patel\AppData\Local\pypoetry\Cache\virtualenvs\openadapt-NIwuSzHt-py3.10\lib\site-packages\sumy\nlp\tokenizers.py:174: LookupError
_____________________________________________________________________________________________ test_summary_sentence _____________________________________________________________________________________________
self = <sumy.nlp.tokenizers.Tokenizer object at 0x0000019241BCBE80>, language = 'english'
def _get_sentence_tokenizer(self, language):
if language in self.SPECIAL_SENTENCE_TOKENIZERS:
return self.SPECIAL_SENTENCE_TOKENIZERS[language]
try:
path = to_string("tokenizers/punkt/%s.pickle") % to_string(language)
> return nltk.data.load(path)
C:\Users\Krish Patel\AppData\Local\pypoetry\Cache\virtualenvs\openadapt-NIwuSzHt-py3.10\lib\site-packages\sumy\nlp\tokenizers.py:172:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
resource_url = 'nltk:tokenizers/punkt/english.pickle', format = 'pickle', cache = True, verbose = False, logic_parser = None, fstruct_reader = None, encoding = None
def load(
resource_url,
format="auto",
cache=True,
verbose=False,
logic_parser=None,
fstruct_reader=None,
encoding=None,
):
"""
Load a given resource from the NLTK data package. The following
resource formats are currently supported:
- ``pickle``
- ``json``
- ``yaml``
- ``cfg`` (context free grammars)
- ``pcfg`` (probabilistic CFGs)
- ``fcfg`` (feature-based CFGs)
- ``fol`` (formulas of First Order Logic)
- ``logic`` (Logical formulas to be parsed by the given logic_parser)
- ``val`` (valuation of First Order Logic model)
- ``text`` (the file contents as a unicode string)
- ``raw`` (the raw file contents as a byte string)
If no format is specified, ``load()`` will attempt to determine a
format based on the resource name's file extension. If that
fails, ``load()`` will raise a ``ValueError`` exception.
For all text formats (everything except ``pickle``, ``json``, ``yaml`` and ``raw``),
it tries to decode the raw contents using UTF-8, and if that doesn't
work, it tries with ISO-8859-1 (Latin-1), unless the ``encoding``
is specified.
:type resource_url: str
:param resource_url: A URL specifying where the resource should be
loaded from. The default protocol is "nltk:", which searches
for the file in the the NLTK data package.
:type cache: bool
:param cache: If true, add this resource to a cache. If load()
finds a resource in its cache, then it will return it from the
cache rather than loading it.
:type verbose: bool
:param verbose: If true, print a message when loading a resource.
Messages are not displayed when a resource is retrieved from
the cache.
:type logic_parser: LogicParser
:param logic_parser: The parser that will be used to parse logical
expressions.
:type fstruct_reader: FeatStructReader
:param fstruct_reader: The parser that will be used to parse the
feature structure of an fcfg.
:type encoding: str
:param encoding: the encoding of the input; only used for text formats.
"""
resource_url = normalize_resource_url(resource_url)
resource_url = add_py3_data(resource_url)
# Determine the format of the resource.
if format == "auto":
resource_url_parts = resource_url.split(".")
ext = resource_url_parts[-1]
if ext == "gz":
ext = resource_url_parts[-2]
format = AUTO_FORMATS.get(ext)
if format is None:
raise ValueError(
"Could not determine format for %s based "
'on its file\nextension; use the "format" '
"argument to specify the format explicitly." % resource_url
)
if format not in FORMATS:
raise ValueError(f"Unknown format type: {format}!")
# If we've cached the resource, then just return it.
if cache:
resource_val = _resource_cache.get((resource_url, format))
if resource_val is not None:
if verbose:
print(f"<<Using cached copy of {resource_url}>>")
return resource_val
# Let the user know what's going on.
if verbose:
print(f"<<Loading {resource_url}>>")
# Load the resource.
> opened_resource = _open(resource_url)
C:\Users\Krish Patel\AppData\Local\pypoetry\Cache\virtualenvs\openadapt-NIwuSzHt-py3.10\lib\site-packages\nltk\data.py:750:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
resource_url = 'nltk:tokenizers/punkt/english.pickle'
def _open(resource_url):
"""
Helper function that returns an open file object for a resource,
given its resource URL. If the given resource URL uses the "nltk:"
protocol, or uses no protocol, then use ``nltk.data.find`` to find
its path, and open it with the given mode; if the resource URL
uses the 'file' protocol, then open the file with the given mode;
otherwise, delegate to ``urllib2.urlopen``.
:type resource_url: str
:param resource_url: A URL specifying where the resource should be
loaded from. The default protocol is "nltk:", which searches
for the file in the the NLTK data package.
"""
resource_url = normalize_resource_url(resource_url)
protocol, path_ = split_resource_url(resource_url)
if protocol is None or protocol.lower() == "nltk":
> return find(path_, path + [""]).open()
C:\Users\Krish Patel\AppData\Local\pypoetry\Cache\virtualenvs\openadapt-NIwuSzHt-py3.10\lib\site-packages\nltk\data.py:876:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
resource_name = 'tokenizers/punkt/english.pickle'
paths = ['C:\\Users\\Krish Patel/nltk_data', 'C:\\Users\\Krish Patel\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\openadapt-...penadapt-NIwuSzHt-py3.10\\lib\\nltk_data', 'C:\\Users\\Krish Patel\\AppData\\Roaming\\nltk_data', 'C:\\nltk_data', ...]
def find(resource_name, paths=None):
"""
Find the given resource by searching through the directories and
zip files in paths, where a None or empty string specifies an absolute path.
Returns a corresponding path name. If the given resource is not
found, raise a ``LookupError``, whose message gives a pointer to
the installation instructions for the NLTK downloader.
Zip File Handling:
- If ``resource_name`` contains a component with a ``.zip``
extension, then it is assumed to be a zipfile; and the
remaining path components are used to look inside the zipfile.
- If any element of ``nltk.data.path`` has a ``.zip`` extension,
then it is assumed to be a zipfile.
- If a given resource name that does not contain any zipfile
component is not found initially, then ``find()`` will make a
second attempt to find that resource, by replacing each
component *p* in the path with *p.zip/p*. For example, this
allows ``find()`` to map the resource name
``corpora/chat80/cities.pl`` to a zip file path pointer to
``corpora/chat80.zip/chat80/cities.pl``.
- When using ``find()`` to locate a directory contained in a
zipfile, the resource name must end with the forward slash
character. Otherwise, ``find()`` will not locate the
directory.
:type resource_name: str or unicode
:param resource_name: The name of the resource to search for.
Resource names are posix-style relative path names, such as
``corpora/brown``. Directory names will be
automatically converted to a platform-appropriate path separator.
:rtype: str
"""
resource_name = normalize_resource_name(resource_name, True)
# Resolve default paths at runtime in-case the user overrides
# nltk.data.path
if paths is None:
paths = path
# Check if the resource name includes a zipfile name
m = re.match(r"(.*\.zip)/?(.*)$|", resource_name)
zipfile, zipentry = m.groups()
# Check each item in our path
for path_ in paths:
# Is the path item a zipfile?
if path_ and (os.path.isfile(path_) and path_.endswith(".zip")):
try:
return ZipFilePathPointer(path_, resource_name)
except OSError:
# resource not in zipfile
continue
# Is the path item a directory or is resource_name an absolute path?
elif not path_ or os.path.isdir(path_):
if zipfile is None:
p = os.path.join(path_, url2pathname(resource_name))
if os.path.exists(p):
if p.endswith(".gz"):
return GzipFileSystemPathPointer(p)
else:
return FileSystemPathPointer(p)
else:
p = os.path.join(path_, url2pathname(zipfile))
if os.path.exists(p):
try:
return ZipFilePathPointer(p, zipentry)
except OSError:
# resource not in zipfile
continue
# Fallback: if the path doesn't include a zip file, then try
# again, assuming that one of the path components is inside a
# zipfile of the same name.
if zipfile is None:
pieces = resource_name.split("/")
for i in range(len(pieces)):
modified_name = "/".join(pieces[:i] + [pieces[i] + ".zip"] + pieces[i:])
try:
return find(modified_name, paths)
except LookupError:
pass
# Identify the package (i.e. the .zip file) to download.
resource_zipname = resource_name.split("/")[1]
if resource_zipname.endswith(".zip"):
resource_zipname = resource_zipname.rpartition(".")[0]
# Display a friendly error message if the resource wasn't found:
msg = str(
"Resource \33[93m{resource}\033[0m not found.\n"
"Please use the NLTK Downloader to obtain the resource:\n\n"
"\33[31m" # To display red text in terminal.
">>> import nltk\n"
">>> nltk.download('{resource}')\n"
"\033[0m"
).format(resource=resource_zipname)
msg = textwrap_indent(msg)
msg += "\n For more information see: https://www.nltk.org/data.html\n"
msg += "\n Attempted to load \33[93m{resource_name}\033[0m\n".format(
resource_name=resource_name
)
msg += "\n Searched in:" + "".join("\n - %r" % d for d in paths)
sep = "*" * 70
resource_not_found = f"\n{sep}\n{msg}\n{sep}\n"
> raise LookupError(resource_not_found)
E LookupError:
E **********************************************************************
E Resource punkt not found.
E Please use the NLTK Downloader to obtain the resource:
E
E >>> import nltk
E >>> nltk.download('punkt')
E
E For more information see: https://www.nltk.org/data.html
E
E Attempted to load tokenizers/punkt/english.pickle
E
E Searched in:
E - 'C:\\Users\\Krish Patel/nltk_data'
E - 'C:\\Users\\Krish Patel\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\openadapt-NIwuSzHt-py3.10\\nltk_data'
E - 'C:\\Users\\Krish Patel\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\openadapt-NIwuSzHt-py3.10\\share\\nltk_data'
E - 'C:\\Users\\Krish Patel\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\openadapt-NIwuSzHt-py3.10\\lib\\nltk_data'
E - 'C:\\Users\\Krish Patel\\AppData\\Roaming\\nltk_data'
E - 'C:\\nltk_data'
E - 'D:\\nltk_data'
E - 'E:\\nltk_data'
E - ''
E **********************************************************************
C:\Users\Krish Patel\AppData\Local\pypoetry\Cache\virtualenvs\openadapt-NIwuSzHt-py3.10\lib\site-packages\nltk\data.py:583: LookupError
During handling of the above exception, another exception occurred:
def test_summary_sentence():
story = "However, this bottle was not marked “poison,” so Alice ventured to taste it, \
and finding it very nice, (it had, in fact, a sort of mixed flavour of cherry-tart, \
custard, pine-apple, roast turkey, toffee, and hot buttered toast,) \
she very soon finished it off."
> actual = REPLAY.get_summary(story, 1)
tests\openadapt\test_summary.py:37:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
openadapt\strategies\mixins\summary.py:48: in get_summary
parser = PlaintextParser.from_string(text, Tokenizer("english"))
C:\Users\Krish Patel\AppData\Local\pypoetry\Cache\virtualenvs\openadapt-NIwuSzHt-py3.10\lib\site-packages\sumy\nlp\tokenizers.py:160: in __init__
self._sentence_tokenizer = self._get_sentence_tokenizer(tokenizer_language)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <sumy.nlp.tokenizers.Tokenizer object at 0x0000019241BCBE80>, language = 'english'
def _get_sentence_tokenizer(self, language):
if language in self.SPECIAL_SENTENCE_TOKENIZERS:
return self.SPECIAL_SENTENCE_TOKENIZERS[language]
try:
path = to_string("tokenizers/punkt/%s.pickle") % to_string(language)
return nltk.data.load(path)
except (LookupError, zipfile.BadZipfile) as e:
> raise LookupError(
"NLTK tokenizers are missing or the language is not supported.\n"
"""Download them by following command: python -c "import nltk; nltk.download('punkt')"\n"""
"Original error was:\n" + str(e)
)
E LookupError: NLTK tokenizers are missing or the language is not supported.
E Download them by following command: python -c "import nltk; nltk.download('punkt')"
E Original error was:
E
E **********************************************************************
E Resource punkt not found.
E Please use the NLTK Downloader to obtain the resource:
E
E >>> import nltk
E >>> nltk.download('punkt')
E
E For more information see: https://www.nltk.org/data.html
E
E Attempted to load tokenizers/punkt/english.pickle
E
E Searched in:
E - 'C:\\Users\\Krish Patel/nltk_data'
E - 'C:\\Users\\Krish Patel\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\openadapt-NIwuSzHt-py3.10\\nltk_data'
E - 'C:\\Users\\Krish Patel\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\openadapt-NIwuSzHt-py3.10\\share\\nltk_data'
E - 'C:\\Users\\Krish Patel\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\openadapt-NIwuSzHt-py3.10\\lib\\nltk_data'
E - 'C:\\Users\\Krish Patel\\AppData\\Roaming\\nltk_data'
E - 'C:\\nltk_data'
E - 'D:\\nltk_data'
E - 'E:\\nltk_data'
E - ''
E **********************************************************************
C:\Users\Krish Patel\AppData\Local\pypoetry\Cache\virtualenvs\openadapt-NIwuSzHt-py3.10\lib\site-packages\sumy\nlp\tokenizers.py:174: LookupError
=============================================================================================== warnings summary ================================================================================================
C:\Users\Krish Patel\AppData\Local\pypoetry\Cache\virtualenvs\openadapt-NIwuSzHt-py3.10\lib\site-packages\fuzzywuzzy\fuzz.py:11
C:\Users\Krish Patel\AppData\Local\pypoetry\Cache\virtualenvs\openadapt-NIwuSzHt-py3.10\lib\site-packages\fuzzywuzzy\fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning
warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')
C:\Users\Krish Patel\AppData\Local\pypoetry\Cache\virtualenvs\openadapt-NIwuSzHt-py3.10\lib\site-packages\pkg_resources\__init__.py:121
C:\Users\Krish Patel\AppData\Local\pypoetry\Cache\virtualenvs\openadapt-NIwuSzHt-py3.10\lib\site-packages\pkg_resources\__init__.py:121: DeprecationWarning: pkg_resources is deprecated as an API
warnings.warn("pkg_resources is deprecated as an API", DeprecationWarning)
C:\Users\Krish Patel\AppData\Local\pypoetry\Cache\virtualenvs\openadapt-NIwuSzHt-py3.10\lib\site-packages\pkg_resources\__init__.py:2870
C:\Users\Krish Patel\AppData\Local\pypoetry\Cache\virtualenvs\openadapt-NIwuSzHt-py3.10\lib\site-packages\pkg_resources\__init__.py:2870: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('mpl_toolkits')`.
Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
declare_namespace(pkg)
C:\Users\Krish Patel\AppData\Local\pypoetry\Cache\virtualenvs\openadapt-NIwuSzHt-py3.10\lib\site-packages\pkg_resources\__init__.py:2870
C:\Users\Krish Patel\AppData\Local\pypoetry\Cache\virtualenvs\openadapt-NIwuSzHt-py3.10\lib\site-packages\pkg_resources\__init__.py:2870
C:\Users\Krish Patel\AppData\Local\pypoetry\Cache\virtualenvs\openadapt-NIwuSzHt-py3.10\lib\site-packages\pkg_resources\__init__.py:2870
C:\Users\Krish Patel\AppData\Local\pypoetry\Cache\virtualenvs\openadapt-NIwuSzHt-py3.10\lib\site-packages\pkg_resources\__init__.py:2870
C:\Users\Krish Patel\AppData\Local\pypoetry\Cache\virtualenvs\openadapt-NIwuSzHt-py3.10\lib\site-packages\pkg_resources\__init__.py:2870: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('sphinxcontrib')`.
Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
declare_namespace(pkg)
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================================================================ short test summary info ============================================================================================
FAILED tests/openadapt/test_summary.py::test_summary_empty - LookupError: NLTK tokenizers are missing or the language is not supported.
FAILED tests/openadapt/test_summary.py::test_summary_sentence - LookupError: NLTK tokenizers are missing or the language is not supported.
=================================================================================== 2 failed, 23 passed, 7 warnings in 12.64s ===================================================================================
(openadapt-py3.10) PS P:\OpenAdapt AI - MLDS AI\cloned_repo\OpenAdapt>
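Both failures above come from the missing NLTK punkt tokenizer rather than from the summary mixin itself. A minimal sketch of fetching it once before the tests run, using the standard nltk.download API that the error message itself suggests (where to put this, e.g. a conftest.py, is a judgment call):

    import nltk

    # Fetch the punkt sentence tokenizer only if it is not already cached.
    try:
        nltk.data.find("tokenizers/punkt")
    except LookupError:
        nltk.download("punkt")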
When running pytest, I got this error:
(.venv) C:\Users\jesic\PycharmProjects\PAT>pytest
=========================================== test session starts ===========================================
platform win32 -- Python 3.10.10, pytest-7.1.3, pluggy-1.0.0
rootdir: C:\Users\jesic\PycharmProjects\PAT
plugins: anyio-3.7.0
collected 25 items
tests\openadapt\test_crop.py . [ 4%]
tests\openadapt\test_events.py ....... [ 32%]
tests\openadapt\test_scrub.py F.............. [ 92%]
tests\openadapt\test_summary.py FF [100%]
================================================ FAILURES =================================================
____________________________________________ test_scrub_image _____________________________________________
@run_once
def get_tesseract_version():
"""
Returns LooseVersion object of the Tesseract version
"""
try:
return LooseVersion(
> subprocess.check_output(
[tesseract_cmd, '--version'],
stderr=subprocess.STDOUT,
env=environ,
)
.decode(DEFAULT_ENCODING)
.split()[1]
.lstrip(string.printable[10:]),
)
openadapt\.venv\lib\site-packages\pytesseract\pytesseract.py:383:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
timeout = None, popenargs = (['tesseract', '--version'],)
kwargs = {'env': environ({'ALLUSERSPROFILE': 'C:\\ProgramData', 'APPDATA': 'C:\\Users\\jesic\\AppData\\Roaming', 'CHOCOLATEYINS...NIT_AT_FORK': 'FALSE', 'PYTEST_CURRENT_TEST': 'tests/openadapt/test_scrub.py::test_scrub_image (call)'}), 'stderr': -2}
def check_output(*popenargs, timeout=None, **kwargs):
r"""Run command with arguments and return its output.
If the exit code was non-zero it raises a CalledProcessError. The
CalledProcessError object will have the return code in the returncode
attribute and output in the output attribute.
The arguments are the same as for the Popen constructor. Example:
>>> check_output(["ls", "-l", "/dev/null"])
b'crw-rw-rw- 1 root root 1, 3 Oct 18 2007 /dev/null\n'
The stdout argument is not allowed as it is used internally.
To capture standard error in the result, use stderr=STDOUT.
>>> check_output(["/bin/sh", "-c",
... "ls -l non_existent_file ; exit 0"],
... stderr=STDOUT)
b'ls: non_existent_file: No such file or directory\n'
There is an additional optional argument, "input", allowing you to
pass a string to the subprocess's stdin. If you use this argument
you may not also use the Popen constructor's "stdin" argument, as
it too will be used internally. Example:
>>> check_output(["sed", "-e", "s/foo/bar/"],
... input=b"when in the course of fooman events\n")
b'when in the course of barman events\n'
By default, all communication is in bytes, and therefore any "input"
should be bytes, and the return value will be bytes. If in text mode,
any "input" should be a string, and the return value will be a string
decoded according to locale encoding, or by "encoding" if set. Text mode
is triggered by setting any of text, encoding, errors or universal_newlines.
"""
if 'stdout' in kwargs:
raise ValueError('stdout argument not allowed, it will be overridden.')
if 'input' in kwargs and kwargs['input'] is None:
# Explicitly passing input=None was previously equivalent to passing an
# empty string. That is maintained here for backwards compatibility.
if kwargs.get('universal_newlines') or kwargs.get('text') or kwargs.get('encoding') \
or kwargs.get('errors'):
empty = ''
else:
empty = b''
kwargs['input'] = empty
> return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
**kwargs).stdout
C:\Python310\lib\subprocess.py:421:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
input = None, capture_output = False, timeout = None, check = True
popenargs = (['tesseract', '--version'],)
kwargs = {'env': environ({'ALLUSERSPROFILE': 'C:\\ProgramData', 'APPDATA': 'C:\\Users\\jesic\\AppData\\Roaming', 'CHOCOLATEYINS...'FALSE', 'PYTEST_CURRENT_TEST': 'tests/openadapt/test_scrub.py::test_scrub_image (call)'}), 'stderr': -2, 'stdout': -1}
def run(*popenargs,
input=None, capture_output=False, timeout=None, check=False, **kwargs):
"""Run command with arguments and return a CompletedProcess instance.
The returned instance will have attributes args, returncode, stdout and
stderr. By default, stdout and stderr are not captured, and those attributes
will be None. Pass stdout=PIPE and/or stderr=PIPE in order to capture them,
or pass capture_output=True to capture both.
If check is True and the exit code was non-zero, it raises a
CalledProcessError. The CalledProcessError object will have the return code
in the returncode attribute, and output & stderr attributes if those streams
were captured.
If timeout is given, and the process takes too long, a TimeoutExpired
exception will be raised.
There is an optional argument "input", allowing you to
pass bytes or a string to the subprocess's stdin. If you use this argument
you may not also use the Popen constructor's "stdin" argument, as
it will be used internally.
By default, all communication is in bytes, and therefore any "input" should
be bytes, and the stdout and stderr will be bytes. If in text mode, any
"input" should be a string, and stdout and stderr will be strings decoded
according to locale encoding, or by "encoding" if set. Text mode is
triggered by setting any of text, encoding, errors or universal_newlines.
The other arguments are the same as for the Popen constructor.
"""
if input is not None:
if kwargs.get('stdin') is not None:
raise ValueError('stdin and input arguments may not both be used.')
kwargs['stdin'] = PIPE
if capture_output:
if kwargs.get('stdout') is not None or kwargs.get('stderr') is not None:
raise ValueError('stdout and stderr arguments may not be used '
'with capture_output.')
kwargs['stdout'] = PIPE
kwargs['stderr'] = PIPE
> with Popen(*popenargs, **kwargs) as process:
C:\Python310\lib\subprocess.py:503:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <Popen: returncode: None args: ['tesseract', '--version']>, args = ['tesseract', '--version']
bufsize = -1, executable = None, stdin = None, stdout = -1, stderr = -2, preexec_fn = None, close_fds = True
shell = False, cwd = None
env = environ({'ALLUSERSPROFILE': 'C:\\ProgramData', 'APPDATA': 'C:\\Users\\jesic\\AppData\\Roaming', 'CHOCOLATEYINSTALL': '... 'True', 'KMP_INIT_AT_FORK': 'FALSE', 'PYTEST_CURRENT_TEST': 'tests/openadapt/test_scrub.py::test_scrub_image (call)'})
universal_newlines = None, startupinfo = None, creationflags = 0, restore_signals = True
start_new_session = False, pass_fds = ()
def __init__(self, args, bufsize=-1, executable=None,
stdin=None, stdout=None, stderr=None,
preexec_fn=None, close_fds=True,
shell=False, cwd=None, env=None, universal_newlines=None,
startupinfo=None, creationflags=0,
restore_signals=True, start_new_session=False,
pass_fds=(), *, user=None, group=None, extra_groups=None,
encoding=None, errors=None, text=None, umask=-1, pipesize=-1):
"""Create new Popen instance."""
_cleanup()
# Held while anything is calling waitpid before returncode has been
# updated to prevent clobbering returncode if wait() or poll() are
# called from multiple threads at once. After acquiring the lock,
# code must re-check self.returncode to see if another thread just
# finished a waitpid() call.
self._waitpid_lock = threading.Lock()
self._input = None
self._communication_started = False
if bufsize is None:
bufsize = -1 # Restore default
if not isinstance(bufsize, int):
raise TypeError("bufsize must be an integer")
if pipesize is None:
pipesize = -1 # Restore default
if not isinstance(pipesize, int):
raise TypeError("pipesize must be an integer")
if _mswindows:
if preexec_fn is not None:
raise ValueError("preexec_fn is not supported on Windows "
"platforms")
else:
# POSIX
if pass_fds and not close_fds:
warnings.warn("pass_fds overriding close_fds.", RuntimeWarning)
close_fds = True
if startupinfo is not None:
raise ValueError("startupinfo is only supported on Windows "
"platforms")
if creationflags != 0:
raise ValueError("creationflags is only supported on Windows "
"platforms")
self.args = args
self.stdin = None
self.stdout = None
self.stderr = None
self.pid = None
self.returncode = None
self.encoding = encoding
self.errors = errors
self.pipesize = pipesize
# Validate the combinations of text and universal_newlines
if (text is not None and universal_newlines is not None
and bool(universal_newlines) != bool(text)):
raise SubprocessError('Cannot disambiguate when both text '
'and universal_newlines are supplied but '
'different. Pass one or the other.')
# Input and output objects. The general principle is like
# this:
#
# Parent Child
# ------ -----
# p2cwrite ---stdin---> p2cread
# c2pread <--stdout--- c2pwrite
# errread <--stderr--- errwrite
#
# On POSIX, the child objects are file descriptors. On
# Windows, these are Windows file handles. The parent objects
# are file descriptors on both platforms. The parent objects
# are -1 when not using PIPEs. The child objects are -1
# when not redirecting.
(p2cread, p2cwrite,
c2pread, c2pwrite,
errread, errwrite) = self._get_handles(stdin, stdout, stderr)
# We wrap OS handles *before* launching the child, otherwise a
# quickly terminating child could make our fds unwrappable
# (see #8458).
if _mswindows:
if p2cwrite != -1:
p2cwrite = msvcrt.open_osfhandle(p2cwrite.Detach(), 0)
if c2pread != -1:
c2pread = msvcrt.open_osfhandle(c2pread.Detach(), 0)
if errread != -1:
errread = msvcrt.open_osfhandle(errread.Detach(), 0)
self.text_mode = encoding or errors or text or universal_newlines
# PEP 597: We suppress the EncodingWarning in subprocess module
# for now (at Python 3.10), because we focus on files for now.
# This will be changed to encoding = io.text_encoding(encoding)
# in the future.
if self.text_mode and encoding is None:
self.encoding = encoding = "locale"
# How long to resume waiting on a child after the first ^C.
# There is no right value for this. The purpose is to be polite
# yet remain good for interactive users trying to exit a tool.
self._sigint_wait_secs = 0.25 # 1/xkcd221.getRandomNumber()
self._closed_child_pipe_fds = False
if self.text_mode:
if bufsize == 1:
line_buffering = True
# Use the default buffer size for the underlying binary streams
# since they don't support line buffering.
bufsize = -1
else:
line_buffering = False
gid = None
if group is not None:
if not hasattr(os, 'setregid'):
raise ValueError("The 'group' parameter is not supported on the "
"current platform")
elif isinstance(group, str):
try:
import grp
except ImportError:
raise ValueError("The group parameter cannot be a string "
"on systems without the grp module")
gid = grp.getgrnam(group).gr_gid
elif isinstance(group, int):
gid = group
else:
raise TypeError("Group must be a string or an integer, not {}"
.format(type(group)))
if gid < 0:
raise ValueError(f"Group ID cannot be negative, got {gid}")
gids = None
if extra_groups is not None:
if not hasattr(os, 'setgroups'):
raise ValueError("The 'extra_groups' parameter is not "
"supported on the current platform")
elif isinstance(extra_groups, str):
raise ValueError("Groups must be a list, not a string")
gids = []
for extra_group in extra_groups:
if isinstance(extra_group, str):
try:
import grp
except ImportError:
raise ValueError("Items in extra_groups cannot be "
"strings on systems without the "
"grp module")
gids.append(grp.getgrnam(extra_group).gr_gid)
elif isinstance(extra_group, int):
gids.append(extra_group)
else:
raise TypeError("Items in extra_groups must be a string "
"or integer, not {}"
.format(type(extra_group)))
# make sure that the gids are all positive here so we can do less
# checking in the C code
for gid_check in gids:
if gid_check < 0:
raise ValueError(f"Group ID cannot be negative, got {gid_check}")
uid = None
if user is not None:
if not hasattr(os, 'setreuid'):
raise ValueError("The 'user' parameter is not supported on "
"the current platform")
elif isinstance(user, str):
try:
import pwd
except ImportError:
raise ValueError("The user parameter cannot be a string "
"on systems without the pwd module")
uid = pwd.getpwnam(user).pw_uid
elif isinstance(user, int):
uid = user
else:
raise TypeError("User must be a string or an integer")
if uid < 0:
raise ValueError(f"User ID cannot be negative, got {uid}")
try:
if p2cwrite != -1:
self.stdin = io.open(p2cwrite, 'wb', bufsize)
if self.text_mode:
self.stdin = io.TextIOWrapper(self.stdin, write_through=True,
line_buffering=line_buffering,
encoding=encoding, errors=errors)
if c2pread != -1:
self.stdout = io.open(c2pread, 'rb', bufsize)
if self.text_mode:
self.stdout = io.TextIOWrapper(self.stdout,
encoding=encoding, errors=errors)
if errread != -1:
self.stderr = io.open(errread, 'rb', bufsize)
if self.text_mode:
self.stderr = io.TextIOWrapper(self.stderr,
encoding=encoding, errors=errors)
> self._execute_child(args, executable, preexec_fn, close_fds,
pass_fds, cwd, env,
startupinfo, creationflags, shell,
p2cread, p2cwrite,
c2pread, c2pwrite,
errread, errwrite,
restore_signals,
gid, gids, uid, umask,
start_new_session)
C:\Python310\lib\subprocess.py:971:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <Popen: returncode: None args: ['tesseract', '--version']>, args = 'tesseract --version'
executable = None, preexec_fn = None, close_fds = False, pass_fds = (), cwd = None
env = environ({'ALLUSERSPROFILE': 'C:\\ProgramData', 'APPDATA': 'C:\\Users\\jesic\\AppData\\Roaming', 'CHOCOLATEYINSTALL': '... 'True', 'KMP_INIT_AT_FORK': 'FALSE', 'PYTEST_CURRENT_TEST': 'tests/openadapt/test_scrub.py::test_scrub_image (call)'})
startupinfo = <subprocess.STARTUPINFO object at 0x00000172C2B17E50>, creationflags = 0, shell = False
p2cread = Handle(7016), p2cwrite = -1, c2pread = 15, c2pwrite = Handle(6992), errread = -1
errwrite = Handle(2336), unused_restore_signals = True, unused_gid = None, unused_gids = None
unused_uid = None, unused_umask = -1, unused_start_new_session = False
def _execute_child(self, args, executable, preexec_fn, close_fds,
pass_fds, cwd, env,
startupinfo, creationflags, shell,
p2cread, p2cwrite,
c2pread, c2pwrite,
errread, errwrite,
unused_restore_signals,
unused_gid, unused_gids, unused_uid,
unused_umask,
unused_start_new_session):
"""Execute program (MS Windows version)"""
assert not pass_fds, "pass_fds not supported on Windows."
if isinstance(args, str):
pass
elif isinstance(args, bytes):
if shell:
raise TypeError('bytes args is not allowed on Windows')
args = list2cmdline([args])
elif isinstance(args, os.PathLike):
if shell:
raise TypeError('path-like args is not allowed when '
'shell is true')
args = list2cmdline([args])
else:
args = list2cmdline(args)
if executable is not None:
executable = os.fsdecode(executable)
# Process startup details
if startupinfo is None:
startupinfo = STARTUPINFO()
else:
# bpo-34044: Copy STARTUPINFO since it is modified above,
# so the caller can reuse it multiple times.
startupinfo = startupinfo.copy()
use_std_handles = -1 not in (p2cread, c2pwrite, errwrite)
if use_std_handles:
startupinfo.dwFlags |= _winapi.STARTF_USESTDHANDLES
startupinfo.hStdInput = p2cread
startupinfo.hStdOutput = c2pwrite
startupinfo.hStdError = errwrite
attribute_list = startupinfo.lpAttributeList
have_handle_list = bool(attribute_list and
"handle_list" in attribute_list and
attribute_list["handle_list"])
# If we were given an handle_list or need to create one
if have_handle_list or (use_std_handles and close_fds):
if attribute_list is None:
attribute_list = startupinfo.lpAttributeList = {}
handle_list = attribute_list["handle_list"] = \
list(attribute_list.get("handle_list", []))
if use_std_handles:
handle_list += [int(p2cread), int(c2pwrite), int(errwrite)]
handle_list[:] = self._filter_handle_list(handle_list)
if handle_list:
if not close_fds:
warnings.warn("startupinfo.lpAttributeList['handle_list'] "
"overriding close_fds", RuntimeWarning)
# When using the handle_list we always request to inherit
# handles but the only handles that will be inherited are
# the ones in the handle_list
close_fds = False
if shell:
startupinfo.dwFlags |= _winapi.STARTF_USESHOWWINDOW
startupinfo.wShowWindow = _winapi.SW_HIDE
comspec = os.environ.get("COMSPEC", "cmd.exe")
args = '{} /c "{}"'.format (comspec, args)
if cwd is not None:
cwd = os.fsdecode(cwd)
sys.audit("subprocess.Popen", executable, args, cwd, env)
# Start the process
try:
> hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
# no special security
None, None,
int(not close_fds),
creationflags,
env,
cwd,
startupinfo)
E FileNotFoundError: [WinError 2] The system cannot find the file specified
C:\Python310\lib\subprocess.py:1440: FileNotFoundError
During handling of the above exception, another exception occurred:
def test_scrub_image() -> None:
"""
Test that the scrubbed image data is different
"""
warnings.filterwarnings("ignore", category=DeprecationWarning)
# Read test image data from file
test_image_path = "assets/test_scrub_image.png"
with open(test_image_path, "rb") as file:
test_image_data = file.read()
# Convert image data to PIL Image object
test_image = Image.open(BytesIO(test_image_data))
# Scrub the image
> scrubbed_image = scrub.scrub_image(test_image)
tests\openadapt\test_scrub.py:40:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
openadapt\scrub.py:103: in scrub_image
redacted_image = IMAGE_REDACTOR.redact(
openadapt\.venv\lib\site-packages\presidio_image_redactor\image_redactor_engine.py:45: in redact
bboxes = self.image_analyzer_engine.analyze(
openadapt\.venv\lib\site-packages\presidio_image_redactor\image_analyzer_engine.py:44: in analyze
ocr_result = self.ocr.perform_ocr(image, **perform_ocr_kwargs)
openadapt\.venv\lib\site-packages\presidio_image_redactor\tesseract_ocr.py:18: in perform_ocr
return pytesseract.image_to_data(image, output_type=output_type, **kwargs)
openadapt\.venv\lib\site-packages\pytesseract\pytesseract.py:507: in image_to_data
if get_tesseract_version() < '3.05':
openadapt\.venv\lib\site-packages\pytesseract\pytesseract.py:148: in wrapper
wrapper._result = func(*args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
@run_once
def get_tesseract_version():
"""
Returns LooseVersion object of the Tesseract version
"""
try:
return LooseVersion(
subprocess.check_output(
[tesseract_cmd, '--version'],
stderr=subprocess.STDOUT,
env=environ,
)
.decode(DEFAULT_ENCODING)
.split()[1]
.lstrip(string.printable[10:]),
)
except OSError:
> raise TesseractNotFoundError()
E pytesseract.pytesseract.TesseractNotFoundError: tesseract is not installed or it's not in your PATH. See README file for more information.
openadapt\.venv\lib\site-packages\pytesseract\pytesseract.py:393: TesseractNotFoundError
___________________________________________ test_summary_empty ____________________________________________
self = <sumy.nlp.tokenizers.Tokenizer object at 0x00000172C4C39420>, language = 'english'
def _get_sentence_tokenizer(self, language):
if language in self.SPECIAL_SENTENCE_TOKENIZERS:
return self.SPECIAL_SENTENCE_TOKENIZERS[language]
try:
path = to_string("tokenizers/punkt/%s.pickle") % to_string(language)
> return nltk.data.load(path)
openadapt\.venv\lib\site-packages\sumy\nlp\tokenizers.py:172:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
resource_url = 'nltk:tokenizers/punkt/english.pickle', format = 'pickle', cache = True, verbose = False
logic_parser = None, fstruct_reader = None, encoding = None
def load(
resource_url,
format="auto",
cache=True,
verbose=False,
logic_parser=None,
fstruct_reader=None,
encoding=None,
):
"""
Load a given resource from the NLTK data package. The following
resource formats are currently supported:
- ``pickle``
- ``json``
- ``yaml``
- ``cfg`` (context free grammars)
- ``pcfg`` (probabilistic CFGs)
- ``fcfg`` (feature-based CFGs)
- ``fol`` (formulas of First Order Logic)
- ``logic`` (Logical formulas to be parsed by the given logic_parser)
- ``val`` (valuation of First Order Logic model)
- ``text`` (the file contents as a unicode string)
- ``raw`` (the raw file contents as a byte string)
If no format is specified, ``load()`` will attempt to determine a
format based on the resource name's file extension. If that
fails, ``load()`` will raise a ``ValueError`` exception.
For all text formats (everything except ``pickle``, ``json``, ``yaml`` and ``raw``),
it tries to decode the raw contents using UTF-8, and if that doesn't
work, it tries with ISO-8859-1 (Latin-1), unless the ``encoding``
is specified.
:type resource_url: str
:param resource_url: A URL specifying where the resource should be
loaded from. The default protocol is "nltk:", which searches
for the file in the the NLTK data package.
:type cache: bool
:param cache: If true, add this resource to a cache. If load()
finds a resource in its cache, then it will return it from the
cache rather than loading it.
:type verbose: bool
:param verbose: If true, print a message when loading a resource.
Messages are not displayed when a resource is retrieved from
the cache.
:type logic_parser: LogicParser
:param logic_parser: The parser that will be used to parse logical
expressions.
:type fstruct_reader: FeatStructReader
:param fstruct_reader: The parser that will be used to parse the
feature structure of an fcfg.
:type encoding: str
:param encoding: the encoding of the input; only used for text formats.
"""
resource_url = normalize_resource_url(resource_url)
resource_url = add_py3_data(resource_url)
# Determine the format of the resource.
if format == "auto":
resource_url_parts = resource_url.split(".")
ext = resource_url_parts[-1]
if ext == "gz":
ext = resource_url_parts[-2]
format = AUTO_FORMATS.get(ext)
if format is None:
raise ValueError(
"Could not determine format for %s based "
'on its file\nextension; use the "format" '
"argument to specify the format explicitly." % resource_url
)
if format not in FORMATS:
raise ValueError(f"Unknown format type: {format}!")
# If we've cached the resource, then just return it.
if cache:
resource_val = _resource_cache.get((resource_url, format))
if resource_val is not None:
if verbose:
print(f"<<Using cached copy of {resource_url}>>")
return resource_val
# Let the user know what's going on.
if verbose:
print(f"<<Loading {resource_url}>>")
# Load the resource.
> opened_resource = _open(resource_url)
openadapt\.venv\lib\site-packages\nltk\data.py:750:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
resource_url = 'nltk:tokenizers/punkt/english.pickle'
def _open(resource_url):
"""
Helper function that returns an open file object for a resource,
given its resource URL. If the given resource URL uses the "nltk:"
protocol, or uses no protocol, then use ``nltk.data.find`` to find
its path, and open it with the given mode; if the resource URL
uses the 'file' protocol, then open the file with the given mode;
otherwise, delegate to ``urllib2.urlopen``.
:type resource_url: str
:param resource_url: A URL specifying where the resource should be
loaded from. The default protocol is "nltk:", which searches
for the file in the the NLTK data package.
"""
resource_url = normalize_resource_url(resource_url)
protocol, path_ = split_resource_url(resource_url)
if protocol is None or protocol.lower() == "nltk":
> return find(path_, path + [""]).open()
openadapt\.venv\lib\site-packages\nltk\data.py:876:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
resource_name = 'tokenizers/punkt/english.pickle'
paths = ['C:\\Users\\jesic/nltk_data', 'C:\\Users\\jesic\\PycharmProjects\\PAT\\openadapt\\.venv\\nltk_data', 'C:\\Users\\jesi...rojects\\PAT\\openadapt\\.venv\\lib\\nltk_data', 'C:\\Users\\jesic\\AppData\\Roaming\\nltk_data', 'C:\\nltk_data', ...]
def find(resource_name, paths=None):
"""
Find the given resource by searching through the directories and
zip files in paths, where a None or empty string specifies an absolute path.
Returns a corresponding path name. If the given resource is not
found, raise a ``LookupError``, whose message gives a pointer to
the installation instructions for the NLTK downloader.
Zip File Handling:
- If ``resource_name`` contains a component with a ``.zip``
extension, then it is assumed to be a zipfile; and the
remaining path components are used to look inside the zipfile.
- If any element of ``nltk.data.path`` has a ``.zip`` extension,
then it is assumed to be a zipfile.
- If a given resource name that does not contain any zipfile
component is not found initially, then ``find()`` will make a
second attempt to find that resource, by replacing each
component *p* in the path with *p.zip/p*. For example, this
allows ``find()`` to map the resource name
``corpora/chat80/cities.pl`` to a zip file path pointer to
``corpora/chat80.zip/chat80/cities.pl``.
- When using ``find()`` to locate a directory contained in a
zipfile, the resource name must end with the forward slash
character. Otherwise, ``find()`` will not locate the
directory.
:type resource_name: str or unicode
:param resource_name: The name of the resource to search for.
Resource names are posix-style relative path names, such as
``corpora/brown``. Directory names will be
automatically converted to a platform-appropriate path separator.
:rtype: str
"""
resource_name = normalize_resource_name(resource_name, True)
# Resolve default paths at runtime in-case the user overrides
# nltk.data.path
if paths is None:
paths = path
# Check if the resource name includes a zipfile name
m = re.match(r"(.*\.zip)/?(.*)$|", resource_name)
zipfile, zipentry = m.groups()
# Check each item in our path
for path_ in paths:
# Is the path item a zipfile?
if path_ and (os.path.isfile(path_) and path_.endswith(".zip")):
try:
return ZipFilePathPointer(path_, resource_name)
except OSError:
# resource not in zipfile
continue
# Is the path item a directory or is resource_name an absolute path?
elif not path_ or os.path.isdir(path_):
if zipfile is None:
p = os.path.join(path_, url2pathname(resource_name))
if os.path.exists(p):
if p.endswith(".gz"):
return GzipFileSystemPathPointer(p)
else:
return FileSystemPathPointer(p)
else:
p = os.path.join(path_, url2pathname(zipfile))
if os.path.exists(p):
try:
return ZipFilePathPointer(p, zipentry)
except OSError:
# resource not in zipfile
continue
# Fallback: if the path doesn't include a zip file, then try
# again, assuming that one of the path components is inside a
# zipfile of the same name.
if zipfile is None:
pieces = resource_name.split("/")
for i in range(len(pieces)):
modified_name = "/".join(pieces[:i] + [pieces[i] + ".zip"] + pieces[i:])
try:
return find(modified_name, paths)
except LookupError:
pass
# Identify the package (i.e. the .zip file) to download.
resource_zipname = resource_name.split("/")[1]
if resource_zipname.endswith(".zip"):
resource_zipname = resource_zipname.rpartition(".")[0]
# Display a friendly error message if the resource wasn't found:
msg = str(
"Resource \33[93m{resource}\033[0m not found.\n"
"Please use the NLTK Downloader to obtain the resource:\n\n"
"\33[31m" # To display red text in terminal.
">>> import nltk\n"
">>> nltk.download('{resource}')\n"
"\033[0m"
).format(resource=resource_zipname)
msg = textwrap_indent(msg)
msg += "\n For more information see: https://www.nltk.org/data.html\n"
msg += "\n Attempted to load \33[93m{resource_name}\033[0m\n".format(
resource_name=resource_name
)
msg += "\n Searched in:" + "".join("\n - %r" % d for d in paths)
sep = "*" * 70
resource_not_found = f"\n{sep}\n{msg}\n{sep}\n"
> raise LookupError(resource_not_found)
E LookupError:
E **********************************************************************
E Resource punkt not found.
E Please use the NLTK Downloader to obtain the resource:
E
E >>> import nltk
E >>> nltk.download('punkt')
E
E For more information see: https://www.nltk.org/data.html
E
E Attempted to load tokenizers/punkt/english.pickle
E
E Searched in:
E - 'C:\\Users\\jesic/nltk_data'
E - 'C:\\Users\\jesic\\PycharmProjects\\PAT\\openadapt\\.venv\\nltk_data'
E - 'C:\\Users\\jesic\\PycharmProjects\\PAT\\openadapt\\.venv\\share\\nltk_data'
E - 'C:\\Users\\jesic\\PycharmProjects\\PAT\\openadapt\\.venv\\lib\\nltk_data'
E - 'C:\\Users\\jesic\\AppData\\Roaming\\nltk_data'
E - 'C:\\nltk_data'
E - 'D:\\nltk_data'
E - 'E:\\nltk_data'
E - ''
E **********************************************************************
openadapt\.venv\lib\site-packages\nltk\data.py:583: LookupError
During handling of the above exception, another exception occurred:
def test_summary_empty():
empty_text = ""
> actual = REPLAY.get_summary(empty_text, 1)
tests\openadapt\test_summary.py:28:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
openadapt\strategies\mixins\summary.py:48: in get_summary
parser = PlaintextParser.from_string(text, Tokenizer("english"))
openadapt\.venv\lib\site-packages\sumy\nlp\tokenizers.py:160: in __init__
self._sentence_tokenizer = self._get_sentence_tokenizer(tokenizer_language)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <sumy.nlp.tokenizers.Tokenizer object at 0x00000172C4C39420>, language = 'english'
def _get_sentence_tokenizer(self, language):
if language in self.SPECIAL_SENTENCE_TOKENIZERS:
return self.SPECIAL_SENTENCE_TOKENIZERS[language]
try:
path = to_string("tokenizers/punkt/%s.pickle") % to_string(language)
return nltk.data.load(path)
except (LookupError, zipfile.BadZipfile) as e:
> raise LookupError(
"NLTK tokenizers are missing or the language is not supported.\n"
"""Download them by following command: python -c "import nltk; nltk.download('punkt')"\n"""
"Original error was:\n" + str(e)
)
E LookupError: NLTK tokenizers are missing or the language is not supported.
E Download them by following command: python -c "import nltk; nltk.download('punkt')"
E       Original error was: [identical "Resource punkt not found" block as above]
openadapt\.venv\lib\site-packages\sumy\nlp\tokenizers.py:174: LookupError
__________________________________________ test_summary_sentence __________________________________________
self = <sumy.nlp.tokenizers.Tokenizer object at 0x00000172C712FCA0>, language = 'english'
[traceback identical to test_summary_empty above: _get_sentence_tokenizer -> nltk.data.load -> nltk.data.find -> LookupError: Resource punkt not found]
During handling of the above exception, another exception occurred:
def test_summary_sentence():
story = "However, this bottle was not marked “poison,” so Alice ventured to taste it, \
and finding it very nice, (it had, in fact, a sort of mixed flavour of cherry-tart, \
custard, pine-apple, roast turkey, toffee, and hot buttered toast,) \
she very soon finished it off."
> actual = REPLAY.get_summary(story, 1)
tests\openadapt\test_summary.py:37:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
openadapt\strategies\mixins\summary.py:48: in get_summary
parser = PlaintextParser.from_string(text, Tokenizer("english"))
openadapt\.venv\lib\site-packages\sumy\nlp\tokenizers.py:160: in __init__
self._sentence_tokenizer = self._get_sentence_tokenizer(tokenizer_language)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <sumy.nlp.tokenizers.Tokenizer object at 0x00000172C712FCA0>, language = 'english'
def _get_sentence_tokenizer(self, language):
if language in self.SPECIAL_SENTENCE_TOKENIZERS:
return self.SPECIAL_SENTENCE_TOKENIZERS[language]
try:
path = to_string("tokenizers/punkt/%s.pickle") % to_string(language)
return nltk.data.load(path)
except (LookupError, zipfile.BadZipfile) as e:
> raise LookupError(
"NLTK tokenizers are missing or the language is not supported.\n"
"""Download them by following command: python -c "import nltk; nltk.download('punkt')"\n"""
"Original error was:\n" + str(e)
)
E LookupError: NLTK tokenizers are missing or the language is not supported.
E Download them by following command: python -c "import nltk; nltk.download('punkt')"
E       Original error was: [identical "Resource punkt not found" block as above]
openadapt\.venv\lib\site-packages\sumy\nlp\tokenizers.py:174: LookupError
============================================ warnings summary =============================================
openadapt\.venv\lib\site-packages\fuzzywuzzy\fuzz.py:11
C:\Users\jesic\PycharmProjects\PAT\openadapt\.venv\lib\site-packages\fuzzywuzzy\fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning
warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')
openadapt\.venv\lib\site-packages\onnxruntime\capi\_pybind_state.py:28
C:\Users\jesic\PycharmProjects\PAT\openadapt\.venv\lib\site-packages\onnxruntime\capi\_pybind_state.py:28: DeprecationWarning: invalid escape sequence '\S'
"(other than %SystemRoot%\System32), "
openadapt\.venv\lib\site-packages\pycountry\__init__.py:10
C:\Users\jesic\PycharmProjects\PAT\openadapt\.venv\lib\site-packages\pycountry\__init__.py:10: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
import pkg_resources
openadapt\.venv\lib\site-packages\pkg_resources\__init__.py:2871: 10 warnings
C:\Users\jesic\PycharmProjects\PAT\openadapt\.venv\lib\site-packages\pkg_resources\__init__.py:2871: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('google')`.
Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
declare_namespace(pkg)
openadapt\.venv\lib\site-packages\pkg_resources\__init__.py:2871
openadapt\.venv\lib\site-packages\pkg_resources\__init__.py:2871
openadapt\.venv\lib\site-packages\pkg_resources\__init__.py:2871
openadapt\.venv\lib\site-packages\pkg_resources\__init__.py:2871
openadapt\.venv\lib\site-packages\pkg_resources\__init__.py:2871
C:\Users\jesic\PycharmProjects\PAT\openadapt\.venv\lib\site-packages\pkg_resources\__init__.py:2871: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('google.cloud')`.
Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
declare_namespace(pkg)
openadapt\.venv\lib\site-packages\pkg_resources\__init__.py:2350
openadapt\.venv\lib\site-packages\pkg_resources\__init__.py:2350
openadapt\.venv\lib\site-packages\pkg_resources\__init__.py:2350
C:\Users\jesic\PycharmProjects\PAT\openadapt\.venv\lib\site-packages\pkg_resources\__init__.py:2350: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('google')`.
Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
declare_namespace(parent)
openadapt\.venv\lib\site-packages\pkg_resources\__init__.py:2871
C:\Users\jesic\PycharmProjects\PAT\openadapt\.venv\lib\site-packages\pkg_resources\__init__.py:2871: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('google.logging')`.
Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
declare_namespace(pkg)
openadapt\.venv\lib\site-packages\pkg_resources\__init__.py:2871
C:\Users\jesic\PycharmProjects\PAT\openadapt\.venv\lib\site-packages\pkg_resources\__init__.py:2871: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('google.iam')`.
Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
declare_namespace(pkg)
openadapt\.venv\lib\site-packages\pkg_resources\__init__.py:2871
C:\Users\jesic\PycharmProjects\PAT\openadapt\.venv\lib\site-packages\pkg_resources\__init__.py:2871: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('mpl_toolkits')`.
Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
declare_namespace(pkg)
openadapt\.venv\lib\site-packages\pkg_resources\__init__.py:2871
C:\Users\jesic\PycharmProjects\PAT\openadapt\.venv\lib\site-packages\pkg_resources\__init__.py:2871: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('ruamel')`.
Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
declare_namespace(pkg)
openadapt\.venv\lib\site-packages\pkg_resources\__init__.py:2871
C:\Users\jesic\PycharmProjects\PAT\openadapt\.venv\lib\site-packages\pkg_resources\__init__.py:2871: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('ruamel.yaml')`.
Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
declare_namespace(pkg)
openadapt\.venv\lib\site-packages\pkg_resources\__init__.py:2350
C:\Users\jesic\PycharmProjects\PAT\openadapt\.venv\lib\site-packages\pkg_resources\__init__.py:2350: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('ruamel')`.
Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
declare_namespace(parent)
openadapt\.venv\lib\site-packages\pkg_resources\__init__.py:2871
openadapt\.venv\lib\site-packages\pkg_resources\__init__.py:2871
openadapt\.venv\lib\site-packages\pkg_resources\__init__.py:2871
openadapt\.venv\lib\site-packages\pkg_resources\__init__.py:2871
C:\Users\jesic\PycharmProjects\PAT\openadapt\.venv\lib\site-packages\pkg_resources\__init__.py:2871: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('sphinxcontrib')`.
Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
declare_namespace(pkg)
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========================================= short test summary info =========================================
FAILED tests/openadapt/test_scrub.py::test_scrub_image - pytesseract.pytesseract.TesseractNotFoundError: ...
FAILED tests/openadapt/test_summary.py::test_summary_empty - LookupError: NLTK tokenizers are missing or ...
FAILED tests/openadapt/test_summary.py::test_summary_sentence - LookupError: NLTK tokenizers are missing ...
========================== 3 failed, 22 passed, 31 warnings in 110.08s (0:01:50) ==========================
@jesicasusanto
I don't think you have Tesseract OCR installed, which is why the first failure is in test_scrub: pytesseract shells out to the tesseract binary to check its version, and the FileNotFoundError ([WinError 2]) higher up in the traceback is subprocess failing to launch that binary, which pytesseract then reports as TesseractNotFoundError.
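On Windows the fix is to install the Tesseract engine and make sure it is on PATH, or to point pytesseract at the binary explicitly. A minimal sketch, assuming the default install location of the common Windows build (the path is an assumption; adjust it to wherever tesseract.exe actually lives on your machine):

import pytesseract

# Assumed install location; not part of the repo. Adjust as needed.
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Sanity check: raises TesseractNotFoundError if the binary still cannot be launched.
print(pytesseract.get_tesseract_version())

Once tesseract --version works from the same shell that runs pytest, the explicit tesseract_cmd override is unnecessary.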
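The two summary failures name their own fix: the punkt tokenizer data has never been downloaded, and the LookupError message gives the one-off command python -c "import nltk; nltk.download('punkt')". To keep fresh environments from tripping on this, here is a hedged sketch of a pytest hook, placed in a hypothetical tests/conftest.py (not something the repo currently has):

import nltk

def pytest_configure(config):
    # Fetch the punkt sentence tokenizer before any test runs;
    # nltk.download is a no-op when the data is already present.
    nltk.download("punkt", quiet=True)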
Describe the bug
To Reproduce
Follow recommended installation instructions in README: