Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

feat/migrate onedrive src #3295

Closed: rbiseck3 closed this 2 days ago

rbiseck3 commented 3 days ago

Description

Migrate the onedrive source connector to v2, pulling richer content from the SDK response to add further metadata to the FileData produced by the indexer.

potter-potter commented 2 days ago

For some reason this isn't passing the ingest test (test_unstructured_ingest/src/onedrive.sh) locally for me. There is some JSON error in the logger...

Also, there is a strange error in the CI even though it seems to pass: https://github.com/Unstructured-IO/unstructured/actions/runs/9679118293/job/26706070032#step:7:3333

potter-potter commented 2 days ago
nload_only": false, "max_docs": null, "re_download": false, "uncompress": false, "status": {}, "semaphore": null}
--- Logging error ---
Traceback (most recent call last):
  File "/Users/potter/Documents/unstructured/unstructured/ingest/v2/logger.py", line 85, in redact_jsons
    formatted_j = json.dumps(json.loads(j))
  File "/Users/potter/.pyenv/versions/3.10.13/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/Users/potter/.pyenv/versions/3.10.13/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/Users/potter/.pyenv/versions/3.10.13/lib/python3.10/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 20 (char 19)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/potter/.pyenv/versions/3.10.13/lib/python3.10/logging/__init__.py", line 1100, in emit
    msg = self.format(record)
  File "/Users/potter/.pyenv/versions/3.10.13/lib/python3.10/logging/__init__.py", line 943, in format
    return fmt.format(record)
  File "/Users/potter/Documents/unstructured/unstructured/ingest/v2/logger.py", line 97, in format
    return redact_jsons(s)
  File "/Users/potter/Documents/unstructured/unstructured/ingest/v2/logger.py", line 87, in redact_jsons
    lit = ast.literal_eval(j)
  File "/Users/potter/.pyenv/versions/3.10.13/lib/python3.10/ast.py", line 110, in literal_eval
    return _convert(node_or_string)
  File "/Users/potter/.pyenv/versions/3.10.13/lib/python3.10/ast.py", line 99, in _convert
    return dict(zip(map(_convert, node.keys),
  File "/Users/potter/.pyenv/versions/3.10.13/lib/python3.10/ast.py", line 109, in _convert
    return _convert_signed_num(node)
  File "/Users/potter/.pyenv/versions/3.10.13/lib/python3.10/ast.py", line 83, in _convert_signed_num
    return _convert_num(node)
  File "/Users/potter/.pyenv/versions/3.10.13/lib/python3.10/ast.py", line 74, in _convert_num
    _raise_malformed_node(node)
  File "/Users/potter/.pyenv/versions/3.10.13/lib/python3.10/ast.py", line 71, in _raise_malformed_node
    raise ValueError(msg + f': {node!r}')
ValueError: malformed node or string on line 1: <ast.Name object at 0x14ef70850>
Call stack:
  File "/Users/potter/Documents/unstructured/./unstructured/ingest/main.py", line 11, in <module>
    main()
  File "/Users/potter/Documents/unstructured/./unstructured/ingest/main.py", line 7, in main
    ingest_cmd()
  File "/Users/potter/Documents/v/unstructured_venv/venv/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/Users/potter/Documents/v/unstructured_venv/venv/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/Users/potter/Documents/v/unstructured_venv/venv/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/potter/Documents/v/unstructured_venv/venv/lib/python3.10/site-packages/click/core.py", line 1666, in invoke
    rv = super().invoke(ctx)
  File "/Users/potter/Documents/v/unstructured_venv/venv/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/potter/Documents/v/unstructured_venv/venv/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/Users/potter/Documents/v/unstructured_venv/venv/lib/python3.10/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/Users/potter/Documents/unstructured/unstructured/ingest/v2/cli/base/src.py", line 43, in cmd
    logger.error(f"failed to run source command {self.cmd_name}: {e}", exc_info=True)
Message: 'failed to run source command onedrive: Parser must be a string or character stream, not datetime'
Arguments: ()
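
The log above seems to show two separate problems. The underlying failure, "Parser must be a string or character stream, not datetime", suggests that a value which is already a `datetime` is being handed to a date-string parser. Separately, the logger's `redact_jsons` step raises while formatting that error, because the fragment it picks up is neither valid JSON nor a Python literal. The sketch below illustrates one defensive way to handle both cases. The helper names `ensure_datetime` and `normalize_json_like` are hypothetical and are not taken from the unstructured codebase, and the sketch uses only the standard library rather than whatever parser the connector actually calls.

```python
import ast
import json
from datetime import datetime
from typing import Optional, Union


def ensure_datetime(value: Union[str, datetime, None]) -> Optional[datetime]:
    """Hypothetical helper: accept a datetime, an ISO-8601 string, or None."""
    if value is None or isinstance(value, datetime):
        # Already a datetime (or absent): do not pass it to a string parser.
        return value
    # Assumption: timestamps arrive as ISO-8601 strings. Before Python 3.11,
    # fromisoformat() does not accept a trailing "Z".
    return datetime.fromisoformat(value)


def normalize_json_like(fragment: str) -> str:
    """Hypothetical helper: re-serialize a JSON-looking fragment.

    Returns the fragment unchanged if it is neither JSON nor a Python
    literal, so a formatting problem never raises from inside the logger.
    """
    try:
        return json.dumps(json.loads(fragment))
    except json.JSONDecodeError:
        pass
    try:
        return json.dumps(ast.literal_eval(fragment))
    except (ValueError, SyntaxError, TypeError):
        # e.g. a dict repr that contains a bare name or a function call
        return fragment
```

With this shape, a fragment such as `{'a': null}` (single quotes mixed with a JSON keyword) is passed through untouched instead of raising `ValueError: malformed node or string`, so the original error message still reaches the log.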