HazyResearch / fonduer

A knowledge base construction engine for richly formatted data
https://fonduer.readthedocs.io/
MIT License
409 stars 77 forks source link

sqlalchemy.exc.InvalidRequestError during labeler.apply or featurizer.apply #482

Closed HiromuHota closed 4 years ago

HiromuHota commented 4 years ago

Description of the bug

When executing labeler.apply or featurizer.apply, sqlalchemy.exc.InvalidRequestError occurs.

To Reproduce

Steps to reproduce the behavior:

  1. Deploy the fonduer-tutorials using docker
  2. Upgrade fonduer to the latest commit on the master branch
  3. Run the hardware tutorial up to 2.2 Multimodal Featurization

Expected behavior

labeler.apply and featurizer.apply run without an error.

Error Logs/Screenshots

Exception in thread Thread-15:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/user/.venv/lib/python3.7/site-packages/fonduer/utils/udf.py", line 136, in in_thread_func
    self.last_docs.add(doc.name)
  File "/home/user/.venv/lib/python3.7/site-packages/sqlalchemy/orm/attributes.py", line 286, in __get__
    return self.impl.get(instance_state(instance), dict_)
  File "/home/user/.venv/lib/python3.7/site-packages/sqlalchemy/orm/attributes.py", line 717, in get
    value = state._load_expired(state, passive)
  File "/home/user/.venv/lib/python3.7/site-packages/sqlalchemy/orm/state.py", line 652, in _load_expired
    self.manager.deferred_scalar_loader(self, toload)
  File "/home/user/.venv/lib/python3.7/site-packages/sqlalchemy/orm/loading.py", line 1012, in load_scalar_attributes
    only_load_props=attribute_names,
  File "/home/user/.venv/lib/python3.7/site-packages/sqlalchemy/orm/loading.py", line 207, in load_on_ident
    identity_token=identity_token,
  File "/home/user/.venv/lib/python3.7/site-packages/sqlalchemy/orm/loading.py", line 287, in load_on_pk_identity
    return q.one()
  File "/home/user/.venv/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 3360, in one
    ret = self.one_or_none()
  File "/home/user/.venv/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 3329, in one_or_none
    ret = list(self)
  File "/home/user/.venv/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 3405, in __iter__
    return self._execute_and_instances(context)
  File "/home/user/.venv/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 3427, in _execute_and_instances
    querycontext, self._connection_from_session, close_with_result=True
  File "/home/user/.venv/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 3442, in _get_bind_args
    mapper=self._bind_mapper(), clause=querycontext.statement, **kw
  File "/home/user/.venv/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 3420, in _connection_from_session
    conn = self.session.connection(**kw)
  File "/home/user/.venv/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 1133, in connection
    execution_options=execution_options,
  File "/home/user/.venv/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 1139, in _connection_for_bind
    engine, execution_options
  File "/home/user/.venv/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 408, in _connection_for_bind
    self._assert_active()
  File "/home/user/.venv/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 276, in _assert_active
    "This session is in 'committed' state; no further "
sqlalchemy.exc.InvalidRequestError: This session is in 'committed' state; no further SQL can be emitted within this transaction.

The last part of the error message could be

sqlalchemy.exc.InvalidRequestError: This session is in 'prepared' state; no further SQL can be emitted within this transaction.

Environment (please complete the following information)

Additional context

I think this is a regression caused by #439 as the error happens during self.last_docs.add(doc.name), which was added by #439.

HiromuHota commented 4 years ago

As can be seen from the slightly different error message (ie "This session is in 'committed' state" and "This session is in 'prepared' state"), this issue looks to me be in a race-condition.

HiromuHota commented 4 years ago

This https://github.com/HiromuHota/fonduer-tutorials/runs/879864413 successfully demonstrated this issue.

HiromuHota commented 4 years ago

I think the root cause is that a single session is accessed concurrently by multiple threads in the UDFRunner. The main thread at https://github.com/HazyResearch/fonduer/blob/a3af3877ef94fd8466b24b7cad7145a13413ac68/src/fonduer/utils/udf.py#L149

and the another at https://github.com/HazyResearch/fonduer/blob/a3af3877ef94fd8466b24b7cad7145a13413ac68/src/fonduer/utils/udf.py#L136

Actually, the main thread's access to the session happens only when Labeler and Featurizer like below: https://github.com/HazyResearch/fonduer/blob/a3af3877ef94fd8466b24b7cad7145a13413ac68/src/fonduer/supervision/labeler.py#L309-L311