HazyResearch / fonduer

A knowledge base construction engine for richly formatted data
https://fonduer.readthedocs.io/
MIT License

Unbearable slowness in `Featurizer.get_feature_matrices` #483

Closed: HiromuHota closed this issue 4 years ago

HiromuHota commented 4 years ago

Description of the bug

featurizer.get_feature_matrices(train_cands) does not return within a bearable amount of time.

To Reproduce

Steps to reproduce the behavior:

  1. Deploy fonduer-tutorials using Docker.
  2. Upgrade fonduer to 2a49291003c2826835b0b900425d7663f0ecc202 (introduced by #407).
  3. Run the hardware tutorial up to "2.2 Multimodal Featurization".

Expected behavior

featurizer.get_feature_matrices(train_cands) returns in about 1 minute for the hardware tutorial.

Error Logs/Screenshots

When I forcefully cancel the operation, the following stack trace is shown.

KeyboardInterrupt                         Traceback (most recent call last)
<timed exec> in <module>

~/.venv/lib/python3.7/site-packages/fonduer/features/featurizer.py in get_feature_matrices(self, cand_lists)
    303             features.
    304         """
--> 305         return get_sparse_matrix(self.session, FeatureKey, cand_lists)
    306 
    307 

~/.venv/lib/python3.7/site-packages/fonduer/utils/utils_udf.py in get_sparse_matrix(session, key_table, cand_lists, key)
    157             else:
    158                 annotations.append({"keys": [], "values": []})
--> 159         result.append(_convert_mappings_to_matrix(annotations, key_names))
    160     return result
    161 

~/.venv/lib/python3.7/site-packages/fonduer/utils/utils_udf.py in _convert_mappings_to_matrix(mappings, keys)
    186         if mapping:
    187             for key, value in zip(mapping["keys"], mapping["values"]):
--> 188                 if key in keys:
    189                     indices.append(keys_map[key])
    190                     data.append(value)

Additional context

I think this is a regression caused by #407. Also, this issue is not noticeable unless the feature vector is large (e.g., around 30,000 dimensions).
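To illustrate why the slowdown would only show up with large feature vectors, here is a minimal, standalone timing sketch. It assumes that `keys` in the traceback above is a plain Python list, which the trace does not confirm; the names `key_names`, `keys_map`, and `probe` below are hypothetical and the sizes are chosen only to loosely mirror the ~30,000-dimensional case.

```python
import timeit

# Hypothetical data: about 30,000 feature key names.
n_keys = 30_000
key_names = [f"feature_{i}" for i in range(n_keys)]      # list: membership is a linear scan
keys_map = {key: i for i, key in enumerate(key_names)}   # dict: membership is a hash lookup

# Probing for the last key is the worst case for the list scan.
probe = key_names[-1]

list_time = timeit.timeit(lambda: probe in key_names, number=200)
dict_time = timeit.timeit(lambda: probe in keys_map, number=200)

print(f"list membership: {list_time:.4f} s for 200 lookups")
print(f"dict membership: {dict_time:.6f} s for 200 lookups")
```

If the assumption holds, each `key in keys` check costs time proportional to the number of keys, and it runs once per feature per candidate, so the total cost grows quickly with the feature dimension.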

HiromuHota commented 4 years ago

https://github.com/HiromuHota/fonduer-tutorials/runs/883626478 demonstrates that the hardware tutorial stalls at featurizer.get_feature_matrices(train_cands) (right after featurizer.apply(split=0, train=True, parallelism=PARALLEL)) for about 20 minutes.

2020-07-17T22:46:57.5633398Z CPU times: user 21min 12s, sys: 3.23 s, total: 21min 16s
2020-07-17T22:46:57.5633521Z Wall time: 21min 24s
2020-07-17T22:46:57.5633638Z (28935, 27732)

HiromuHota commented 4 years ago

As demonstrated by https://github.com/HiromuHota/fonduer-tutorials/runs/883723629, the proposed fix #484 reduces the elapsed time to about 1 minute.

Fri, 17 Jul 2020 23:21:41 GMT CPU times: user 49.5 s, sys: 3.21 s, total: 52.8 s
Fri, 17 Jul 2020 23:21:41 GMT Wall time: 1min 2s
Fri, 17 Jul 2020 23:21:46 GMT (28935, 27735)
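
For readers unfamiliar with this kind of fix, the following is a rough sketch of one way to avoid a per-feature membership test against a long key list: look each key up directly in a key-to-column dict. This is not necessarily what #484 does, and `convert_mappings_to_coo` is a hypothetical stand-in written for illustration, not Fonduer's `_convert_mappings_to_matrix`. It only assumes the `{"keys": [...], "values": [...]}` mapping shape visible in the traceback.

```python
from typing import Any, Dict, List, Optional, Sequence, Tuple


def convert_mappings_to_coo(
    mappings: Sequence[Optional[Dict[str, List[Any]]]],
    keys: Sequence[str],
) -> Tuple[List[int], List[int], List[Any]]:
    """Collect (row, column, value) triples for a sparse matrix.

    Hypothetical helper: one row per mapping, one column per entry in `keys`.
    """
    # Build the key -> column index once, so each lookup is a hash lookup.
    keys_map = {key: col for col, key in enumerate(keys)}

    rows: List[int] = []
    cols: List[int] = []
    data: List[Any] = []
    for row, mapping in enumerate(mappings):
        if not mapping:
            continue
        for key, value in zip(mapping["keys"], mapping["values"]):
            # A single dict lookup replaces `key in keys` plus `keys_map[key]`.
            col = keys_map.get(key)
            if col is not None:
                rows.append(row)
                cols.append(col)
                data.append(value)
    return rows, cols, data


# Usage with made-up data; "zzz" is not a known key and is skipped.
mappings = [
    {"keys": ["a", "c"], "values": [1, 2]},
    {"keys": [], "values": []},
    {"keys": ["b", "zzz"], "values": [3, 4]},
]
rows, cols, data = convert_mappings_to_coo(mappings, ["a", "b", "c"])
print(rows, cols, data)
```

The resulting triples could then be passed to a sparse constructor such as `scipy.sparse.csr_matrix((data, (rows, cols)), shape=(len(mappings), len(keys)))`.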