anhaidgroup / py_entitymatching

Let users define sensible feature table attrs for blackbox features. #128

Closed christiemj09 closed 4 years ago

christiemj09 commented 4 years ago

For environments that make use of both auto-generated and blackbox features, feature generation can take up to a day on tables with millions of records and tens of features. To optimize the feature generation process, it would be useful to have non-null feature metadata for blackbox features, in addition to the metadata that is already implemented for built-in features. This PR is intended as a first step towards optimizing feature generation: it allows developers to manually provide sensible attributes for blackbox features in the feature table.

christiemj09 commented 4 years ago

Failed runs in Travis are addressed in #129.

jatinarora2409 commented 4 years ago

Hi Matt, I was going through this PR. Can you tell me where exactly the metadata is used in py_entitymatching? That will help me check whether anything breaks.

Alternatively, can you explain why making this change helps you?

christiemj09 commented 4 years ago

@jatinarora2409 Certainly! PyMatcher currently defines feature metadata for auto-generated features; see py_entitymatching/feature/autofeaturegen.py. This metadata takes the form of attributes in the feature table returned from get_features().

As for where this metadata is used in the project, I would recommend grepping around the source code and taking a look for yourself. Example:

PROJECT_ROOT="path/to/py_entitymatching"  # Example value
cd "$PROJECT_ROOT"  # Quoted so that paths containing spaces still work

# Taking `left_attribute` as an example of feature metadata
grep -Rin left_attribute py_entitymatching

Generally, it looks like there's some interaction between parsing features from feature strings and the feature metadata, though it seems that the metadata is primarily descriptive/helper info at this point.

As for checking whether this change breaks anything or not, isn't that what the tests are for? Feel free to do your own manual checks, but in theory this is what the test suite should be doing. If you find a useful test case when doing manual QA, please add it to the test suite so that your manual efforts aren't one-off "throw-away" work.

Finally, some context on the current change and why it's desirable. Feature metadata is set at feature creation time for auto-generated features. For blackbox features, the same metadata is set to null without a way to set it at feature creation time. This change allows users to set metadata for blackbox features at feature creation time without having to perform ad-hoc manipulations on the feature table post-creation.
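The difference can be illustrated with a minimal sketch. This is not the code in this PR and not the py_entitymatching API: plain dicts stand in for feature-table rows, `make_blackbox_feature` is a hypothetical helper, and `right_attribute` is an assumed metadata name alongside the `left_attribute` mentioned above.

```python
# Hypothetical sketch: plain dicts stand in for rows of the feature table.
# Assumed metadata names; only `left_attribute` is named in this thread.
METADATA_KEYS = ("left_attribute", "right_attribute")


def make_blackbox_feature(name, function, **metadata):
    """Build a feature-table row for a user-supplied (blackbox) function."""
    unknown = set(metadata) - set(METADATA_KEYS)
    if unknown:
        raise ValueError(f"unknown metadata keys: {sorted(unknown)}")
    row = {"feature_name": name, "function": function, "is_auto_generated": False}
    # Metadata the caller does not supply stays null, as before the change.
    for key in METADATA_KEYS:
        row[key] = metadata.get(key)
    return row


def name_length_diff(ltuple, rtuple):
    """Example blackbox feature: absolute difference in name length."""
    return abs(len(ltuple["name"]) - len(rtuple["name"]))


# Old behavior: the metadata is null.
without_metadata = make_blackbox_feature("name_len_diff", name_length_diff)

# With the change: the metadata is supplied at creation time.
with_metadata = make_blackbox_feature(
    "name_len_diff",
    name_length_diff,
    left_attribute="name",
    right_attribute="name",
)
```

In the first call the row carries null metadata and would need to be patched afterwards; in the second, the row is complete as soon as it is created.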

Though it is primarily descriptive info now, the feature metadata could be used in alternative implementations of feature.extractfeatures.extract_feature_vecs(). The idea here is to standardize the interface that the feature table provides for features, so that blackbox features aren't artificially distinct from auto-generated ones. Without that, developers are forced to "code around" blackbox features in alternative implementations of feature extraction.
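As a rough sketch of what such an alternative implementation might rely on (again hypothetical, using plain dicts rather than the real feature table, with assumed column names), an extractor could read the metadata to work out which columns it needs and then treat every feature the same way:

```python
def required_attributes(feature_rows):
    """Collect the left/right columns that the given features read."""
    left, right = set(), set()
    for row in feature_rows:
        if row["left_attribute"] is None or row["right_attribute"] is None:
            # Without metadata, an extractor must special-case this feature.
            raise ValueError(
                f"feature {row['feature_name']!r} has no attribute metadata"
            )
        left.add(row["left_attribute"])
        right.add(row["right_attribute"])
    return left, right


def extract_feature_vec(feature_rows, ltuple, rtuple):
    """Apply every feature to one tuple pair, passing only the needed columns."""
    left_cols, right_cols = required_attributes(feature_rows)
    lview = {col: ltuple[col] for col in left_cols}
    rview = {col: rtuple[col] for col in right_cols}
    return {
        row["feature_name"]: row["function"](lview, rview)
        for row in feature_rows
    }


features = [
    # Stands in for an auto-generated feature.
    {"feature_name": "name_exact", "left_attribute": "name",
     "right_attribute": "name",
     "function": lambda l, r: int(l["name"] == r["name"])},
    # Stands in for a blackbox feature whose metadata was set at creation.
    {"feature_name": "zip_prefix", "left_attribute": "zip",
     "right_attribute": "zipcode",
     "function": lambda l, r: int(l["zip"][:3] == r["zipcode"][:3])},
]

left_record = {"id": 1, "name": "Ann Lee", "zip": "53706", "notes": "unused"}
right_record = {"id": 7, "name": "Ann Lee", "zipcode": "53711", "notes": "unused"}
vec = extract_feature_vec(features, left_record, right_record)
```

Because both rows expose the same metadata, the extractor has no branch for blackbox features; a row with null metadata would instead hit the `ValueError` path, which is the "code around" case described above.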

jatinarora2409 commented 4 years ago

Yes, I also think that it is currently just informational and does not have any functional use. I was unsure about that, so I wanted to confirm it with you.

christiemj09 commented 4 years ago

Rebased on top of changes from #138; merging now.