georgian-io-archive / foreshadow

An automatic machine learning system
https://foreshadow.readthedocs.io
Apache License 2.0

The calculation in metrics.regex_rows() is not consistent with the documentation #161

Open jichaoz opened 4 years ago

jichaoz commented 4 years ago

https://github.com/georgianpartners/foreshadow/blob/c2c213e0009cfdcf0aa9df75f0a6cf4c983d7090/foreshadow/metrics.py#L184

Here, before the sum, each row should contribute a value of 0 or 1. Instead, each row contributes the length of its match, which leads to a final score larger than 1. Here is the code to reproduce the issue:

import pandas as pd
from foreshadow.concrete import DollarFinancialCleaner

x = pd.DataFrame({'price': ['$3', '$5.0', '$5,000.00']})
financial_cleaner = DollarFinancialCleaner()
metric = financial_cleaner.metric_score(x)
print(metric)

The expected value is 1, but the result is 4.2 instead.
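To illustrate the difference, here is a minimal standalone sketch that does not use foreshadow at all. The regex `DOLLAR_PATTERN` and both helper functions are hypothetical and written only for this example; the pattern is an assumption and is not the one foreshadow uses, so the inflated number it produces differs from the 4.2 reported above.

```python
import re

# Hypothetical pattern for dollar amounts; NOT the regex foreshadow actually uses.
DOLLAR_PATTERN = re.compile(r"\$\d[\d,]*(?:\.\d+)?")


def match_length_score(values, pattern):
    """Problematic variant: averages the matched length, so it can exceed 1."""
    total = 0
    for value in values:
        match = pattern.search(str(value))
        # Each row contributes the length of the matched text, not 0 or 1.
        total += len(match.group(0)) if match else 0
    return total / len(values)


def matched_row_fraction(values, pattern):
    """Intended variant: each row contributes 0 or 1, so the score stays in [0, 1]."""
    matched = sum(1 for value in values if pattern.search(str(value)))
    return matched / len(values)


prices = ["$3", "$5.0", "$5,000.00"]
length_score = match_length_score(prices, DOLLAR_PATTERN)
row_score = matched_row_fraction(prices, DOLLAR_PATTERN)

print(length_score)  # 5.0 with this assumed pattern: (2 + 4 + 9) / 3
print(row_score)     # 1.0: all three rows match
```

If the documentation describes the metric as the fraction of matching rows, the second function is the behavior I would expect from metrics.regex_rows().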

jichaoz commented 4 years ago

@cchoquette, can you take a look?