Living-with-machines / DeezyMatch

A Flexible Deep Learning Approach to Fuzzy String Matching
https://living-with-machines.github.io/DeezyMatch/

csv_sep is currently fixed to \t. Make this more general. #38

Open kasra-hosseini opened 4 years ago

kasra-hosseini commented 4 years ago

@mcollardanuy wrote:

Hi, I'm afraid that in some scenarios this could discard many rows if csv_sep is, for example, a comma, since it is not uncommon for a comma to be part of an entity name (e.g. "Smith, John" if we try to link person names). Our current solution is not sensitive to quoted text (the original code was, but we had that strange parsing bug). That's why I suggested tab as the only accepted delimiter for now: we'd rarely expect a tab to be part of a query or candidate. What do you think?
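
To make the concern concrete, here is a minimal sketch of the failure mode, not DeezyMatch's actual parsing code. It assumes a three-column layout (query, candidate, label) and that rows with the wrong number of fields are dropped. The helper names `naive_rows` and `quoted_rows` and the sample data are hypothetical.

```python
import csv
import io

EXPECTED_FIELDS = 3  # assumed layout: query, candidate, label


def naive_rows(text, sep):
    """Split each line on sep, ignoring quotes; drop rows with the wrong field count."""
    kept = []
    for line in text.splitlines():
        fields = line.split(sep)
        if len(fields) == EXPECTED_FIELDS:
            kept.append(fields)
    return kept


def quoted_rows(text, sep):
    """Parse with csv.reader, which treats a quoted field as a single unit."""
    reader = csv.reader(io.StringIO(text), delimiter=sep, quotechar='"')
    return [row for row in reader if len(row) == EXPECTED_FIELDS]


# Hypothetical sample data: the same two pairs, comma-separated and tab-separated.
comma_data = '"Smith, John",John Smith,TRUE\nLondon,Londres,TRUE\n'
tab_data = 'Smith, John\tJohn Smith\tTRUE\nLondon\tLondres\tTRUE\n'

print(len(naive_rows(comma_data, ",")))   # 1: the "Smith, John" row splits into 4 fields and is dropped
print(len(naive_rows(tab_data, "\t")))    # 2: the comma inside the name is harmless
print(len(quoted_rows(comma_data, ",")))  # 2: quote-aware parsing keeps the row
```

With a quote-insensitive split, a comma delimiter silently loses the row containing "Smith, John", while a tab delimiter keeps both rows. A quote-aware parser such as the standard library's `csv.reader` would also keep both, which is one possible route to a more general `csv_sep`, provided the earlier parsing bug can be avoided.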