dedupeio / dedupe

A Python library for accurate and scalable fuzzy matching, record deduplication and entity resolution.
https://docs.dedupe.io
MIT License
4.08k stars, 549 forks

Accept file paths in addition to file-like objects #935

Closed by NickCrews 2 years ago

NickCrews commented 2 years ago

The

  1. settings_file argument to the Static{Dedupe,RecordLink,Gazetteer} constructors,
  2. training_file argument to the prepare_training() methods,
  3. file_obj argument to the write_training() and write_settings() methods,

are all expected to be TextIO or BinaryIO objects. Could we expand the API to accept file paths as well? It would make calling code simpler by avoiding the with open(...) as f: ... boilerplate, and I don't see where it would lead to any ambiguity. Thanks to the nice architecture built on abstract base classes, it doesn't look like a large maintenance burden either: from a quick browse, each of these methods would need a change in only one place, and all of them could share a single _open_path_like() utility function.

The one thing I'm not sure about is how to decide what counts as "path-like". Maybe check isinstance(x, str) or isinstance(x, pathlib.Path)? Or just try to open() the argument and see if that succeeds?

I'd be happy to write a PR if this looks like an idea worth pursuing. Thanks for the library!

fgregg commented 2 years ago

nick, i don't think this needs to be handled by this library. thanks for the idea!