Datatamer / tamr-client

Programmatically interact with Tamr
https://tamr-client.readthedocs.io
Apache License 2.0
11 stars 25 forks source link

Remove `simplejson` as a dependency #455

Closed skalish closed 4 years ago

skalish commented 4 years ago

💬 RFC

simplejson is now the only dependency in this project whose build is machine-dependent (i.e. if you pip download on MacOS, the result cannot be used on Ubuntu). Additionally, to my knowledge it is only used to a very limited extent: to allow a toggle to ignore_nan values in the record update operation of tamr_unify_client. This means a record (as dictionary) containing at least one NaN that is upserted to Tamr with the client will cause an HTTPError if ignore_nan == True or will have this value coerced to null if ignore_nan == False.

  1. This dependency is not present in tamr_client, however, there are no plans to remove tamr_unify_client even after the official promotion of the BETA to the primary implementation. For this reason, I think it is important that simplejson be removed to ease the installation process.

  2. If it is decided that we remove this, the ignore_nan functionality needs to either be added in its place or be dropped as an unneeded feature. In the former case, it might also be prudent to add similar functionality to tamr_client.

Related to discussion and decisions in #194 and #209.

skalish commented 4 years ago

Regarding the suggestion to allow flexibility in passing other json_args https://github.com/Datatamer/tamr-client/pull/209#discussion_r307306802: while this is a good point, I am unaware of any cases where this functionality is being taken advantage of, but I have seen at least two cases where getting tamr-unify-client installed on a customer machine without access to the internet was quite painful.

pcattori commented 4 years ago

If the main issue is that including NaN will silently fail or fail downstream, we could use the allow_nan parameter of the built-in json.dumps function (setting allow_nan=False) to fail fast.

Then users can fix their records before passing them to TC functions with allow_nan=False.

skalish commented 4 years ago

The original issue #194 was about including the option to ignore NaNs, that is, having tamr_client take care of the conversion from NaN to null if the user chooses. The allow_nan=False solution does avoid the original problematic behavior of downstream 400 errors, but this should be recognized as an intentional decision to no longer provide NaN->null functionality (which is relied upon by some field code, though I am unsure if any of it is still in use).