bluesky / tiled

API to structured data
https://blueskyproject.io/tiled
BSD 3-Clause "New" or "Revised" License
59 stars 50 forks source link

Search by set fails #376

Closed cjtitus closed 1 year ago

cjtitus commented 1 year ago

Summary

It would be nice to be able to search Tiled with either a set of values, or a single value. Both of these operations fail in different, and not very clean ways.

If searching via sets and single values is not going to be allowed, it sure would be nice to throw some nice exception, so that we don't chuck pymongo errors at users.

Details

Given a catalog, searching by a collection of values is a perfectly nice thing to do

cat = from_profile('ucal')
cat.search(In('group_md.name', ['pristine']))
<Catalog {5, 6, 7, 8, 9, 10}>

However, if this collection of values is a set you get an error

cat.search(In('group_md.name', {'pristine'}))
...
File ~/.conda/envs/ucal/lib/python3.9/json/encoder.py:179, in JSONEncoder.default(self, o)
    160 def default(self, o):
    161     """Implement this method in a subclass such that it returns
    162     a serializable object for ``o``, or calls the base implementation
    163     (to raise a ``TypeError``).
   (...)
    177 
    178     """
--> 179     raise TypeError(f'Object of type {o.__class__.__name__} '
    180                     f'is not JSON serializable')

TypeError: Object of type set is not JSON serializable

As set is a perfectly nice way to store keys that you may want to search for (in fact, it's nicer than a list! Values are guaranteed to be unique!), it would be great to be able to search by a set of values.

Searching for just a single value also fails with an extremely long error trace, that ends with an OperationFailure

cat = from_profile('ucal')
cat.search(In('group_md.name', 'pristine'))
...
File ~/.conda/envs/ucal/lib/python3.9/site-packages/pymongo/network.py:160, in command(sock_info, dbname, spec, is_mongos, read_preference, codec_options, session, client, check, allowable_errors, address, listeners, max_bson_size, read_concern, parse_write_concern_error, collation, compression_ctx, use_op_msg, unacknowledged, user_fields, exhaust_allowed)
    158             client._process_response(response_doc, session)
    159         if check:
--> 160             helpers._check_command_response(
    161                 response_doc,
    162                 sock_info.max_wire_version,
    163                 allowable_errors,
    164                 parse_write_concern_error=parse_write_concern_error,
    165             )
    166 except Exception as exc:
    167     if publish:

File ~/.conda/envs/ucal/lib/python3.9/site-packages/pymongo/helpers.py:180, in _check_command_response(response, max_wire_version, allowable_errors, parse_write_concern_error)
    177 elif code == 43:
    178     raise CursorNotFound(errmsg, code, response, max_wire_version)
--> 180 raise OperationFailure(errmsg, code, response, max_wire_version)

OperationFailure: $in needs an array, full error: {'ok': 0.0, 'errmsg': '$in needs an array', 'code': 2, 'codeName': 'BadValue', '$clusterTime': {'clusterTime': Timestamp(1671211679, 1), 'signature': {'hash': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', 'keyId': 0}}, 'operationTime': Timestamp(1671211679, 1)}

It looks like under the hood, mongo needs a list? It would be nice to at least check if you pass in a string, and reject such keys with a much shorter and more informative error message, such as a ValueError or something.

padraic-shafer commented 1 year ago

Would it be sufficient to change the List typing to Sequence, then cast to list when encoding the query in In and NotIn?

https://github.com/bluesky/tiled/blob/52bede5ea1768b83807556ec7cae8bc147535dc4/tiled/queries.py#L335-L339

...becomes...

 key: str 
 value: Sequence[JSONSerializable] 

 def encode(self): 
     return {"key": self.key, "value": json.dumps(list(self.value))} 

This won't work for a single value, but that case should probably get handled differently anyway.


A similar change might be needed for Spec to handle sets or tuples....

https://github.com/bluesky/tiled/blob/52bede5ea1768b83807556ec7cae8bc147535dc4/tiled/queries.py#L397-L398

...becomes...

    include: Sequence[str]
    exclude: Sequence[str]
danielballan commented 1 year ago

That solution sounds good to me. Would you be up for a PR, @cjtitus?

cjtitus commented 1 year ago

Seems like @pshafer-als has already written the code -- would they like to submit a PR?

If not, I'll get around to it sooner or later.

padraic-shafer commented 1 year ago

Sure. I'll submit PR in the next few days. All of you will be welcome to modify it as needed.

padraic-shafer commented 1 year ago

PR #379 enables generic sequences to be used for the value pass to In(key, value) and NotIn(key, value). This functionality already existed for Specs.

As a consequence In and NotIn now accept a string that is passed as value, but the behavior is not what was requested when this Issue was submitted. Instead, the string is converted to a list of characters...each of which is compared to the key. It's simple enough to change this behavior, but more discussion is probably needed.

If a user passes a string value to In or NotIn, will they generally be expecting it to be used as an array of single-character keys? Maybe, because this is a common idiom in python. But as clearly expressed in this Issue, the user might instead expect that that string is converted to a one-element list that contains the provided string. This ambiguity is probably good reason to disallow a string where a list is expected.

Options

  1. Allow string value, to be used as list of characters. This is the current behavior. Should add an explicit test, such as in branch test-string-as-list.
  2. Disallow string value; raise a ValueError if attempted. This is consistent with the previous behavior of include and exclude parameters for Specs. See branch disallow-string-for-query-list
  3. Allow string value; convert it into a 1-item list that contains the string. Branch allow-string-for-query-list does this for In, NotIn, and Specs. To keep this behavior consistent, this branch explicitly removes the ValueError checks from Specs. This might have other consequences.

What do people think about these options?

cjtitus commented 1 year ago

I think that option 1 is potentially confusing. I very rarely want to treat a string as an array of characters, and have never contemplated using a string as an "array" of length-1 keys to be passed into a search function. I would think that most people who naively pass a string in without reading the documentation carefully will expect it to be parsed as a single key.

I personally do a lot of string-to-1-item-list conversions when I write functions, because I very often want to write functions that operate on one or more strings, and never want to treat a string as an array of characters. I would find option 3 to be the most convenient.

However, I think that the safest thing is probably option 2. If someone passes a string, throw a ValueError and make them explicitly either pass it as a 1-item list, or make an actual array of characters.

tldr; I would be happy keeping the behavior consistent, and disallowing string values.

padraic-shafer commented 1 year ago

@danielballan What do you think about taking this one step further by making a judgement call on the three options above?

padraic-shafer commented 1 year ago

FWIW, I agree that Option 2 provides clear expectation and probably covers the most common use case.

If someone wanted the behavior of Option 1, they could cast the character string to iter or list. Similar logic applies for Option 3: the string can be passed as a 1-item sequence parameter. This also avoids the ambiguity of substring matching--i.e., whether In('op_key', 'candy') matches when {op_key: 'and'} is defined.