Fetch attribute order inconsistent with certain joins

guzman-raphael commented 3 years ago

Bug Report

Description

Unsure whether this is a bug or whether we simply do not guarantee a key order when as_dict=True.

What was observed is that in some cases, when fetching with fetch(..., as_dict=True)[0].values(), the attribute order is inconsistent compared to fetch1(...). Perhaps it is related to the join? The intent behind doing this is that it is sometimes desirable to have the data arranged together per row.

Reproducibility

Include:

OS: Linux
Python Version: 3.8.1
MySQL Version: 5.7
MySQL Deployment Strategy: local-docker
DataJoint Version: 0.12.7

Steps:

import datajoint as dj
import numpy as np
from faker import Faker # pip install Faker
faker = Faker()
Faker.seed(0) # Pin down randomizer between runs

schema = dj.Schema('test2')

@schema 
class Plane(dj.Lookup):
  definition = """
  # Defines manufacturable plane model types
  plane_type    : varchar(25) # Name of plane model
  ---
  plane_rows    : int         # Number of rows in plane model i.e. range(1, plane_rows + 1)
  plane_columns : int         # Number of columns in plane model; to extract letter we will need these indices
  """
  contents = [('B_Airbus', 37, 4), ('F_Airbus', 40, 5)]

@schema 
class Airport(dj.Lookup):
  definition = """
  # Defines airport locations that can serve as origin or destination
  airport_code : int         # Airport's unique identifier
  ---
  airport_city : varchar(25) # Airport's city
  """
  contents = [(i, faker.city()) for i in range(1, 8)]

@schema
class Flight(dj.Lookup):
  definition = """
  # Defines specific planes assigned to a route
  flight_id            : int                         # Flight's unique identifier
  ---
  -> Plane                                           # Flight's plane model specs; this will simply create a relation to Plane table but not have the constraint of uniqueness
  flight_economy_price : float                       # Flight's fare price
  flight_departure     : datetime                    # Flight's departure time
  flight_arrival       : datetime                    # Flight's arrival time
  -> Airport.proj(flight_origin_code='airport_code') # Flight's origin; by using proj in this way we may rename the relation in this table
  -> Airport.proj(flight_dest_code='airport_code')   # Flight's destination
  """
  contents = [dict(flight_id = i + 1,
                   plane_type = p,
                   flight_economy_price = 456.23 + i, 
                   flight_departure = faker.date_time_this_month(),
                   flight_arrival = faker.date_time_this_month(),
                   flight_origin_code = i + 3,
                   flight_dest_code = i + 4) for i, p in enumerate(['F_Airbus',
                                                                    'B_Airbus'])]

q = Flight * Plane * Airport.proj(flight_dest_city='airport_city',
                                flight_dest_code='airport_code')

# works
# flight_id, plane_rows, plane_columns, flight_economy_price, flight_dest_city = (q & 'flight_id=1').fetch1('flight_id',
#                                                                                                           'plane_rows',
#                                                                                                           'plane_columns',
#                                                                                                           'flight_economy_price',
#                                                                                                           'flight_dest_city')
# does not work
flight_id, plane_rows, plane_columns, flight_economy_price, flight_dest_city = tuple(q.fetch('flight_id',
                                                                                           'plane_rows',
                                                                                           'plane_columns',
                                                                                           'flight_economy_price',
                                                                                           'flight_dest_city',
                                                                                           as_dict=True)[0].values())
# Order is not consistent...
assert isinstance(flight_id, np.int64)
assert isinstance(plane_rows, np.int64)
assert isinstance(plane_columns, np.int64)
assert isinstance(flight_economy_price, np.float64)
assert isinstance(flight_dest_city, str)

Error Stack:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-159-2475c9aa1099> in <module>
   68 # Order is not consistent...
   69 assert isinstance(flight_id, np.int64)
---> 70 assert isinstance(plane_rows, np.int64)
   71 assert isinstance(plane_columns, np.int64)
   72 assert isinstance(flight_economy_price, np.float64)

AssertionError:

Expected Behavior

Order to be consistent over varying output formats.

Additional Research and Context

This was noticed when preparing an answer to this StackOverflow question.

ixcat commented 3 years ago

bare .fetch on dictionaries gives list-of-dictionaries where each dictionary contains the requested attributes (vs as-dict fetch without attribute arguments, which returns list-of-dictionaries containing all fields) - this is in contrast to numpy fetching which returns multiple N-length arrays containing A1, A2, etc when given an attribute list, and a combined array with fields when fetched with no attribute list

that said, dictionary contents are in fact table-ordered (and not fetch-argument-ordered):

>>> djmon.Event().fetch('ev_datetime', 'ev_type_id', as_dict=True)
[{'ev_datetime': datetime.datetime(2020, 12, 23, 4, 39), 'ev_type_id': 1}, {'ev_datetime': datetime.datetime(2020, 12, 22, 22, 15, 17), 'ev_type_id': 2}, {'ev_datetime': datetime.datetime(2020, 12, 22, 0, 0), 'ev_type_id': 4}]
>>> djmon.Event().fetch('ev_type_id', 'ev_datetime', as_dict=True)
[{'ev_datetime': datetime.datetime(2020, 12, 23, 4, 39), 'ev_type_id': 1}, {'ev_datetime': datetime.datetime(2020, 12, 22, 22, 15, 17), 'ev_type_id': 2}, {'ev_datetime': datetime.datetime(2020, 12, 22, 0, 0), 'ev_type_id': 4}]
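Given that ordering, one way to avoid depending on dictionary key order at all is to look values up by name instead of unpacking `.values()`. The following is only a minimal sketch in plain Python: the `row` dictionary is a hypothetical stand-in for one element of a `fetch(..., as_dict=True)` result, with made-up values, and no DataJoint call is involved.

```python
from operator import itemgetter

# Hypothetical row shaped like one element of fetch(..., as_dict=True).
# The keys are deliberately NOT in the order we want to unpack them.
row = {
    "flight_id": 1,
    "flight_economy_price": 456.23,
    "plane_rows": 40,
    "plane_columns": 5,
    "flight_dest_city": "Example City",
}

# The order we actually want, independent of the dictionary's key order.
attrs = (
    "flight_id",
    "plane_rows",
    "plane_columns",
    "flight_economy_price",
    "flight_dest_city",
)

# itemgetter looks each key up by name, so the resulting tuple follows
# attrs rather than the order in which the keys happen to be stored.
(flight_id, plane_rows, plane_columns,
 flight_economy_price, flight_dest_city) = itemgetter(*attrs)(row)
```

Unpacking this way should give the same assignment regardless of whether the dictionary is table-ordered or argument-ordered.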

our fetch semantics have some 'inner logic' which I think we can explain better (good example is fetch1 which returns dictionary vs fetch which returns ndarray with fields, as fetch1 is often used when doing 'database searching' so dictionary keys are desired, whereas bare fetch is used more in computation, so numpy is desired) - not sure how/if this should/should-not factor into resolving/not resolving this issue ..

dimitri-yatsenko commented 3 years ago

Key ordering was not even a python feature until Python 3.6, so at the time this was designed and specified, this would not have come up.

dimitri-yatsenko commented 3 years ago

I would not consider this a bug. It was not our spec to preserve the order. Perhaps an enhancement.

guzman-raphael commented 3 years ago

@dimitri-yatsenko Good note on Python feature release schedule. Wasn't sure myself when I opened this if it should be a bug or not. With that in mind, I can go ahead and convert this into a new feature request.

Side Note on Key Ordering: OrderedDict has been around since Python 3.1 before it became expected of normal dict in Python 3.6. Not sure exactly when we started introducing it into our codebase and if it has been exposed to users.

dimitri-yatsenko commented 3 years ago

Yes, we used OrderedDict in many places in the code, but we took as_dict to mean to return normal python dicts which were not ordered. The nominal use of dicts in python was and mostly remains as a hash table for looking values by keys. I would mark this as an enhancement since no new functionality is added.
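For reference, the difference between the two dictionary types can be sketched in plain Python. This assumes Python 3.7 or later, where insertion order of a normal dict is a language guarantee; the keys and values are illustrative only.

```python
from collections import OrderedDict

# Same content, inserted in a different order.
a = {"ev_datetime": "2020-12-23", "ev_type_id": 1}
b = {"ev_type_id": 1, "ev_datetime": "2020-12-23"}

# Iteration follows insertion order for plain dicts (Python 3.7+).
keys_a = list(a)
keys_b = list(b)

# Plain dict equality ignores order: a dict is still a hash table
# for looking up values by key.
plain_equal = (a == b)

# Equality between two OrderedDicts is order-sensitive.
ordered_equal = (OrderedDict(a) == OrderedDict(b))
```

So code that only looks values up by key behaves the same either way, while code that iterates or unpacks `.values()` depends on the insertion order.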

guzman-raphael commented 3 years ago

@ixcat Thanks for clearing this up and providing your example. It may be that what this is getting at is that the fetch result ordering is not as intuitive as I initially thought.

Perhaps this is simply a documentation point. We could indicate that fetch results respect the table/query attribute order, and that if a particular attribute order is required, the user should take care to design the query accordingly. @ixcat @dimitri-yatsenko What do you think?

ixcat commented 3 years ago

agree it would be good to review the fetch/fetch1 documentation since this issue as presented is somewhat 2.5-fold:

1. attribute fetch as_dict vs attribute fetch returning different 'shapes' (N-valued-list for as_dict vs N-list of N-values without; as in the initial presentation)
2. attribute order within as_dict dictionaries not matching attribute order in fetches (OrderedDict portion of conversation)
2.5. how documentation may/may not contribute to this confusion in 1&2
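to illustrate the 'shapes' in point 1, here is a rough sketch using plain Python lists in place of real fetch results (the `rows` data is copied from the Plane contents above; nothing here calls DataJoint, and real attribute fetches return numpy arrays rather than lists):

```python
# Row-wise shape: one dictionary per row, as with as_dict=True.
rows = [
    {"plane_type": "B_Airbus", "plane_rows": 37},
    {"plane_type": "F_Airbus", "plane_rows": 40},
]

# Attributes in the order they were requested.
attrs = ("plane_rows", "plane_type")

# Column-wise shape: one sequence per requested attribute,
# in argument order, each holding one value per row.
columns = [[row[name] for row in rows] for name in attrs]
plane_rows, plane_type = columns

# Going back to the row-wise shape, this time with keys in argument order.
rebuilt = [dict(zip(attrs, values)) for values in zip(*columns)]
```

the column-wise form unpacks naturally into one variable per attribute, while the row-wise form keeps each row's values together.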

guzman-raphael commented 3 years ago

BTW, here is the 'confusion' that I'm specifically referring to when it comes to order around fetch/fetch1.

Using the Plane table from my original example:

import datajoint as dj

schema = dj.Schema('test3')

@schema 
class Plane(dj.Lookup):
    definition = """
    # Defines manufacturable plane model types
    plane_type    : varchar(25) # Name of plane model
    ---
    plane_rows    : int         # Number of rows in plane model i.e. range(1, plane_rows + 1)
    plane_columns : int         # Number of columns in plane model; to extract letter we will need these indices
    """
    contents = [('B_Airbus', 37, 4), ('F_Airbus', 40, 5)]

# Fetch
print(Plane().fetch()) # returns a NumPy structured array with one record per row; fields follow the attribute order of the query
print(Plane().fetch(as_dict=True)) # returns a list of dicts, one per row; all attributes, with keys ordered as in the query (key order not guaranteed before Python 3.6)
print(Plane().fetch('plane_rows', 'plane_type')) # returns one NumPy array per attribute; order follows the argument order in fetch
print(Plane().fetch('plane_rows', 'plane_type', as_dict=True)) # returns a list of dicts, one per row; only the requested attributes, with keys ordered as in the query (key order not guaranteed before Python 3.6)

# Fetch1
q = Plane() & dict(plane_type='B_Airbus')
print(q.fetch1()) # returns a dict representing the row; all attributes, with keys ordered as in the query (key order not guaranteed before Python 3.6)
print(q.fetch1('plane_rows', 'plane_type')) # returns a tuple representing the row; only the requested attributes, ordered by the argument order in fetch

Essentially, I'd say we either need better documentation on how args/kwargs impact results, or we should address this to get a bit more consistency (though this is a delicate part of DataJoint, and backward compatibility would need to be ensured since it is directly user-facing).
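As a rough idea of what the "more consistency" option could mean for as_dict results, here is a sketch of a user-side helper that re-keys each row in argument order. `reorder_rows` is a hypothetical name, not part of DataJoint, and the input rows are made-up stand-ins for a fetch result.

```python
def reorder_rows(rows, attrs):
    """Hypothetical helper: copy each row dict with its keys in attrs order."""
    return [{name: row[name] for name in attrs} for row in rows]


# Stand-in for a table-ordered as_dict result.
fetched = [
    {"plane_type": "B_Airbus", "plane_rows": 37, "plane_columns": 4},
    {"plane_type": "F_Airbus", "plane_rows": 40, "plane_columns": 5},
]

# Re-key in the order the attributes would have been requested.
reordered = reorder_rows(fetched, ("plane_rows", "plane_type"))
```

Because it only relies on key lookup, this works on the current table-ordered output and would remain harmless if the ordering were later changed inside fetch itself.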

ixcat commented 3 years ago

from discussion - current issue only refers to possible reworking of attribute order within retrieved dictionary elements; documentation portion of the issue covered separately (issue: https://github.com/datajoint/datajoint-docs/issues/101 should cover it)
