SFrame.apply() methods with side-effects not working as expected

afranklin commented 6 years ago

Observed Behavior:

Passing a method with side effects to apply() on an SFrame only seems to produce those side effects for the first 10 rows, even though the SArray returned by apply() has the correct length and correctly updated values.

Perhaps this is because the apply() call is being handled by different sub-processes, so the side effects are not available to the caller? If that’s the case, maybe it’s possible to just include a warning about this behavior in the docs, since this is different from the behavior in Pandas.

Example code below.


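import turicreate as tc

# list(...) keeps the constructor argument a plain list on Python 3 as well.
data = tc.SFrame(list(range(20)))
applied_rows = []  # side-effect target; lives in the calling process

def process(row):
    applied_rows.append(row)  # side effect: record every row we are called with
    return row['X1'] + 100

modified_data = data.apply(lambda row: process(row))

# print() with a single argument behaves the same on Python 2 and Python 3.
print("Data length: {}".format(len(data)))
print("Modified Data length: {}".format(len(modified_data)))
print("applied_rows length: {}".format(len(applied_rows)))
print(modified_data)
print(applied_rows)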

Output:

Data length: 20
Modified Data length: 20
applied_rows length: 10
[100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119]
[{'X1': 0}, {'X1': 1}, {'X1': 2}, {'X1': 3}, {'X1': 4}, {'X1': 5}, {'X1': 6}, {'X1': 7}, {'X1': 8}, {'X1': 9}]
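
The sub-process hypothesis from the report can be illustrated without Turi Create at all. The following is a minimal sketch using only the standard library, assuming a Unix-like OS where the "fork" start method is available; run_in_child and _worker are hypothetical helper names, and nothing here claims to mirror how SFrame.apply() is actually implemented. It only shows that a list mutated inside a child process stays unchanged in the parent, while return values still come back.

```python
import multiprocessing as mp

applied_rows = []  # lives in the parent (calling) process


def process(value):
    applied_rows.append(value)  # side effect on a module-level list
    return value + 100


def _worker(values, conn):
    # Runs in the child process, so the appends made by process() land in
    # the child's own copy of applied_rows, not the parent's.
    conn.send([process(v) for v in values])
    conn.close()


def run_in_child(values):
    # Assumption: a Unix-like OS where the "fork" start method exists.
    ctx = mp.get_context("fork")
    parent_conn, child_conn = ctx.Pipe()
    child = ctx.Process(target=_worker, args=(values, child_conn))
    child.start()
    results = parent_conn.recv()  # return values are shipped back to the parent
    child.join()
    return results


if __name__ == "__main__":
    results = run_in_child(list(range(5)))
    print(results)            # [100, 101, 102, 103, 104]
    print(len(applied_rows))  # 0: the parent's list was never touched
```

The return values survive the process boundary because they are sent back explicitly; the appended rows do not, because nothing sends them back.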

afranklin commented 6 years ago

Sounds like the biggest issue here is that the side effects are inconsistent: the first n rows can have side effects (because the lambda runs in-process for type inference) but subsequent rows cannot (because the lambda runs out of process). Two possible fixes to make this a better UX:

1. Run the type inference pass in a separate process, just like the real lambda workers -- this way the behavior will be consistent across all rows/dataset sizes.
2. Add an API parameter to force in-process execution (slow, but perhaps intentional for side effects). Could be "allow_side_effects=False" by default or something.
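
To make the behavior and the two proposed fixes concrete, here is a toy model in plain Python. It is a sketch under stated assumptions, not Turi Create's implementation: toy_apply, infer_in_worker, and INFERENCE_ROWS are hypothetical names, allow_side_effects is only the parameter name floated above, the value 10 is taken from the row count in the original report, and the "fork" start method assumes a Unix-like OS.

```python
import multiprocessing as mp

INFERENCE_ROWS = 10  # assumption: matches the 10 rows seen in the report


def _worker(fn, rows, conn):
    # Executed in the child process; side effects of fn stay in the child.
    conn.send([fn(r) for r in rows])
    conn.close()


def _run_out_of_process(fn, rows):
    ctx = mp.get_context("fork")  # assumption: Unix-like OS
    parent_conn, child_conn = ctx.Pipe()
    child = ctx.Process(target=_worker, args=(fn, rows, child_conn))
    child.start()
    out = parent_conn.recv()
    child.join()
    return out


def toy_apply(rows, fn, infer_in_worker=False, allow_side_effects=False):
    """Toy stand-in for apply(); all parameter names are hypothetical."""
    rows = list(rows)
    if allow_side_effects:
        # Fix 2: run everything in-process, exactly once per row.
        return [fn(r) for r in rows]

    sample = rows[:INFERENCE_ROWS]
    if infer_in_worker:
        # Fix 1: the inference pass also runs out of process.
        _run_out_of_process(fn, sample)
    else:
        # Behavior described above: inference runs in the caller, so side
        # effects for the first rows leak into the calling process.
        # (A real engine would derive the output dtype from these values.)
        [fn(r) for r in sample]

    # The real pass over all rows always runs out of process.
    return _run_out_of_process(fn, rows)


if __name__ == "__main__":
    seen = []

    def process(x):
        seen.append(x)
        return x + 100

    toy_apply(range(20), process)
    print(len(seen))  # 10: only the in-process inference pass is visible
```

In this model the default path reproduces the reported count of 10 out of 20, infer_in_worker=True leaves the caller's list empty for every dataset size, and allow_side_effects=True records all 20 rows at the cost of running serially in the caller.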

drewfrank commented 5 years ago

FWIW, Tim just ran into this issue. My first reaction was, of course, that it must be a bug in his code 😛 . IMO this is quite surprising -- there's nothing in the documentation that says side effects result in weird behavior. Even if type inference is done in a separate process, running the lambda more than once could result in very unexpected behavior.
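
One user-side way to sidestep both problems (side effects lost across processes, and the function possibly running more than once per row) is to keep the applied function pure and return whatever it would otherwise have recorded. The sketch below uses a plain list of dicts as a stand-in for SFrame rows, so it is an illustration of the pattern rather than a tested Turi Create recipe; the "value" and "seen" keys are hypothetical.

```python
def process(row):
    # Pure function: no shared state is touched. The data that would have
    # been appended to a global list is returned alongside the new value.
    return {"value": row["X1"] + 100, "seen": dict(row)}


rows = [{"X1": i} for i in range(20)]  # stand-in for SFrame rows

# Simulate an extra pass over the first rows (e.g. for type inference)
# followed by the full pass. Because process() is pure, the repeated
# calls change nothing.
first_pass = [process(r) for r in rows[:10]]
results = [process(r) for r in rows]

values = [r["value"] for r in results]
applied_rows = [r["seen"] for r in results]

print(len(applied_rows))  # 20
```

Since the recorded data travels in the return value, it does not matter where or how many times the function is executed.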

apple / turicreate

SFrame.apply() methods with side-effects not working as expected #692