apple / turicreate

Turi Create simplifies the development of custom machine learning models.
BSD 3-Clause "New" or "Revised" License

SFrame.apply() methods with side-effects not working as expected #692

Open afranklin opened 6 years ago

afranklin commented 6 years ago

Observed Behavior:

Passing a method with side effects to apply() on an SFrame appears to execute the passed method for only the first 10 rows, even though the SArray returned by apply() has the correct length and correctly updated values.

Perhaps this is because the apply() call is being handled by different sub-processes, so the side effects are not available to the caller? If that’s the case, maybe it’s possible to just include a warning about this behavior in the docs, since this is different from the behavior in Pandas.

Example code below.

Notes:


import turicreate as tc

data = tc.SFrame(range(20))
applied_rows = []  # side-effect target: should collect every row passed to process()

def process(row):
    applied_rows.append(row)  # side effect on the caller's list
    return row['X1'] + 100

modified_data = data.apply(lambda row: process(row))

# Parenthesized single-argument print works on both Python 2 and Python 3.
print("Data length: {}".format(len(data)))
print("Modified Data length: {}".format(len(modified_data)))
print("applied_rows length: {}".format(len(applied_rows)))
print(modified_data)
print(applied_rows)

Data length: 20
Modified Data length: 20
applied_rows length: 10
[100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119]
[{'X1': 0}, {'X1': 1}, {'X1': 2}, {'X1': 3}, {'X1': 4}, {'X1': 5}, {'X1': 6}, {'X1': 7}, {'X1': 8}, {'X1': 9}]

afranklin commented 6 years ago

Sounds like the biggest issue here is that the side effects are inconsistent: the first n rows can have side effects (because the lambda runs in-process for type inference) but subsequent rows cannot (because the lambda runs out of process). Two possible fixes to make this a better UX:

drewfrank commented 5 years ago

FWIW, Tim just ran into this issue. My first reaction was, of course, that it must be a bug in his code 😛. IMO this is quite surprising -- there's nothing in the documentation that says side effects result in weird behavior. Even if type inference is done in a separate process, running the lambda more than once could result in very unexpected behavior.
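
For anyone trying to picture the mechanism discussed in this thread, here is a rough pure-Python sketch. It is not Turi Create's implementation. `simulated_apply`, `RecordingFn`, and `DRY_RUN_ROWS` are hypothetical names made up for illustration, and `copy.deepcopy` merely stands in for a worker process receiving its own copy of the function. The sample size of 10 is taken from the output in the report, not from the library source.

```python
import copy

# Assumption: 10 matches the number of rows seen in the report; the real
# sample size is an implementation detail of the library.
DRY_RUN_ROWS = 10


class RecordingFn:
    """Callable with a side effect: it remembers every row it receives."""

    def __init__(self):
        self.seen = []

    def __call__(self, row):
        self.seen.append(row)
        return row["X1"] + 100


def simulated_apply(rows, fn):
    """Hypothetical stand-in for apply(); not Turi Create's implementation."""
    # Phase 1: dry run in the caller's process to guess the output type.
    # Side effects from these calls are visible to the caller.
    sample = [fn(row) for row in rows[:DRY_RUN_ROWS]]
    out_type = type(sample[0]) if sample else None

    # Phase 2: evaluate every row with a copy of fn, standing in for a
    # worker process that received its own copy. Side effects stay there.
    worker_fn = copy.deepcopy(fn)
    return [out_type(worker_fn(row)) for row in rows]


rows = [{"X1": i} for i in range(20)]
fn = RecordingFn()
result = simulated_apply(rows, fn)

print(len(result))   # 20
print(len(fn.seen))  # 10
```

In this sketch the first 10 rows go through the function twice (once in the dry run, once in the copy), so any function that is not idempotent would also misbehave, not just one that records rows.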