zachdj closed 6 years ago
I found the culprit. The lambda in this line
data = [rdd.map(lambda x: (id, x)) for id, rdd in data.items()]
uses the id variable from the surrounding scope. But closures in Python remember the name and scope of the variable, not the value it's pointing to. So each lambda wound up with the last id in the iterator.
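For anyone who wants to see the effect without Spark, here is a minimal sketch in plain Python. The dict of lists is a hypothetical stand-in for the dict of RDDs, and storing the lambdas before calling them only approximates the way Spark defers the map function until an action runs.

```python
# Hypothetical stand-in for the dict of RDDs: plain lists keyed by document id.
data = {0: ["a"], 1: ["b"], 2: ["c"]}

# Same shape as the line above, but the lambdas are only stored here, not called.
taggers = [(lambda x: (id, x)) for id, rows in data.items()]

# The lambdas run after the comprehension has finished, so they all look up
# the same variable and see its final value.
late_bound = [tag("row") for tag in taggers]
print(late_bound)  # [(2, 'row'), (2, 'row'), (2, 'row')]
```

Every tuple carries the last key, which matches the single distinct id seen in the issue.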
Oh yeah, I remember I ran into something similar in the last project.
It's creating a bunch of lambdas, one for each data item, and the lambdas close over id. But the loop variable id is the same for each iteration of the loop. So all lambdas are closing over the same variable, and thus using the final value.
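Two common ways around this are sketched below in plain Python: binding the current value through a default argument, or building each lambda inside a small factory function. The sample dict and the make_tagger helper are hypothetical, and this is not necessarily the fix that went into the project. Applied to the Spark line, the first option would presumably look like rdd.map(lambda x, id=id: (id, x)), though I have not run that against the real data.

```python
# Hypothetical stand-in for the dict of RDDs: plain lists keyed by document id.
data = {0: ["a"], 1: ["b"], 2: ["c"]}

# Option 1: a default argument is evaluated when the lambda is created,
# so each lambda keeps its own copy of the current id.
fixed_default = [(lambda x, id=id: (id, x)) for id, rows in data.items()]

# Option 2: a factory function gives every lambda its own enclosing scope.
def make_tagger(doc_id):
    return lambda x: (doc_id, x)

fixed_factory = [make_tagger(id) for id, rows in data.items()]

default_result = [tag("row") for tag in fixed_default]
factory_result = [tag("row") for tag in fixed_factory]
print(default_result)  # [(0, 'row'), (1, 'row'), (2, 'row')]
print(factory_result)  # [(0, 'row'), (1, 'row'), (2, 'row')]
```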
Good catch!
It looks like every document is getting the same id when it is loaded from the preprocessor's load_data function. I think the id is the max doc ID. So if there are 379 documents in the manifest file, all of them get an id of 378.
To replicate/demonstrate:
data = preprocess.load_data(ctx, manifest="gs://uga-dsp/project2/files/X_small_train.txt")
data = data.map(lambda x: x[0])
data.distinct().collect() # ==> [378]