dsp-uga / elizabeth

Scalable malware detection
MIT License

Preprocessor load_data function gives every document the same id #4

Closed zachdj closed 6 years ago

zachdj commented 6 years ago

It looks like every document is getting the same id when it is loaded from the preprocessor's load_data function. I think the id is the max doc ID. So if there are 379 documents in the manifest file, all of them get an id of 378.

To replicate/demonstrate:

    data = preprocess.load_data(ctx, manifest="gs://uga-dsp/project2/files/X_small_train.txt")
    data = data.map(lambda x: x[0])
    data.distinct().collect()  # ==> [378]

zachdj commented 6 years ago

I found the culprit. The lambda in this line

    data = [rdd.map(lambda x: (id, x)) for id, rdd in data.items()]

uses the id variable from the surrounding scope. But closures in Python remember the name and scope of the variable, not the value it's pointing to. So each lambda wound up with the last id in the iterator.
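The late-binding behavior described here can be reproduced without Spark. The sketch below is a plain-Python stand-in, not the project's code: the `data` dict, the name `doc_id`, and the `"payload"` argument are made up for illustration. Since `rdd.map` is lazy, the lambdas in the real comprehension presumably ran only after the comprehension had finished, which is the situation this mimics.

```python
# Hypothetical stand-in for the manifest: doc id -> document contents.
data = {0: "a", 1: "b", 2: "c"}

# One lambda per document. Each lambda looks up `doc_id` when it is
# called, not when it is created.
tag_fns = [lambda x: (doc_id, x) for doc_id, contents in data.items()]

# The lambdas are only called here, after the comprehension has finished,
# so every one of them sees the last value that doc_id was bound to.
ids = [fn("payload")[0] for fn in tag_fns]
print(ids)  # [2, 2, 2]
```

All three functions report the last id, which matches the single distinct id seen in the original report.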

cbarrick commented 6 years ago

Oh yeah, I remember I ran into something similar in the last project.

It's creating a bunch of lambdas, one for each data item, and the lambdas close over id. But the loop variable id is a single variable that gets rebound on each iteration, not a fresh variable per iteration. So all the lambdas close over the same variable, and thus use its final value.
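The thread does not record which fix was applied. Two common ways to give each lambda its own copy of the value are sketched below in plain Python, under the assumption that either would carry over to the Spark comprehension (for instance `lambda x, id=id: (id, x)`), which has not been checked against the project. The `data` dict and the helper `make_tagger` are hypothetical.

```python
# Hypothetical stand-in for the manifest: doc id -> document contents.
data = {0: "a", 1: "b", 2: "c"}

# Option 1: a default argument is evaluated when the lambda is created,
# so each lambda stores the doc_id of its own iteration.
bound_default = [lambda x, doc_id=doc_id: (doc_id, x) for doc_id in data]


# Option 2: a factory function. Each call creates a new scope, so the
# returned lambda closes over a variable that is never rebound.
def make_tagger(doc_id):
    return lambda x: (doc_id, x)


bound_factory = [make_tagger(doc_id) for doc_id in data]

ids_default = [fn("payload")[0] for fn in bound_default]
ids_factory = [fn("payload")[0] for fn in bound_factory]
print(ids_default)  # [0, 1, 2]
print(ids_factory)  # [0, 1, 2]
```

The default-argument form is the smaller change to a one-line comprehension. The factory form avoids adding an extra parameter that a caller could override by accident.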

Good catch!