apple / turicreate

Turi Create simplifies the development of custom machine learning models.
BSD 3-Clause "New" or "Revised" License

SFrame Materialization - weird behavior on very large datasets #3232

Open nishantnath opened 4 years ago

nishantnath commented 4 years ago

System/Environment Info (running on GCP):
Dataset: around 6 million images, 3 TB overall (all JPEGs resized to 224x224, i.e. the ResNet image size)
RAM: 78 GB
CPU: 16
GPU: 1 (Tesla P100, 16 GB)
Disk: 8 TB SSD

Weird behaviors I noticed that weren't an issue on datasets under 200 GB:

Append behavior with respect to materialization/disk-persistence

import turicreate as tc

sf1 = tc.load_sframe('path1')  # nearly 6 million images
sf2 = tc.load_images('path2')  # nearly 1000 images
sf2.materialize()
sf1 = sf1.append(sf2)
sf1.materialize()
sim_model = tc.image_similarity.create(sf1)

Saving sf2 to disk with sf2.save('path3') and then reloading it with sf2 = tc.load_sframe('path3') makes the similarity model creation almost 3X faster than simply calling .materialize(). I assumed .materialize() would have persisted the data to tc.config.TURI_CACHE_FILE_LOCATIONS, so shouldn't the behavior be the same?

Image Similarity Create is slower when using a materialized SFrame

sf_standalone = tc.load_sframe('path1')  # nearly 6 million images
sf1 = tc.load_sframe('path1')  # nearly 6 million images
sf2 = tc.load_images('path2')  # nearly 1000 images
sf2.materialize()
sf1 = sf1.append(sf2)
sf1.materialize()

sim_model = tc.image_similarity.create(sf1) is 8X slower than sim_model = tc.image_similarity.create(sf_standalone)

It feels like materialization isn't working as expected (although is_materialized() returns True).

TobyRoseman commented 4 years ago

I think the difference between creating a model with sf1 vs sf_standalone probably makes sense. sf1 has about a thousand more images. The algorithm being used here isn't linear; I believe it's O(n^2). The toolkit needs to compute the distance for every pair of images.

The materialization issue does seem suspicious. How many times have you run this experiment, i.e. how many times have you measured the speed difference between creating a model with a materialized SFrame vs one loaded from disk? Do you get consistent results each time? You say one is 3x faster than the other, but what is the total amount of time to create each? Some variability is to be expected.
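One way to collect the repeated measurements asked about here is a small timing harness. The sketch below is only an illustration: the helper names `time_runs` and `summarize` are hypothetical, and the Turi Create call appears only in a comment because it depends on the reporter's data. It records wall-clock durations for several runs and reports their mean and spread, so a claimed 3x gap can be compared against run-to-run variability.

```python
import statistics
import time


def time_runs(build, repeats=3):
    """Call build() `repeats` times; return wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        build()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return (mean, sample standard deviation) of the measured durations."""
    mean = statistics.mean(durations)
    # A single run has no spread to report, so fall back to 0.0.
    spread = statistics.stdev(durations) if len(durations) > 1 else 0.0
    return mean, spread


# Hypothetical usage with Turi Create (not executed here):
#   import turicreate as tc
#   runs = time_runs(lambda: tc.image_similarity.create(sf1), repeats=3)
#   print(summarize(runs))
```

Later runs in the same process may benefit from OS or library caches, so running each variant in a fresh process is probably the fairer comparison.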

nishantnath commented 4 years ago

Makes sense. Even with a 1000-image increase, the increase in time is actually (n+m)^2 - n^2 = m^2 + 2nm.
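To put numbers on that expression, here is a quick sketch using the approximate counts from the original report (n = 6,000,000 existing images, m = 1,000 appended images). It assumes the cost is exactly proportional to n^2, which is only the belief stated above, and it ignores feature extraction and I/O. The function name `extra_pairwise_cost` is made up for illustration.

```python
def extra_pairwise_cost(n, m):
    """Extra cost of growing from n to n + m items, assuming cost == n**2."""
    return (n + m) ** 2 - n ** 2  # algebraically equal to m**2 + 2*n*m


n = 6_000_000  # approximate image count loaded from 'path1' (from the report)
m = 1_000      # approximate number of appended images (from the report)

extra = extra_pairwise_cost(n, m)
relative = extra / n ** 2  # extra work as a fraction of the original n**2 cost

print(extra)                # 12001000000
print(f"{relative:.4%}")    # 0.0333%
```

Under this purely quadratic model the appended images add about 0.03% more pairwise work, so the actual timings are worth measuring directly.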

Regarding the difference between .materialize() and .load_sframe(): it would be difficult for me to reproduce this effectively given the scale, but I'll give this a try again with a smaller dataset after #3210, because I feel this slowdown has more to do with re-computation of deep features than with the actual model creation/training time.

TobyRoseman commented 4 years ago

@nishantnath - #3210 has been merged and is included in our latest 6.4 release. Please go ahead and check the difference between .materialize() and .load_sframe() that you mentioned.