Open nishantnath opened 4 years ago
I think the difference between creating a model with sf1 vs sf_standalone probably makes sense. sf1 has about a thousand more images. The algorithm being used here isn't linear; I believe it's O(n^2). The toolkit needs to compute the distance for every pair of images.
The materialization issue does seem suspicious. How many times have you run this experiment, i.e. how many times have you measured the speed difference between creating a model with a materialized SFrame vs one loaded from disk? Do you get consistent results each time? You say one is 3x faster than the other, but what is the total amount of time to create each? Some variability is to be expected.
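One way to answer the consistency question is to time each variant several times and compare the spread, not just a single run. The sketch below is only an illustration using the Python standard library: `time_runs`, `summarize`, and `placeholder_workload` are hypothetical names, and in a real experiment the placeholder would be replaced by a callable that wraps the model creation being measured.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_runs(fn: Callable[[], object], repeats: int = 5) -> List[float]:
    """Call fn `repeats` times and return the wall-clock duration of each call."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations: List[float]) -> Dict[str, float]:
    """Summarize a list of durations so run-to-run variability is visible."""
    return {
        "runs": len(durations),
        "mean_s": statistics.mean(durations),
        # stdev needs at least two samples; report 0.0 for a single run.
        "stdev_s": statistics.stdev(durations) if len(durations) > 1 else 0.0,
        "min_s": min(durations),
        "max_s": max(durations),
    }


def placeholder_workload() -> int:
    """Hypothetical stand-in for the model creation call being measured."""
    return sum(i * i for i in range(100_000))


if __name__ == "__main__":
    print(summarize(time_runs(placeholder_workload, repeats=5)))
```

If the standard deviation is small relative to the mean for both variants, a 3x gap is unlikely to be noise. Reporting the absolute times alongside the ratio also shows whether the gap matters in practice.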
Makes sense. Even with a 1000-image increase, the extra time actually grows as (n+m)^2 - n^2 = m^2 + 2nm.
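To make that arithmetic concrete, here is a small sketch that assumes the simple n^2 cost model discussed in this thread (the toolkit's real cost may differ). The function names and the sizes n = 10,000 and m = 1,000 are hypothetical and chosen only for illustration.

```python
def quadratic_cost(n: int) -> int:
    """Cost under the assumed n^2 model (one unit per ordered image pair)."""
    return n * n


def added_cost(n: int, m: int) -> int:
    """Extra cost of growing a dataset from n to n + m images."""
    return quadratic_cost(n + m) - quadratic_cost(n)


if __name__ == "__main__":
    n, m = 10_000, 1_000  # hypothetical dataset sizes, not taken from the issue
    extra = added_cost(n, m)
    # The difference matches the closed form m^2 + 2nm.
    assert extra == m * m + 2 * n * m
    print(f"extra cost units: {extra}")
    print(f"relative increase: {extra / quadratic_cost(n):.2%}")
```

Under this model the 2nm term dominates whenever n is much larger than m, so the added cost of a fixed number of extra images keeps growing with the size of the existing dataset.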
Regarding the materialization difference between .materialize() and .load_sframe(): it would be difficult for me to reproduce this effectively given the scale, but I'll give this another try with a smaller dataset after #3210, because I feel this slow-down has more to do with re-computation of deep features than with actual model creation/training time.
@nishantnath - #3210 has been merged and is included in our latest 6.4 release. Please go ahead and check the difference between .materialize() and .load_sframe() that you mentioned.
System/Environment Info (running on GCP):
Scale/Size of dataset (all 224x224 resized jpg, i.e. resnet image size): around 6 million images / 3 TB overall size
RAM: 78GB
CPU: 16
GPU: 1 (Tesla P100 16GB)
Disk: 8TB SSD
Weird behaviors I noticed which weren't an issue on datasets under 200GB:
Append behavior with respect to materialization/disk-persistence
Saving sf2 to disk with sf2.save('path3') followed by reloading with sf2 = tc.load_sframe('path3') makes the similarity model creation almost 3X faster than simply calling .materialize() (which I assume should have persisted the data to tc.config.TURI_CACHE_FILE_LOCATIONS, so the behavior should have been the same?).
Image Similarity Create is slower when using a materialized SFrame
sim_model = tc.image_similarity.create(sf1) is 8X slower than sim_model = tc.image_similarity.create(sf_standalone). It feels like materialization isn't working as expected (although is_materialized() returns True).