TresAmigosSD / SMV

Spark Modularized View
Apache License 2.0

Don't call walk_packages so many times in DataSetRepo #1513

Closed laneb closed 5 years ago

laneb commented 5 years ago

Continuing my profiling campaign, I checked SmvApp.get_graph_json for hotspots. It turns out that for large projects we spend over 80% of our time in pkgutil.walk_packages, which we call for each module. For a 1000-module project, get_graph_json costs about 90 seconds. Simply caching the result of walk_packages drops this to about 10 seconds. We should be able to cache it because the result shouldn't change within a transaction. Caching may not even be necessary, though: we call walk_packages for each module in order to check whether the file the module should be found in actually exists, but it's not clear why we can't just try to import the module and treat an import error as meaning it doesn't exist.
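
To make the two options concrete, here is a rough sketch of each. It is not the actual DataSetRepo code: `PackageWalkCache`, `module_names`, `clear`, and `try_import` are hypothetical names, and the idea that the cache would be cleared at the start of each transaction is an assumption based on the reasoning above. Only `pkgutil.walk_packages` and `importlib.import_module` are real standard-library calls.

```python
import importlib
import pkgutil


class PackageWalkCache(object):
    """Hypothetical helper: memoize pkgutil.walk_packages results.

    Assumption: the set of modules on disk does not change within a
    transaction, so one walk per (path, prefix) is enough.
    """

    def __init__(self):
        self._names = {}

    def module_names(self, path, prefix=""):
        # path is a list of directories, as walk_packages expects.
        key = (tuple(path), prefix)
        if key not in self._names:
            # Each yielded item is (module_finder, name, ispkg).
            self._names[key] = frozenset(
                info[1] for info in pkgutil.walk_packages(path, prefix)
            )
        return self._names[key]

    def clear(self):
        # Would be called when a new transaction starts (assumption).
        self._names.clear()


def try_import(fqn):
    """Option 2: skip the walk and let the import tell us if it exists.

    Returns the module, or None if the import fails.
    """
    try:
        return importlib.import_module(fqn)
    except ImportError:
        return None
```

One caveat with the second option: catching ImportError also hides the case where the module exists but one of its own imports fails, so the real code would probably need to tell those two cases apart.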