Open Elijahwalkerwest opened 6 years ago
@emiliom and @lsetiawan - this is the issue I was referring to on the phone today. I'm not sure it should be titled "improve performance", but rather may require a rethinking of the current approach.
Thanks @Elijahwalkerwest and @horsburgh
@lsetiawan and I may or may not have time this week (after today) to give input. But if you can point us to the branch and code where @Elijahwalkerwest's current / latest implementation can be seen, that'll be helpful. Again, no promises!
The most recent work should be in the development branch. I've been working on a few iterations to try and resolve this but nothing that has worked thus far, and so isn't up on github yet.
Added by Emilio, for convenience: https://github.com/ODM2/ODM2PythonAPI/blob/development/odm2api/ODM2/services/readService.py#L969
Don and I have taken a quick look. Unfortunately we won't have time to help out on this possibly through next week (we have a hands-on workshop late next week that I'm co-organizing).
In the meantime, two things come to mind:
getSamplingFeatureDatasets
tries to get a ton of information all at once, and the outcome is inevitably large. We wonder if some sort of break up into smaller, complementary functions may be necessary.result = self._session_factory.engine.execute("SELECT * FROM odm2.samplingfeatures as SF\ LEFT JOIN odm2.results as R on R.FeatureActionID in (\ SELECT FeatureActionID\ FROM odm2.featureactions as FA\ WHERE FA.SamplingFeatureID = SF.SamplingFeatureID\ )\ LEFT JOIN odm2.datasets as DS on DS.DatasetID in (\ SELECT DatasetID\ FROM odm2.datasetsresults as DR\ WHERE R.ResultID = DR.ResultID\ )\ WHERE SF.SamplingFeatureID=sfid")
Got SQLAlchemy to run raw query, this is the query I'm using. This look right to you guys?
Updated version of this SQL query that is currently working, but is VERY SLOW.
Also I'm not sure if it's getting all the data needed. Here is the data that is needed for that endpoint.
DataSetID, DataSetTitle, DataSetAbstract, ResultTypeCV, SampledMediumCV, VariableCode, VariableNameCV, startDate: minDate, endDate: maxDate, siteType: SiteTypeCV, latitude:samplingFeature.related_features.Latitude, longitude:samplingFeature.related_features.Longitude SamplingFeatureCode, SamplingFeatureName,
result = self._session_factory.engine.execute("SELECT * FROM odm2.samplingfeatures as SF\ LEFT JOIN odm2.results as R on R.FeatureActionID in (\ SELECT FeatureActionID\ FROM odm2.featureactions as FA\ WHERE FA.SamplingFeatureID = SF.SamplingFeatureID\ )\ LEFT JOIN odm2.datasets as DS on DS.DatasetID in (\ SELECT DatasetID\ FROM odm2.datasetsresults as DR\ WHERE R.ResultID = DR.ResultID\ )\ WHERE SF.SamplingFeatureID in %s", ( ((sf_list),), ) )
Currently this query is taking ~ 50 seconds PER Sampling feature.
I have already utilized eager loading for the SQLALchemy query in the endpoint, but due to the nature of the data needed to be gathered, the current structure is incredibly slow as more Samplingfeatures are queried for. Need to make it usable for a query of all SamplingFeatures of a given Type.