Improve performance for getSamplingFeatureDatasets

Elijahwalkerwest commented 6 years ago

I have already utilized eager loading for the SQLALchemy query in the endpoint, but due to the nature of the data needed to be gathered, the current structure is incredibly slow as more Samplingfeatures are queried for. Need to make it usable for a query of all SamplingFeatures of a given Type.

horsburgh commented 6 years ago

@emiliom and @lsetiawan - this is the issue I was referring to on the phone today. I'm not sure it should be titled "improve performance", but rather may require a rethinking of the current approach.

emiliom commented 6 years ago

Thanks @Elijahwalkerwest and @horsburgh

@lsetiawan and I may or may not have time this week (after today) to give input. But if you can point us to the branch and code where @Elijahwalkerwest's current / latest implementation can be seen, that'll be helpful. Again, no promises!

Elijahwalkerwest commented 6 years ago

The most recent work should be in the development branch. I've been working on a few iterations to try and resolve this but nothing that has worked thus far, and so isn't up on github yet.

Added by Emilio, for convenience: https://github.com/ODM2/ODM2PythonAPI/blob/development/odm2api/ODM2/services/readService.py#L969

emiliom commented 6 years ago

Don and I have taken a quick look. Unfortunately we won't have time to help out on this possibly through next week (we have a hands-on workshop late next week that I'm co-organizing).

In the meantime, two things come to mind:

If you have an equivalent SQL SELECT statement that you've developed and run directly on the ODM2 database, that is reasonably fast (and I assume not directly translatable to SQLAlchemy queries), please share it here.
In general (and based in part on some quick looks from Don), it seems like getSamplingFeatureDatasets tries to get a ton of information all at once, and the outcome is inevitably large. We wonder if some sort of break up into smaller, complementary functions may be necessary.

Elijahwalkerwest commented 6 years ago

result = self._session_factory.engine.execute("SELECT * FROM odm2.samplingfeatures as SF\ LEFT JOIN odm2.results as R on R.FeatureActionID in (\ SELECT FeatureActionID\ FROM odm2.featureactions as FA\ WHERE FA.SamplingFeatureID = SF.SamplingFeatureID\ )\ LEFT JOIN odm2.datasets as DS on DS.DatasetID in (\ SELECT DatasetID\ FROM odm2.datasetsresults as DR\ WHERE R.ResultID = DR.ResultID\ )\ WHERE SF.SamplingFeatureID=sfid")

Got SQLAlchemy to run raw query, this is the query I'm using. This look right to you guys?

Elijahwalkerwest commented 6 years ago

Updated version of this SQL query that is currently working, but is VERY SLOW.

Also I'm not sure if it's getting all the data needed. Here is the data that is needed for that endpoint.

DataSetID, DataSetTitle, DataSetAbstract, ResultTypeCV, SampledMediumCV, VariableCode, VariableNameCV, startDate: minDate, endDate: maxDate, siteType: SiteTypeCV, latitude:samplingFeature.related_features.Latitude, longitude:samplingFeature.related_features.Longitude SamplingFeatureCode, SamplingFeatureName,

result = self._session_factory.engine.execute("SELECT * FROM odm2.samplingfeatures as SF\ LEFT JOIN odm2.results as R on R.FeatureActionID in (\ SELECT FeatureActionID\ FROM odm2.featureactions as FA\ WHERE FA.SamplingFeatureID = SF.SamplingFeatureID\ )\ LEFT JOIN odm2.datasets as DS on DS.DatasetID in (\ SELECT DatasetID\ FROM odm2.datasetsresults as DR\ WHERE R.ResultID = DR.ResultID\ )\ WHERE SF.SamplingFeatureID in %s", ( ((sf_list),), ) )

Currently this query is taking ~ 50 seconds PER Sampling feature.

ODM2 / ODM2PythonAPI

Improve performance for getSamplingFeatureDatasets #144