ODM2 / ODM2PythonAPI

A set of Python functions that provides data read/write access to an ODM2 database by leveraging SQLAlchemy.
http://odm2.github.io/ODM2PythonAPI/
BSD 3-Clause "New" or "Revised" License
4 stars 13 forks source link

Improve performance for getSamplingFeatureDatasets #144

Open Elijahwalkerwest opened 6 years ago

Elijahwalkerwest commented 6 years ago

I have already utilized eager loading for the SQLALchemy query in the endpoint, but due to the nature of the data needed to be gathered, the current structure is incredibly slow as more Samplingfeatures are queried for. Need to make it usable for a query of all SamplingFeatures of a given Type.

horsburgh commented 6 years ago

@emiliom and @lsetiawan - this is the issue I was referring to on the phone today. I'm not sure it should be titled "improve performance", but rather may require a rethinking of the current approach.

emiliom commented 6 years ago

Thanks @Elijahwalkerwest and @horsburgh

@lsetiawan and I may or may not have time this week (after today) to give input. But if you can point us to the branch and code where @Elijahwalkerwest's current / latest implementation can be seen, that'll be helpful. Again, no promises!

Elijahwalkerwest commented 6 years ago

The most recent work should be in the development branch. I've been working on a few iterations to try and resolve this but nothing that has worked thus far, and so isn't up on github yet.

Added by Emilio, for convenience: https://github.com/ODM2/ODM2PythonAPI/blob/development/odm2api/ODM2/services/readService.py#L969

emiliom commented 6 years ago

Don and I have taken a quick look. Unfortunately we won't have time to help out on this possibly through next week (we have a hands-on workshop late next week that I'm co-organizing).

In the meantime, two things come to mind:

Elijahwalkerwest commented 6 years ago

result = self._session_factory.engine.execute("SELECT * FROM odm2.samplingfeatures as SF\ LEFT JOIN odm2.results as R on R.FeatureActionID in (\ SELECT FeatureActionID\ FROM odm2.featureactions as FA\ WHERE FA.SamplingFeatureID = SF.SamplingFeatureID\ )\ LEFT JOIN odm2.datasets as DS on DS.DatasetID in (\ SELECT DatasetID\ FROM odm2.datasetsresults as DR\ WHERE R.ResultID = DR.ResultID\ )\ WHERE SF.SamplingFeatureID=sfid")

Got SQLAlchemy to run raw query, this is the query I'm using. This look right to you guys?

Elijahwalkerwest commented 6 years ago

Updated version of this SQL query that is currently working, but is VERY SLOW.

Also I'm not sure if it's getting all the data needed. Here is the data that is needed for that endpoint.

DataSetID, DataSetTitle, DataSetAbstract, ResultTypeCV, SampledMediumCV, VariableCode, VariableNameCV, startDate: minDate, endDate: maxDate, siteType: SiteTypeCV, latitude:samplingFeature.related_features.Latitude, longitude:samplingFeature.related_features.Longitude SamplingFeatureCode, SamplingFeatureName,

result = self._session_factory.engine.execute("SELECT * FROM odm2.samplingfeatures as SF\ LEFT JOIN odm2.results as R on R.FeatureActionID in (\ SELECT FeatureActionID\ FROM odm2.featureactions as FA\ WHERE FA.SamplingFeatureID = SF.SamplingFeatureID\ )\ LEFT JOIN odm2.datasets as DS on DS.DatasetID in (\ SELECT DatasetID\ FROM odm2.datasetsresults as DR\ WHERE R.ResultID = DR.ResultID\ )\ WHERE SF.SamplingFeatureID in %s", ( ((sf_list),), ) )

Currently this query is taking ~ 50 seconds PER Sampling feature.