Kitware / minerva

Minerva: client/server/services for analysis and visualization
Apache License 2.0

Provide Postgres support for storing large and geospatial data #135

Closed aashish24 closed 6 years ago

aashish24 commented 9 years ago

  • We should investigate other databases as well.
  • Need to figure out our relationship to Girder

cpatrick commented 9 years ago

+1 to adding pgsql support to Girder as an alternate model layer. This would be non-trivial to implement and maintain, but would lower the overall burden on projects that use Girder and would benefit from pgsql support.

-Patrick

On Fri, Oct 2, 2015 at 7:53 PM, Aashish Chaudhary notifications@github.com wrote:

  • We should investigate other databases as well.
  • Need to figure out our relationship to Girder

mgrauer commented 9 years ago

The use case needs definition.

We already have geospatial search on Girder items in Mongo through the geospatial plugin. We also have the dataservices work in Minerva, which includes a PostGIS DB that we can interact with through a GeoServer search API.
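
For concreteness, a geospatial query against items in Mongo, of the kind the plugin enables, might look like this sketch (it goes through pymongo directly rather than the plugin's REST endpoints, and the metadata field name is invented for illustration):

import pymongo

items = pymongo.MongoClient('localhost', 27017)['girder']['item']
# A 2dsphere index over GeoJSON stored in item metadata enables
# $geoIntersects / $geoWithin queries.
items.create_index([('meta.geojson', pymongo.GEOSPHERE)])
cursor = items.find({'meta.geojson': {'$geoIntersects': {
    '$geometry': {'type': 'Point', 'coordinates': [-72.0, 42.3601]}}}})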

What specific use case do we have that cannot be supported by one of these?

Is the large geospatial data expected to be a source that is external to a Minerva source, similar to our current Mongo collection source? I.e., we have a Minerva source which is a Girder item, and that item has metadata about how to connect to the datastore, but we do not track all of the data in the source within Girder. If this is the case, what operations do we want to support, and why can't this be done via geoserver?

manthey commented 8 years ago

I've been tasked with adding this feature, and I want to make sure that whatever I pick will satisfy various people's requirements. There have been discussions via email and meatspace, but I'd like to move that here so we have a good record of what gets decided. People involved include @aashish24, @zachmullen, @danlamanna, @mgrauer, @mbertrand, @kotfic.

I don't think this is a Minerva-specific task, so it would be good if we can satisfy wider needs.

I think this should be a girder plugin. I'm open to suggestions about which repo it should reside in (probably its own under an existing github organization).

@aashish24 said he would like to see a postgres database as a collection with each table as an item (presumably within a folder within that collection). Since a folder is necessary to contain items in a collection, perhaps a better model would be to configure a postgres database into a folder (which could be in a collection or not). When would we check for new tables or dropped tables? Or is this only done at configuration time?

In the geoapp, I create an endpoint per table with standard database options: limit, offset, sort field, sort direction, returned fields. A query can produce a where clause matching any number of restrictions (ANDed together). Each column in the table can be matched to an exact value. Numeric and date columns can be restricted to one or both ends of a range. Strings can be matched using stemming and lexical grouping.

This would be easy to extend to read the table's format and autogenerate REST endpoints based on that. If we want to include special queries for PostGIS datatypes, we can do that.
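
As a minimal sketch of that introspection (assuming a plain psycopg2 connection; the helper is invented for illustration):

import psycopg2

def get_columns(conn, table):
    # Read column names and types from information_schema so that filter
    # parameters and endpoints can be generated per table automatically.
    cur = conn.cursor()
    cur.execute(
        'SELECT column_name, data_type FROM information_schema.columns '
        'WHERE table_name = %s', (table,))
    return dict(cur.fetchall())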

The virtue of this technique is that it is easy to extend to access other databases (Mongo and Elasticsearch would need more configuration, as there is no distinct concept of columns).

Alternately, if the goal is to have the full ability to do anything that Postgresql supports, we should just use PostgREST or something similar.

There could also be a new model (a superset of the item model) that would expose python functions for database access.

I could see a girder plugin with configuration options allowing a list of database connections.

There would be new endpoints:

  • GET database/(item id)/find -- get data from a table
  • GET database/(database name) -- get a list of items/tables associated with the db

We might also want PUT database/(database name)/refresh -- refresh the tables associated with the db

I could see it being useful to have the endpoints under the item (for instance at item/(item id)/database/find) instead.
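
For illustration, exercising the proposed find endpoint could look like this (the base URL, item id, and parameter names are hypothetical at this stage):

import requests

base = 'http://localhost:8080/api/v1'  # a local girder instance
item_id = '...'  # id of the item mapped to a table
# Fetch the first 50 rows of the table behind the item, sorted by one column.
resp = requests.get('{}/database/{}/find'.format(base, item_id),
                    params={'limit': 50, 'offset': 0, 'sort': 'town'})
rows = resp.json()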

Feedback is welcome.

zachmullen commented 8 years ago

In the geoapp, I create an endpoint per table with standard database options: limit, offset, sort field, sort direction, returned fields. A query can produce a where clause matching any number of restrictions (ANDed together). Each column in the table can be matched to an exact value. Numeric and date columns can be restricted to one or both ends of a range. Strings can be matched using stemming and lexical grouping.

I certainly prefer this approach as a general solution. It might not meet @aashish24's needs for GUMBO in particular, so we should consider that, but in general we don't want to expose arbitrary database connections to end users, as girder would become a botnet very quickly. With your approach, we at least control the database connections, though malicious users could still construct queries designed to cripple the availability of the backend.

To me, the truly generally useful (and secure!) part of this is simply in exposing utilities for girder+pgsql integration. That ends up being a pretty minimal layer, but I could see us reusing such utilities across many projects.

danlamanna commented 8 years ago

+1 to adding pgsql support to Girder as an alternate model layer. This would be non-trivial to implement and maintain, but would lower the overall burden on projects that use Girder and would benefit from pgsql support.

@cpatrick I've actually done a little bit of hacking with SQLAlchemy on the side to support this. I didn't go any further than creating an alternate model layer and changing the AssetStore model to use Postgres instead of Mongo, but I agree it would be non-trivial to do for the entire system.

I think the larger question, though, is: what is Mongo bringing Girder that Postgres couldn't? It seems like people want to use features like PostGIS and FTS, for which Mongo's implementations just aren't as good.

On this task, it seems like @mgrauer is right that it needs quite a bit of fleshing out in order to provide legitimate feedback. Some research time might be good, to see if we can use logic and/or ideas from PostgREST and others.

zachmullen commented 8 years ago

I think the larger question, though, is: what is Mongo bringing Girder that Postgres couldn't?

Complete schemalessness. We don't even have to declare what tables (collections) we are using until runtime. It just alleviates the need for database scripts and schema upgrade/downgrade infrastructure. It also makes it extremely easy for plugins to modify core collections since indexes can be declared at runtime.
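
As a concrete illustration of that runtime declaration (a sketch using Girder's model layer; the model name and fields are invented):

from girder.models.model_base import Model

class Observation(Model):
    def initialize(self):
        # No migration scripts: the collection and its indexes are
        # declared when the model is first loaded.
        self.name = 'observation'
        self.ensureIndices(['created', 'meta.station'])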

kotfic commented 8 years ago

I included these in an earlier thread; just for completeness, I am going to put them back here. Please see:

https://wiki.postgresql.org/wiki/HTTP_API for a proposal that illustrates some of the complexity of this problem. Also, while we probably can't use PostgREST (https://github.com/begriffs/postgrest) because it is written in Haskell, it is probably worth taking a look before we dive straight into designing another tool from scratch.

/Chris

On Tue, Feb 16, 2016 at 11:53 AM, Zach Mullen notifications@github.com wrote:

I think the larger question, though, is: what is Mongo bringing Girder that Postgres couldn't?

Complete schemalessness. We don't even have to declare what tables (collections) we are using until runtime. It just alleviates the need for database scripts and schema upgrade/downgrade infrastructure. It also makes it extremely easy for plugins to modify core collections since indexes can be declared at runtime.

aashish24 commented 8 years ago

I certainly prefer this approach as a general solution. It might not meet @aashish24's needs for GUMBO in particular, so we should consider that, but in general we don't want to expose arbitrary database connections to end users, as girder would become a botnet very quickly. With your approach, we at least control the database connections, though malicious users could still construct queries designed to cripple the availability of the backend.

Does it sound plausible to expose information on which backend storage (Mongo, Postgres) is used to store the data? If we know that, the server-side implementation can provide specific queries in the native backend SQL language for some complex geospatial queries on a folder or on an item. I understand that it raises security concerns, but unless you have some sort of layer like Oracle has (views on the DB), I don't know how we can prevent bad code deployed on the server from running.

manthey commented 8 years ago

For a Postgres database using psycopg2, we know the data type of each column. Simple data types lend themselves to generalized queries very easily. For instance, GET /item/{id}/database/select?distance_min=4&distance_max=8&caption_regex=(fast|slow).
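
A minimal sketch of how such parameters might be turned into a parameterized where clause (the suffix conventions and the column_kinds mapping are invented for illustration):

def build_where(params, column_kinds):
    # column_kinds maps column name -> 'number' or 'string', as reported
    # by the database; params come from the request's query string.
    clauses, values = [], []
    for key, value in params.items():
        if key.endswith('_min') or key.endswith('_max'):
            col, op = key[:-4], '>=' if key.endswith('_min') else '<='
            clauses.append('"{}" {} %s'.format(col, op))
            # Cast the comparison value to match the column's datatype.
            values.append(float(value) if column_kinds[col] == 'number' else value)
        elif key.endswith('_regex'):
            # Postgres regular expression match.
            clauses.append('"{}" ~ %s'.format(key[:-6]))
            values.append(value)
    return ' AND '.join(clauses), values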

It is less obvious to me how we would want to expose PostGIS data types (and even the built-in geometric datatypes). For instance, if we call GET /item/{id}/database/select and specify a column that is of the geometry datatype, it is probably more useful to apply some ST_* function to that column than to perform a direct comparison.

For someone who is actively using the PostGIS functions, can you think of how you would like this exposed via a REST call? For instance, I could easily see wanting to make a query like SELECT town FROM towns WHERE ST_Intersects(ST_SetSRID(ST_MakePoint(-72, 42.3601), 4326), ST_Transform(geom, 4326));, where town and geom are columns in the table. It seems like this isn't particularly generalizable, as you need knowledge of PostGIS to construct such a query, and allowing general SQL makes it very hard to prevent injection attacks.

Thoughts, anyone?

mbertrand commented 8 years ago

For a query like:

SELECT town FROM towns WHERE ST_Intersects(ST_SetSRID(ST_MakePoint(-72, 42.3601), 4326), ST_Transform(geom, 4326));

How about something like GET /item/{id}/database/select?intersects=Point(-72 42.3601)&columns=town

And the code would be responsible for forming the correct query, something like this:

import psycopg2

def create_intersect_query(table, column, intersects_feature, epsg=4326):
    # Table and column names are identifiers, not values, so they cannot be
    # bound as query parameters; validate them against the table's known
    # schema before formatting them in.
    intersect_query = (
        'SELECT "{column}" FROM "{table}" WHERE ST_Intersects('
        'ST_SetSRID(ST_GeomFromText(%s), {epsg}), '
        'ST_Transform(the_geom, {epsg}));'
    ).format(column=column, table=table, epsg=epsg)
    conn = psycopg2.connect('...')  # connection details elided
    try:
        c = conn.cursor()
        # Only the geometry WKT is passed as a bound parameter.
        c.execute(intersect_query, (intersects_feature,))
        return c.fetchall()
    finally:
        conn.close()

This approach won't be able to handle every possible query that a user might want to make, but it helps avoid injection attacks and doesn't require the user to have deep knowledge of SQL.

I'm thinking of using a similar approach for Gaia.

-Matt

mbertrand commented 8 years ago

Here's a sample of the approach I've been taking with gaia, interested to hear what you think of it:

import psycopg2

# GaiaIO, pg_query, and get_connection_string are defined elsewhere in gaia.
class PostgisIO(GaiaIO):

    def get_table_info(self):
        # Look up the geometry column and SRID for this table from PostGIS.
        query = 'SELECT f_geometry_column, srid from geometry_columns ' \
                'WHERE f_table_name = %s'
        return self.pg_query(query, [self.table])[0]

    def get_query(self):
        self.geom_column, self.epsg = self.get_table_info()
        if self.columns[0] == '*':
            columns = self.columns[0]
        else:
            if self.geom_column not in self.columns:
                self.columns.append(self.geom_column)
            columns = ','.join(
                ['"{}"'.format(x) for x in self.columns])
        query = 'SELECT {} from "{}"'.format(columns, self.table)
        filter_params = []
        if self.filter:
            filter_sql, filter_params = filter_postgis(self.filter)
            query += ' WHERE {}'.format(filter_sql)
        query += ';'
        return query, filter_params


def filter_postgis(filters):
    """
    Generate a SQL statement to be used as a WHERE clause
    :param filters: list of filters in the form of (attribute, operator, values),
    for example [('city', 'in', ['Boston', 'New York']), ('id', '>', 10)]
    :return: SQL string and list of parameters
    """
    sql_filters = None
    sql_params = []
    for filter in filters:
        attribute, operator, values = filter
        if type(values) in (list, tuple):
            # Expand list values into a parenthesized group of placeholders,
            # e.g. "city" in (%s,%s).
            sql_filter = '"{}" {} ('.format(attribute, operator) + ','.join(
                ['%s' for x in values]) + ')'
            sql_params.extend(values)
        else:
            sql_filter = '"{}" {} %s'.format(attribute, operator)
            sql_params.append(values)
        if not sql_filters:
            sql_filters = sql_filter
        else:
            sql_filters = sql_filters + ' AND ' + sql_filter
    return sql_filters, sql_params


pgio = PostgisIO(
    table='towns',
    columns=['town', 'pop2010'],
    filter=[
        ('type', 'in', ['C', 'TC']),
        ('town', '=', 'NORTH ADAMS')
    ]
)
query, params = pgio.get_query()
print query, params
conn = psycopg2.connect(pgio.get_connection_string())
try:
    cur = conn.cursor()
    cur.execute(query, params)
    print cur.fetchall()
finally:
    conn.close()

manthey commented 8 years ago

I've created a new repository here with my initial work on this. It doesn't (yet) have any sort of UI, and doesn't (yet) support using functions as part of the queries. There are probably other issues still that will be resolved as it gets developed.

This inspects the table to determine the available columns and their basic datatypes. I've based some of the filtering on @mbertrand's example.

It includes some code from geoapp that can poll databases for changes, though I don't think there is any application for that here yet.

aashish24 commented 8 years ago

Thanks @manthey we will have a look at it.

kotfic commented 8 years ago

In terms of the database introspection, is there a reason to implement (and eventually support) this ourselves rather than relying on one of the hardened implementations that already exist (e.g. SQLAlchemy)?

manthey commented 8 years ago

I'm not sure what SQLAlchemy would gain us in this case.

Postgres reports whether, for instance, columns are fundamentally numeric or string or something else. SQLAlchemy doesn't expose this -- you get the data type (such as "VARCHAR()" or "INTEGER()" or "NUMERIC(precision=10, scale=0)"), but then we'd have to figure out what the underlying datatype of each of those is if we wish to handle them differently. If we were using something other than Postgres, you'd have to know that db's datatypes separately.
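
A minimal sketch of that figuring-out, testing reflected types against SQLAlchemy's generic type hierarchy (the DSN and table name are placeholders):

from sqlalchemy import create_engine, inspect, types

engine = create_engine('postgresql://user:password@host/dbname')
inspector = inspect(engine)
kinds = {}
for col in inspector.get_columns('towns'):
    # Dialect-specific types subclass the generic ones, so isinstance
    # checks can recover the fundamental category.
    if isinstance(col['type'], (types.Numeric, types.Integer)):
        kinds[col['name']] = 'number'
    elif isinstance(col['type'], types.String):
        kinds[col['name']] = 'string'
    else:
        kinds[col['name']] = 'other'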

kotfic commented 8 years ago

SQLAlchemy gains us well-documented, well-tested database reflection and querying that we're not obligated to maintain. Can you expand on what you mean by "if we wish to handle them differently"? In what cases would knowing the Postgres-specific datatypes be necessary?

manthey commented 8 years ago

If you know a datatype is numeric, then you know that comparisons should be done on numeric data types (4 < 10, but "4" > "10"). Depending on how datatypes are exposed, it might be nice if a comparison value is converted appropriately.

I'm not opposed to using SQLAlchemy, though I find it somewhat cumbersome. I can't tell from the documentation, for instance, if it will cast a numeric column to match a text comparison value, or the text value to match the numeric column, or just pass both to the underlying sql engine and hope for the best.

kotfic commented 8 years ago

Maybe this answers your questions?

aashish24 commented 8 years ago

@manthey the API looks great for the first part we discussed. How should I try it out? Part 2 of this would be to support custom functions / queries, as discussed.

manthey commented 8 years ago

@aashish24 sure, give it a try. There is no UI for setting up the database access yet -- you have to set the appropriate data on the POST item/{id}/database endpoint. Also, there are no tests (or, at least, no useful tests).

I could see, if we were ambitious, having a UI section on the item's page that would let you construct a query via a web form and get back a table of data.

Regarding custom functions: generically, it would be nice if we can support all non-volatile functions. I know how to query Postgres to get such a list (but I haven't found where it is exposed in SQLAlchemy). As an alternative to limiting functions to those that are non-volatile (or, perhaps, in addition to that), we can make sure all sessions with the database are read-only (but this isn't the same as not calling volatile functions, since things like setseed() can be called, which change the state of the random number generator).
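
A sketch of that catalog query (connection details are placeholders):

import psycopg2

conn = psycopg2.connect('dbname=mydb user=girder')  # placeholder connection
cur = conn.cursor()
# provolatile is 'i' (immutable), 's' (stable), or 'v' (volatile); only
# non-volatile functions are candidates for whitelisting.
cur.execute("SELECT proname FROM pg_proc WHERE provolatile IN ('i', 's')")
allowed_functions = set(row[0] for row in cur.fetchall())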

kotfic commented 8 years ago

@manthey you have probably already seen this, but you can query the database directly as needed.

manthey commented 8 years ago

@aashish24 the latest check-in supports functions. I'll add tests next. @kotfic I switched to using SQLAlchemy. I still have to reach into Postgres's internal tables to determine which functions should be allowed, as SQLAlchemy doesn't expose that, but otherwise everything is going through SQLAlchemy's ORM methods. This means that we can use any database SQLAlchemy supports (though we only support functions in queries on Postgres at the moment).
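
For illustration, the PostGIS query from earlier in the thread, routed through SQLAlchemy, might look like this sketch (the DSN is a placeholder; it assumes the towns table with town and geom columns):

from sqlalchemy import MetaData, Table, create_engine, func, select

engine = create_engine('postgresql://user:password@host/dbname')
towns = Table('towns', MetaData(), autoload=True, autoload_with=engine)
# func.* generates arbitrary SQL function calls, so PostGIS functions need
# no special support; literal arguments are bound as parameters.
query = select([towns.c.town]).where(func.ST_Intersects(
    func.ST_SetSRID(func.ST_MakePoint(-72, 42.3601), 4326),
    func.ST_Transform(towns.c.geom, 4326)))
with engine.connect() as conn:
    results = conn.execute(query).fetchall()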

matthewma7 commented 6 years ago

The feature has been implemented much as @manthey described above. The Girder server now talks to a Postgres database defined through database_assetstore and visualizes geospatial or tabular data from Postgres. https://github.com/Kitware/minerva/pull/410