XDATA-Year-3 / geoapp

XData GeoApp
Apache License 2.0

Initial support for connection to Elasticsearch. #197

Closed manthey closed 9 years ago

manthey commented 9 years ago

More work will be needed to (a) support 'realtime' data feeds, (b) handle anything other than Instagram data in one particular format, and (c) properly deal with text searches, unless we can set up the Elasticsearch mapping.

Update to the latest version of Girder.

Fix a bug in the 'realtime' Postgres data access: rows added to the system between fetching the first part of the initial set of rows and fetching the subsequent part might not be retrieved.
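For illustration only, one common way to avoid this kind of gap is to record a high-water mark before paging and to fetch everything newer in a later pass. This is a minimal sketch, not necessarily the fix made in this PR. It uses `sqlite3` as a stand-in for Postgres so that it runs on its own, and the `posts` table, its `id` and `body` columns, and both function names are hypothetical. It also assumes ids increase monotonically in commit order, which a Postgres sequence does not guarantee under concurrent transactions.

```python
import sqlite3


def fetch_initial_rows(conn, page_size):
    """Page through every row that existed when the load started."""
    # Record the high-water mark once, before fetching any page, so rows
    # inserted while we are paging belong to the later 'new rows' pass.
    high = conn.execute("SELECT COALESCE(MAX(id), 0) FROM posts").fetchone()[0]
    rows, last = [], 0
    while True:
        page = conn.execute(
            "SELECT id, body FROM posts "
            "WHERE id > ? AND id <= ? ORDER BY id LIMIT ?",
            (last, high, page_size),
        ).fetchall()
        if not page:
            break
        rows.extend(page)
        last = page[-1][0]  # keyset pagination: resume after the last id seen
    return rows, high


def fetch_new_rows(conn, after_id):
    """Return rows added after the high-water mark of the initial load."""
    return conn.execute(
        "SELECT id, body FROM posts WHERE id > ? ORDER BY id", (after_id,)
    ).fetchall()
```

Because the upper bound is fixed before the first page is read, the initial load and the follow-up pass cover disjoint id ranges with nothing in between.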

jeffbaumes commented 9 years ago

Nice. What types of queries are currently possible on elasticsearch data?

manthey commented 9 years ago

Quite a bit. Elasticsearch uses different terminology from other databases, and divides what would be in a SQL WHERE clause into 'filters' and 'queries'. We can search based on ranges (dates, distances, etc.) and on equality. We can get a consistent random sample (it is supported natively in ES). The performance seems very good on the Baltimore dataset.
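As a rough sketch of what such a request might look like, the function below builds a search body that combines a date range, a distance filter, and equality filters, then orders the hits by a seeded random score so the same seed gives the same sample. The function name and every field name passed to it are hypothetical. The body uses the `bool` query with a `filter` clause from later Elasticsearch releases; the 1.x releases expressed the same idea with a `filtered` query, and newer releases may also expect a `field` alongside `seed`.

```python
def build_sample_query(date_field, start, end, geo_field, center, distance,
                       equals, size, seed):
    """Build a search body: filters only, scored randomly with a fixed seed."""
    filters = [
        # Range filter, here on a date field.
        {"range": {date_field: {"gte": start, "lt": end}}},
        # Distance filter around a lat/lon point.
        {"geo_distance": {"distance": distance, geo_field: center}},
    ]
    # Equality filters, one 'term' clause per field.
    filters += [{"term": {field: value}}
                for field, value in sorted(equals.items())]
    return {
        "size": size,
        "query": {
            "function_score": {
                "query": {"bool": {"filter": filters}},
                # The same seed yields the same pseudo-random ordering.
                "random_score": {"seed": seed},
                # Filter-only queries score 0, so replace the score rather
                # than multiplying by it.
                "boost_mode": "replace",
            }
        },
    }
```

Taking the first `size` hits of this randomly ordered result gives the sample; the filters narrow the candidate set without contributing to the score.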

The one issue I don't know how to resolve easily is full text search. You can turn it on and then any new rows ('documents' in ES terms) are searchable, but existing rows are not. You have to create a new schema ('reindex' in ES terms) with the indices ('mapping') already specified and copy over all of the tables ('types') from the existing schema.

I don't know what backing hardware the ES instance I am accessing has, but it is definitely a cluster of several machines. The Baltimore data set is actually a subset of the data within the schema I was accessing. Each 'document' is assigned a type. I pulled all of one schema, rather than a specific table.

I think it is possible to make the queries I need for real-time update with the existing mapping, but I haven't worked out how to do it.

manthey commented 9 years ago

On further investigation, I don't see any methods for getting data ingested since a particular time or _id. It might be possible if the _timestamp mapping is turned on (with store = true) for an index and _type. Otherwise, I'll have to fake it by doing work on the server end of things (or loading the client excessively).
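If the _timestamp mapping were available, the request might look roughly like the sketch below. It only builds the request bodies and has not been tried against the instance discussed here. Both function names are hypothetical, and the `_timestamp` meta field is specific to older Elasticsearch releases (it was deprecated in 2.0 and later removed).

```python
def timestamp_mapping(doc_type):
    """Mapping fragment that turns on the stored _timestamp meta field."""
    return {doc_type: {"_timestamp": {"enabled": True, "store": True}}}


def ingested_since_body(since_ms, size):
    """Search body for documents indexed after since_ms (epoch millis)."""
    return {
        "size": size,
        # Oldest first, so the caller can advance its cursor to the last hit.
        "sort": [{"_timestamp": {"order": "asc"}}],
        "query": {"range": {"_timestamp": {"gt": since_ms}}},
    }
```

A poller would keep the largest timestamp it has seen and pass it back as `since_ms` on the next request.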