andrewxhill / MOL

The Map of Life
mol.colorado.edu/

Load metadata into PostgreSQL for use by the tiler #118

Open gaurav opened 12 years ago

gaurav commented 12 years ago

We need to load the metadata into PostgreSQL so that the tiler can use it to visualize the layers by their metadata values. This should happen as part of the usual loader.py behavior, and before we upload to Google App Engine, so that App Engine never knows about layers that the tiler doesn't know about.

The simplest way to do it for now would be either (1) a single field in PostgreSQL that stores a JSON document, which we can then search through, or (2) setting up a table based on the fields and characteristics in our Fusion Table and populating it with INSERT, UPDATE, COPY, or pgloader. I'll see which fits better with the way loader.py already works, and implement that now.
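Roughly, option (1) might look like this (a sketch only: the 'metadata' table and its single 'doc' column are made up for illustration):

```python
# Option (1) sketch: one JSON document per layer in a single TEXT column.
# The 'metadata' table and 'doc' column are hypothetical names.
import json
import psycopg2

conn = psycopg2.connect('dbname=mol')
cur = conn.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS metadata (doc TEXT)')
cur.execute('INSERT INTO metadata (doc) VALUES (%s)',
            (json.dumps({'provider': 'iucn', 'collection': 'mammals'}),))
# 'Searching through' the document means pattern-matching on text,
# which is the main drawback of this layout:
cur.execute("SELECT doc FROM metadata WHERE doc LIKE %s", ('%"iucn"%',))
rows = cur.fetchall()
conn.commit()
```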

gaurav commented 12 years ago

I'm mostly done with this. I eventually opted to use COPY, since I didn't have to worry about field mappings as long as the field names in the CSV file and in the database stay in sync, which -- thanks to validation and the new get-db-schema.py script -- shouldn't be too hard.
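Roughly, the COPY looks like this (a minimal sketch; the 'layer_metadata' table name is a placeholder, and the explicit column list built from the CSV header is what keeps the two sides in sync):

```python
# A sketch of the COPY approach, assuming a hypothetical 'layer_metadata'
# table whose column names match the CSV header.
import csv
import psycopg2

def copy_metadata(conn, csv_path):
    """COPY a CSV file into layer_metadata, matching columns by name."""
    with open(csv_path) as f:
        header = next(csv.reader(f))                  # column names from the CSV
        cols = ', '.join('"%s"' % c for c in header)
        f.seek(0)
        cur = conn.cursor()
        # HEADER makes COPY skip the first line; the explicit column
        # list keeps the CSV and the table in sync by name.
        cur.copy_expert(
            'COPY layer_metadata (%s) FROM STDIN WITH CSV HEADER' % cols, f)
    conn.commit()
```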

At the moment, my code (@d1ce6ba) has two flaws:

  1. It adds new data to the table as new rows, so across multiple runs it just adds the same layers over and over again. I think the best way to handle this is to delete all rows referring to a given collection and layer combination (e.g. 'iucn'/'mammals') before we do the update. That way, both 'iucn'/'amphibians' and all 'jetz' layers are unaffected when the 'iucn'/'mammals' layers are updated, but all previous 'iucn'/'mammals' metadata is deleted if the 'mammals' dataset is re-uploaded. We can run the deletion and the COPY in the same transaction (see the sketch after this list), so if the COPY fails for any reason the deletion is never committed and the database stays in its previous state. I think this makes more sense than updating each row individually, but I'd like to know what you think.
  2. It doesn't actually count the number of rows updated. I can't figure out how to get that number out of psycopg2, so I'm wondering whether it's worth it (especially given our TDWG deadlines) to add SELECTs before and after in the same transaction to check for row additions. Otherwise, I could open a low-priority issue for this and leave it for whoever's free next.
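Something like this (a sketch only: the table and column names are placeholders, and the before/after SELECTs stand in for the row count I couldn't get out of psycopg2):

```python
# Delete-then-COPY in one transaction, with SELECT COUNT(*) before and
# after the COPY to measure row additions. 'layer_metadata', 'provider'
# and 'collection' are assumed names.
import psycopg2

def reload_collection(conn, provider, collection, csv_file):
    cur = conn.cursor()
    try:
        # Wipe the old rows for this provider/collection; nothing is
        # committed yet, so a failed COPY leaves the database untouched.
        cur.execute('DELETE FROM layer_metadata '
                    'WHERE provider = %s AND collection = %s',
                    (provider, collection))
        cur.execute('SELECT COUNT(*) FROM layer_metadata')
        before = cur.fetchone()[0]
        cur.copy_expert('COPY layer_metadata FROM STDIN WITH CSV HEADER',
                        csv_file)
        cur.execute('SELECT COUNT(*) FROM layer_metadata')
        after = cur.fetchone()[0]
        conn.commit()
        return after - before  # rows added by the COPY
    except Exception:
        conn.rollback()
        raise
```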
gaurav commented 12 years ago

I just had a discussion with @eightysteele about this, and he suggested:

  1. We add fields for provider, collection and filename to the field specification, and thus to the metadata database, since they'll be useful for checking where the data originated. This will require 'injecting' those fields into config.yaml at some point (since loader.py already knows these values, it seems a shame to force users to fill them in); see the sketch after this list.
  2. Before adding new data, we delete every row with the same provider and collection (since we're basically working at the resolution of a collection). That way, the previous collection is wiped and the updated collection is loaded in its place. Since both steps happen in the same transaction, the "deleted" collection won't actually be deleted unless the new metadata loads successfully.
  3. We add indexing for the indexed fields (they're all TEXT fields for now; I'll sort that out when I fix issue #121).
  4. We won't actually count the number of rows inserted yet, but we'll keep this issue open until that's sorted out (it shouldn't take long, but it's pretty low priority compared to everything else we want to do this week).
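Points 1 and 3 might look roughly like this (all names here are placeholders):

```python
# Sketch of points 1 and 3: append provider/collection/filename to each
# CSV row before the COPY, and index the fields the tiler filters on.
# 'layer_metadata' and the idx_* names are assumptions.
import csv

def inject_fields(in_path, out_path, provider, collection, filename):
    """Append provider/collection/filename to every row, so loader.py
    supplies them instead of forcing users to edit config.yaml."""
    with open(in_path) as src, open(out_path, 'w') as dst:
        reader, writer = csv.reader(src), csv.writer(dst)
        writer.writerow(next(reader) + ['provider', 'collection', 'filename'])
        for row in reader:
            writer.writerow(row + [provider, collection, filename])

def create_indexes(cur, fields):
    # All TEXT for now (see #121), so plain btree indexes will do.
    for field in fields:
        cur.execute('CREATE INDEX idx_layer_metadata_%s '
                    'ON layer_metadata ("%s")' % (field, field))
```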
gaurav commented 12 years ago

As of @ce6c4ab061, loader.py deletes the previous collection before uploading a new one. So we've met every criterion of this issue except indexing and row counts; I'll work on those once the GIS integration happens.