MapofLife / MOL

Integrating information about species distributions in an effort to support global understanding of the world's biodiversity.
http://mol.org
BSD 3-Clause "New" or "Revised" License

loader.py should be idempotent #4

Closed eightysteele closed 12 years ago

eightysteele commented 12 years ago

Running loader.py multiple times duplicates data in PostgreSQL. It needs a command-line option for either updating or ignoring existing rows, based on the provider-collection-filename constraint.
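Something like the following is the kind of option being described; the flag name, choices, and argparse wiring are assumptions for illustration, not the actual loader.py interface:

```python
# Hypothetical sketch of the requested command-line option; names are assumptions.
import argparse

parser = argparse.ArgumentParser(description="Load provider collections into PostgreSQL.")
parser.add_argument(
    "--on-existing",
    choices=["update", "ignore"],
    default="update",
    help="What to do when rows matching the provider-collection-filename "
         "constraint already exist: replace them, or skip the file.",
)
args = parser.parse_args()
```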

gaurav commented 12 years ago

Crap, I thought I'd fixed this. I'll assign this to myself once I'm back "on the job" later tonight.

gaurav commented 12 years ago

Ha, you commented out the line which was deleting the previous provider/collection information (see [this edit](https://github.com/MapofLife/MOL/commit/b6e063376b8059b7ac2ffe1a89fc2863451c45a7#L1L477)). This should be fixed in @d9e22b150949c24 - give it a try when you can and let me know if it works for you.

Note that I'm replacing data at the collection level. I think this is the right level, since I want to allow providers to rename filenames without worrying that the previous filename's layers will stay around in the system somewhere; it also works out well in our hypothetical provider UI, where they'd just have a list of collections, with the option to replace a collection with a new set of files. I also like the idea of a zip file containing a collection+config.yaml being uploaded to our server for immediate inclusion without worrying about uploading all the SHP files, etc.

gaurav commented 12 years ago

Fixed as of @c7d66df. I'll test it thoroughly once we get POST working (issue #13), so we can test this with polygons, or once we get more point-based datasets incorporated. I'll wrap that up later today unless there are higher-priority tasks to work on.

eightysteele commented 12 years ago

> Ha, you commented out the line which was deleting the previous provider/collection information

Oops, FHMP.

> Note that I'm replacing data at the collection level.

Explain more? We still need a provider directory that contains multiple collection directories each with a config file and shapefiles. Right?

gaurav commented 12 years ago

Yup. The way we do idempotency right now is: right before uploading collection X for provider P, we delete any existing rows recorded as coming from collection X and provider P. I just wanted to make sure everybody was aware that we don't do this at the provider level (uploading 'iucn/mammals' won't cause 'iucn/birds' to be deleted) or at the layer level (uploading iucn/mammals/Panthera_tigris will still cause all iucn/mammals records to be deleted). That makes sense to me, but I could be wrong!
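For reference, a minimal sketch of this collection-level delete-then-insert pattern; the table name, column names, and psycopg2 usage are assumptions for illustration, not the actual loader.py code:

```python
# Sketch of the delete-then-reinsert idempotency described above (names are assumptions).
import psycopg2

def replace_collection(conn, provider, collection, rows):
    """Delete any existing rows for this provider/collection, then insert the new ones."""
    with conn, conn.cursor() as cur:
        # Scoped to one collection: other collections from the same provider are untouched.
        cur.execute(
            "DELETE FROM layers WHERE provider = %s AND collection = %s",
            (provider, collection),
        )
        cur.executemany(
            "INSERT INTO layers (provider, collection, filename, geom) "
            "VALUES (%s, %s, %s, %s)",
            [(provider, collection, r["filename"], r["geom"]) for r in rows],
        )
```

Run against the same collection twice, the second call deletes what the first inserted before re-inserting, so the end state is the same.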

gaurav commented 12 years ago

As of @66154cd513a00, we have two datasets, and their existing rows are being deleted correctly when new collections are uploaded to them. Closing this issue.

gaurav commented 12 years ago

Okay, in trying to implement issue #27 (loader.py needs to be able to continue an incomplete download), idempotency is now broken. It appears to be machine-specific (i.e. computer A can properly identify its own uploads, but computer B cannot). This is probably a bug in the way the hash is being generated, but it's hard to test without having two side-by-side computers. I've started upgrading mol.colorado.edu; once this is up, hopefully by Monday, I'll be able to fix this pretty easily.

gaurav commented 12 years ago

Hmm, new theory: maybe the current date is in there somewhere? Maybe it's not per-machine as much as per-date?

gaurav commented 12 years ago

Okay, I've figured out why this is happening: ogr2ogr produces slightly different coordinates when running on different computers, probably because of floating point machinations. So that should be pretty easy to fix: figure out how many significant digits make sense for decimal latitude/longitude (and especially whether we can figure that out from the shapefile somehow), and then truncate or round off the lat/longs to that point.
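A minimal sketch of that rounding-before-hashing idea; the function names and use of MD5 are assumptions for illustration (the precision is parameterized here, and the thread below settles on 6 decimal places):

```python
# Sketch only: round coordinates to a fixed number of decimal places before
# hashing, so ogr2ogr's machine-dependent floating-point noise in the trailing
# digits no longer changes the hash. Function names and MD5 are assumptions.
import hashlib

def coordinate_key(lon, lat, precision=6):
    # "%.*f" gives a stable string like "12.230000" regardless of tiny
    # differences in the last few digits produced on different machines.
    return "%.*f,%.*f" % (precision, lon, precision, lat)

def feature_hash(coords, precision=6):
    normalized = ";".join(coordinate_key(lon, lat, precision) for lon, lat in coords)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()
```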

eightysteele commented 12 years ago

Good find. Maybe truncate to 7?

gaurav commented 12 years ago

I've truncated to 6 decimal places, which seems to work really well -- a lot of the coordinates are only accurate to two decimal places anyway, so - once rounded off - they end up as "12.230000" or whatever. So far, this appears to be working in terms of the hashes. I've started an upload of the mol_rangemaps from the server. If this does work, it should be a lot easier for me to track down the files which are failing and hopefully get mol_rangemaps up soon, maybe even by Sunday.