andrewxhill / MOL

The Map of Life
mol.colorado.edu/
19 stars 4 forks source link

Bulkloaded sample layer data #103

Closed tucotuco closed 13 years ago

tucotuco commented 13 years ago

The 'scientificname' field in the sample data (from andrewxhill/MOL/master) is '99.00000000000' after loading to App Engine. Is this true to the sample data? A configuration problem? Or a bug in loader.py?

eightysteele commented 13 years ago

Is this true to the sample data? A configuration problem? Or a bug in loader.py?

Yeah, looks like a configuration problem maybe? Take a look at the sample config.yaml file on line 82: https://github.com/andrewxhill/MOL/blob/master/workflow/mol-data/jetz/config.yaml#L82

It's mapping scientificname to the BIOME DBF field.

eightysteele commented 13 years ago

Reassigned to @gaurav

gaurav commented 13 years ago

Well, that was easy: once I removed the incorrect mapping to BIOME, loader.py correctly picked up the scientificname from jetz/config.yaml's Collections: Required section. Committed as @1f8169ce6bec17387e11f2d6a986237e773ece00 and closing this issue.

tucotuco commented 13 years ago

Gaurav closed an Issue! Ex! Good times ahead!

eightysteele commented 13 years ago

The scientificname DBF column is actually a required field, so we don't want to delete it.

The loader.py script currently calculates the required and optional DBF fields using the source config.yaml file. Instead, it should calculate the required and optional fields from the Fusion Table. That way, if the source config.yaml file is incorrect (like the one above after deleting the scientificname), the loader.py script can detect it. Does this make sense?

Re-opening issue.

gaurav commented 13 years ago

Ah, okay! I assumed that the Collections: Required section in config.yaml were default values for individual columns in the DBF file (so that you didn't need to enter, say, a country for every row if all the rows in that dataset were from the same country). But instead: config.yaml should contain only layer-level metadata, while the DBF file should contain only shape-level metadata, right?

So to fix this bug completely, I also need to check loader.py's validation code and figure out why that script isn't correctly identifying required/optional fields from MOLSourceFields.csv and MOLSourceDBFfields.csv (at the moment). This also raises another question: what happens if the layer and shape metadata disagree, if - say - config.yaml says that all the records are from the USA, but there are records in the DBF file from Canada?

eightysteele commented 13 years ago

config.yaml should contain only layer-level metadata, while the DBF file should contain only shape-level metadata, right?

Lesse, so properties in Collections:Required should include name values from the Fusion Table (http://www.google.com/fusiontables/DataSource?dsrcid=1330559) where type = source AND required = y. Similarly, properties in Collections:Optional should include name columns where type = source AND required != y.

Then properties in Collections:DBFMapping:Required should include name values where type=dbf AND required=y and properties in Collection:DBFMappings:Optional should include name values where type = dbf AND required != y.

So to fix this bug completely, I also need to check loader.py's validation code and figure out why that script isn't correctly identifying required/optional fields from MOLSourceFields.csv and MOLSourceDBFfields.csv (at the moment).

So check out the Collection class in loader.py. It has methods like get_columns and get_mapping that should be returning properties from the Fusion Table instead of from the self.collection which is constructed from per-source config.yaml files. Does that make sense?

This also raises another question: what happens if the layer and shape metadata disagree, if - say - config.yaml says that all the records are from the USA, but there are records in the DBF file from Canada?

Totally. We'll have to figure out a more complete set of validation rules that must hold in order for loader.py to complete. I'm hoping that you and @tucotuco will discover and code for these rules through the testing process.

gaurav commented 13 years ago

Okay, this is starting to make sense to me! I'll get cracking on modifying loader.py tomorrow. Another quick question: does this mean workflow/metagen.py will also be deprecated in the future, or does it have another function which we will need to modify so that it accesses Google Fusion Tables directly?

eightysteele commented 13 years ago

Another quick question: does this mean workflow/metagen.py will also be deprecated in the future, or does it have another function which we will need to modify so that it accesses Google Fusion Tables directly?

Yeah, IMO, definitely want to modify metagen.py to hit FT directly. It's just super handy to have that script since it generates a bunch of code that changes when the FT changes.

gaurav commented 13 years ago

Quick question: do we need to check whether 'optional' fields are always present? I can't imagine it'd be a problem if a Collection: Optional field is missing from config.yaml (since it is optional anyway), but if a Collections:DBFMapping:Optional field is missing, it's possible we'll lose information if (for whatever reason) that field is present in the DBF file.

I'm pretty sure the answer is "no, optional fields can be missing anywhere without failing validation", but I thought I'd check.

eightysteele commented 13 years ago

I'm pretty sure the answer is "no, optional fields can be missing anywhere without failing validation"

+1

gaurav commented 13 years ago

Okay, I'm basically done with this: you can take a sneak peek at https://github.com/gaurav/MapOfLife if you want. I need to rewrite all the current configuration files so they match up with the latest spec on Fusion Tables, and then I'll be ready to push to the master. I'll have that done tomorrow. It sounds like you two have fixed up most of the Unicode issues already, so I'll integrate and test that over the weekend as well.

Edit: mind you, all of this is just checking that the fields exist (or that fields not specified in the specification aren't used). I don't know if we want to check types on those fields as well; if so, let's open a different issue for that.

eightysteele commented 13 years ago

I need to rewrite all the current configuration files so they match up with the latest spec on Fusion Tables

I think metagen.py might do this for you?

all of this is just checking that the fields exist (or that fields not specified in the specification aren't used). I don't know if we want to check types on those fields as well; if so, let's open a different issue for that.

What you thinking @tucotuco?

tucotuco commented 13 years ago

I think there there will be a whole lot of validation work when the controlled vocabularies come into play. Maybe wait for that and do it all together? At least wait until we've done a review and testing following Walter's latest changes?

gaurav commented 13 years ago

Okay, I think I've gotten the check-field-names-from-Fusion-Table feature in as of @c51e27beb66f. I've also modified metagen.py so that it uses the Fusion Table instead of the local CSV files when generating its output. Unfortunately, LayerIndex bulkload has stalled again, probably because bulkload.yaml and bulkload_helper.py need to be updated. I'll open a new issue for that - this one is carrying too many little subissues already! Please reopen this one if loader.py and metagen.py aren't accessing the Fusion Table correctly in any way.