RDCEP / EDE

MIT License
2 stars 1 forks source link

automatic propagation of IDs in tables #12

Closed ghost closed 8 years ago

ghost commented 8 years ago

As we agreed, the tables we are now working with are:

https://github.com/legendOfZelda/ede_test/blob/master/schema/models.py

However, there's a little problem now. When we e.g. ingest into netcdf_meta the gid must be the same as the corresponding row in global_meta and similarly for the variable + raster IDs for the other tables.

Now there are basically two solutions two autopropagate these IDs: using multiple message exchanges between the ingest.py client and the Postgres server, or using triggers etc. within Postgres.

If we use multiple message exchanges I always need to get the ID I just ingested so I can use it for the subsequent query, i.e. i would first ingest into global_meta returning the gid and with that ingest into netcdf_meta etc. however, for making a SQL query return a field (here an ID) i'd have to use psycopg2 directly AFAICS. now that wouldn't be too much of a hassle but the real problem is that we're using raster2pgsql which is run using the subprocess module and thus has its own postgres session and i cannot make it return to me the raster IDs it just ingested to populate those to the netcdf_linker table.

so the big issue is that we're running raster2pgsql as a separate executable and not fully through python.

now if we use the second approach, i.e. triggers, defaults etc within postgres then i would have to set for the gid in the netcdf_meta table as the default the gid that was last ingested into global_meta to make it be autopropagated. but this approach has quite some disadvantages. first, i cannot fully confirm yet that all the autopropagations we need can be done this way. second, even if we can, then this makes it impossible later to ingest netcdfs in parallel (i.e. to run multiple instances of ingest.py on different netcdfs) or to append bands to an already existing raster.

another approach we can choose is to run raster2pgsql on its own, i.e. without yet ingesting, i.e. just producing the .sql file. i can then open that .sql file through python, modify that query by making it return the raster IDs as well, and then execute it in raw form. then i would get back the raster IDs within python and can use them to fill up netcdf_linker with the correct IDs. that would then be the last message exchange between client and server in the first approach, i.e. the order in which i'd fill the tables are: global_meta, netcdf_meta, netcdf_vars, netcdf_data, netcdf_linker. however, again this approach won't work because ingest.py and raster2pgsql use different sessions.

actually, there is a much better approach which is todo it all within python, and this coincides with what @ricardobarroslourenco said in #7. turns out the raster2pgsql C-executable used to be a python script in a former version of PostGIS (https://trac.osgeo.org/postgis/browser/trunk/raster/scripts/python/raster2pgsql.py). it's probably not compatible anymore with the current PostGIS version but i can make it compatible again and then we can do it all within python. how about this approach?

@raffmont @njmattes @ricardobarroslourenco i strongly suggest we use the last approach.

njmattes commented 8 years ago

For the sake of argument (and for the sake of using pre-built tools), how about the following (which involves slightly different tables):

Use raster2pgsql to generate an SQL file. Ingest global_meta and return the gid. Then ingest netcdf_meta with a gid column which is a one-to-one foreign key to global_meta. Then netcdf_vars and a netcdf_vars__netcdf_meta table which is a many-to-many map between the two netcdf tables. Then use re to alter the SQL file, and ingest the rasters into a table that also contains a gid column.

I could never get raster2pgsql to work with the NetCDF files I have so I don't know what the SQL looks like. So this may be impossible (or at least implausible).

The downside to writing our own python is that we have to ensure it continues to work for future version of PostgreSQL and PostGIS.

ghost commented 8 years ago

i see, yes, this should work. let me do that now.

very stupid question: when you ran raster2pgsql was the path to the netcdffile correct? (just to be really sure)

njmattes commented 8 years ago

Not a stupid question, but yes I did check the path. I even tweaked the permissions in various ways but still no go.

njmattes commented 8 years ago

Seems easy enough to tweak with re.

ghost commented 8 years ago

yes, i uploaded the file into our ede folder on the machine you setup (out.sql). I'm parsing it now. yep, just some re and we should be good to go.

ricardobarroslourenco commented 8 years ago

@legendOfZelda I've downloaded the out.sqlfile, and tried to load it on my local machine. I've got a message that raster_test relation doesn't exists. Did you use a different table structure from models.py ?

ghost commented 8 years ago

@ricardobarroslourenco yes, i renamed the table to rests_tests at some point. you can of course rename it manually in out.sql to make it work for your table, also if you need to add additional fields which is what me and @njmattes discussed above.

ricardobarroslourenco commented 8 years ago

@legendOfZelda Alright, now loaded. Just a question. On the raster_test table, is each line a tile?

ghost commented 8 years ago

yes, each rast field in a row of raster_test is a tile and that tile has all the bands (i.e. all the time frames in our setting). that's how raster2pgsql fills it in.

ricardobarroslourenco commented 8 years ago

Alright. Just to remind myself, the rid of each tile would be gathered on the NetCDF_Linker table, in terms of all tiles of a same netCDF; then using respective's gid to relate with the netCDF header data stored on the NetCDF_Meta table at the meta_info column, right?

ghost commented 8 years ago

this was the schema we agreed on last monday:

global_meta gid filename size date of creation date of insertion date of last access

netcdf_meta mid json gid

netcdf_data rid rast

netcdf_linker gid vid rid

netcdf_vars vid name

however, because of the problems me and @njmattes just discussed above (the main issue we discussed was that raster2pgsql uses its own postgres session, different from the one we use within ingest.py leading to potential inconsistencies in the tables / interleavings of transactions / i cannot easily get the rids (that are created due to auto increment) when running raster2pgsql, into python) we decided to use this schema instead:

global_meta gid filename size date of creation date of insertion date of last access

netcdf_meta mid json gid

netcdf_data rid mid vid rast

netcdf_vars vid name

@njmattes correct me if I'm wrong here with this schema. the way ingestion now works is that we ingest into global_meta + get the gid returned, then ingest into netcdf_meta and get the mid returned, then ingest into netcdf_vars and get the vid. finally, having both mid and vid we ingest into netcdf_data using raster2pgsql but first having modified the out.sql by appending mid and vid manually into out.sql before ingesting.

these individual ingestions will take multiple rounds between ingest.py and postgres and i'm using raw SQL queries with SQLAlchemy or psycopg2 directly for that.

ricardobarroslourenco commented 8 years ago

Ok. Just one doubt. What is different between mid and vid?

njmattes commented 8 years ago

This is my understanding, slightly different than @legendOfZelda's above.

global_meta

id            | int  | primary key
filename      | str  | name of source file
filesize      | int  | size of source file
date_created  | date | date file was created
date_inserted | date | date data was inserted into db
date_accessed | date | data data was last accessed by user

netcdf_meta

id       | int  | primary key
gid      | int  | foreign key to global_meta
metadata | json | NetCDF metadata as JSON blob

netcdf_data // Should this be netcdf_data or global_data? Is all gridded data stored in the same format?

id     | int | primary key
gid    | int | foreign key to global_meta
vid    | int | foreign key to netcdf_vars
raster | bin | binary blob of all bands in tile

netcdf_vars // Again, are these actually NetCDF or global?

id   | int | primary key
name | str | Name of variable
ghost commented 8 years ago

i c so the same variable can appear in a non-netcdf file, e.g. in a csv or another. i should add here that SQLAlchemy complains if there are no primary keys for a table, so we always have to have one. however, we might be able to get around that by defining a unique index on multiple columns, i.e. a composite unique index, but I'm looking into that.

njmattes commented 8 years ago

Yes, I think the only thing that's unique to NetCDF is the netcdf_meta.

ricardobarroslourenco commented 8 years ago

the ESRI Shapefile standard could also have metadata. Perhaps storing the JSON in the global_meta could be interesting.

Other thing. Shouldn't we include in the global_meta table a column filetype? Or we would parse it from the filename?

njmattes commented 8 years ago

If we're able to assume that any file that will contain metadata is able to stare that metadata as a JSON blob, then yes we could store it on the global_meta table, and store filetype in a separate column.

ghost commented 8 years ago

yes, that makes sense: adding another column : filetype to global_meta + moving the JSON blob from netcdf_meta to global_meta thus making netcdf_meta redundant. JSON should work for pretty much any data we're working with because most metadata is of key-value form and JSON certainly covers that but also arrays + nested JSONS so we should be good.

@njmattes the netcdf_data table however must certainly contain not only gid but also vid because a rast field is for a specific variable within a specific netcdf how about:

netcdf_data rid gv_id rast

vars_meta gv_id gid vid

i know this looks cumbersome and makes one want to get rid of this additional gv_id but we really need to know not only the gid but also the vid a rast field belongs to (raster2pgsql ingests one variable at a time and thus each rast field is for a specific variable (within a specific netcdf)).

ricardobarroslourenco commented 8 years ago

Looking at the GDAL supported raster formats there are 14 formats that include metadata. I think we should apply your suggestion @njmattes .

ghost commented 8 years ago

i agree with your schema @njmattes 2. for now i'm using netcdf_vars + netcdf_data instead of global_... until we see more need for calling it global_... because of other kinds of data formats.

ricardobarroslourenco commented 8 years ago

@legendOfZelda When it is possible, could you grant me access to the database? When I tried to reach it through psql, my role didn' t exist.

njmattes commented 8 years ago

@ricardobarroslourenco I think this should work: sudo createuser -sdr -U postgres rlourenco

njmattes commented 8 years ago

Looking at the server at this moment, it looks like the var/www/ede directory holds the code from this repo (RDCEP/EDE), and the ede_test database is initialized from that? Should we also initialize a database with the new layout we've talking about that's in the legendOfZelda/ede_test repo?

ghost commented 8 years ago

it's kinda confusing at the moment with the naming of the database:

so it's kinda flipped. yes, i suggest we git clone https://github.com/legendOfZelda/ede_test into var/www/ede as well and only use that for development and use RDCEP/EDE to copy and paste (possibly) some flask / html / js code from. or even simpler, we can remove RDCEP/EDE and only work with ede_test.

njmattes commented 8 years ago

Where's the virtualenv that we're using?

ricardobarroslourenco commented 8 years ago

I think it is at:/var/www/ede/EDE/venv

ricardobarroslourenco commented 8 years ago

Just an update, I've activated pgAdmin support. It is listening at port 5432 as usual. For connection is needed to setup a password for your db account.

njmattes commented 8 years ago

img_1647

ghost commented 8 years ago

is it mixed with garlic? looks very good

njmattes commented 8 years ago

Garlic, shallots, almonds. With chipotle peppers, and cilantro.

ghost commented 8 years ago

i c, very good :+1:

njmattes commented 8 years ago

I won't hazard a guess as to its pH.

ricardobarroslourenco commented 8 years ago

Pretty nice dish :) On Feb 29, 2016 8:51 PM, "Nathan Matteson" notifications@github.com wrote:

I won't hazard a guess as to its pH.

— Reply to this email directly or view it on GitHub https://github.com/RDCEP/EDE/issues/12#issuecomment-190508421.

ghost commented 8 years ago

@njmattes @ricardobarroslourenco im resuming work this afternoon, have to get some other things done in the morning first. hope to get postgis up and running asap.

ghost commented 8 years ago

@ricardobarroslourenco : i installed postgis from source and raster2pgsql is now there. just a little question: when i tried to run python create_tables.py it gave me a ImportError: No module named ede_test.schema.models. do you get the same error? when i set PYTHONPATH it of course started to work but our plan was to avoid setting the PYTHONPATH. how come python create_tables.py still throws such an import error even after our directory restructuring. i can read it myself probably but if you have a quick answer that would be great.

ricardobarroslourenco commented 8 years ago

@legendOfZelda this is weird. Looking into the code, it seems that everything is ok. When you are in the venv is there any preset of PYTHONPATH?

When out of it, it is already set:

rlourenco@stockhausen:/$ python -c "import sys; print sys.path" ['', '/usr/lib/python2.7', '/usr/lib/python2.7/plat-x86_64-linux-gnu', '/usr/lib/python2.7/lib-tk', '/usr/lib/python2.7/lib-old', '/usr/lib/python2.7/lib-dynload', '/usr/local/lib/python2.7/dist-packages', '/usr/lib/python2.7/dist-packages']

@njmattes any hint on this?

ghost commented 8 years ago

echo $PYTHONPATH shows nothing, independent of whether i'm inside the virtual environment or outside of it. python -c "import sys; print sys.path" gives me

['', '/usr/lib/python2.7', '/usr/lib/python2.7/plat-x86_64-linux-gnu', '/usr/lib/python2.7/lib-tk', '/usr/lib/python2.7/lib-old', '/usr/lib/python2.7/lib-dynload', '/usr/local/lib/python2.7/dist-packages', '/usr/lib/python2.7/dist-packages']

when i'm outside of the virtual environment and

['', '/var/www/ede_test/venv/lib/python2.7', '/var/www/ede_test/venv/lib/python2.7/plat-x86_64-linux-gnu', '/var/www/ede_test/venv/lib/python2.7/lib-tk', '/var/www/ede_test/venv/lib/python2.7/lib-old', '/var/www/ede_test/venv/lib/python2.7/lib-dynload', '/usr/lib/python2.7', '/usr/lib/python2.7/plat-x86_64-linux-gnu', '/usr/lib/python2.7/lib-tk', '/var/www/ede_test/venv/local/lib/python2.7/site-packages', '/var/www/ede_test/venv/lib/python2.7/site-packages']

when i'm inside. ok, I'm ingesting the netcdfs from atlas now + a couple more from the root repo @abrizius gave me, should be done within some reasonable time (just depending on how long ingestion takes)

njmattes commented 8 years ago

Dunno why it's not working as expected, but I've just added a line to the venv/bin/activate that modifies the PYTHONPATH when the virtual environment is activated. Should work for now.

ghost commented 8 years ago

hm, that's what i didn't understand. i thought @ricardobarroslourenco 's directory restructuring would also achieve that the PYTHONPATH wouldn't have to be set anymore...i know it's a stupid detail but I'm still curious and probably pays off to learn it once and for all. but anyway, I'm ingesting the netcdfs now and let you know when that's done so we can move forward.

ricardobarroslourenco commented 8 years ago

I don't know. Perhaps is more a venv issue, rather than a dependency/coding problem.

ghost commented 8 years ago

@njmattes @ricardobarroslourenco I've ingested the 112 netcdfs from the atlas full_global directory + 109 netcdfs @abrizius pointed me to. let me know if there's any issues with the current schema we're using and/or whether my ingestion did something wrong.

the next step now is #9 which i already checked with the dummy raster in the postgis raster documentation: http://postgis.net/docs/manual-2.1/RT_reference.html, but feel free to convince yourself. i'm going to implement temporal aggregation within postgis and together with the queries in #9 we should have all the queries we need.