For the sake of argument (and for the sake of using pre-built tools), how about the following (which involves slightly different tables): use `raster2pgsql` to generate an SQL file. Ingest `global_meta` and return the `gid`. Then ingest `netcdf_meta` with a `gid` column which is a one-to-one foreign key to `global_meta`. Then `netcdf_vars` and a `netcdf_vars__netcdf_meta` table which is a many-to-many map between the two netcdf tables. Then use `re` to alter the SQL file, and ingest the rasters into a table that also contains a `gid` column.
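A minimal sketch of the `re` step, assuming `raster2pgsql` emits one-line `INSERT` statements of roughly this shape (the table name, column list, and raster payload here are illustrative, not the real `out.sql`):

```python
import re

def add_gid_column(sql_text, gid):
    """Rewrite raster2pgsql-style INSERTs so each row also carries a gid.

    Assumes (hypothetically) statements of the form:
        INSERT INTO "raster_test" ("rast") VALUES ('...'::raster);
    """
    sql_text = re.sub(r'\("rast"\)', '("rast","gid")', sql_text)
    sql_text = re.sub(r"::raster\);", "::raster,%d);" % gid, sql_text)
    return sql_text

sample = """INSERT INTO "raster_test" ("rast") VALUES ('0100'::raster);"""
print(add_gid_column(sample, 7))
# -> INSERT INTO "raster_test" ("rast","gid") VALUES ('0100'::raster,7);
```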
I could never get `raster2pgsql` to work with the NetCDF files I have, so I don't know what the SQL looks like. So this may be impossible (or at least implausible).
The downside to writing our own Python is that we have to ensure it continues to work for future versions of PostgreSQL and PostGIS.
i see, yes, this should work. let me do that now.
very stupid question: when you ran `raster2pgsql`, was the path to the NetCDF file correct? (just to be really sure)
Not a stupid question, but yes I did check the path. I even tweaked the permissions in various ways but still no go.
Seems easy enough to tweak with `re`.
yes, i uploaded the file into our `ede` folder on the machine you set up (`out.sql`). I'm parsing it now. yep, just some `re` and we should be good to go.
@legendOfZelda I've downloaded the `out.sql` file and tried to load it on my local machine. I got a message that the `raster_test` relation doesn't exist. Did you use a different table structure from `models.py`?
@ricardobarroslourenco yes, i renamed the table to `rests_tests` at some point. you can of course rename it manually in `out.sql` to make it work for your table, and also add additional fields if you need to, which is what @njmattes and i discussed above.
@legendOfZelda Alright, now loaded. Just a question: in the `raster_test` table, is each line a tile?
yes, each `rast` field in a row of `raster_test` is a tile, and that tile has all the bands (i.e. all the time frames in our setting). that's how `raster2pgsql` fills it in.
Alright. Just to remind myself: the `rid` of each tile would be gathered in the `NetCDF_Linker` table, in terms of all tiles of the same netCDF; then, using the respective `gid`, we relate it to the netCDF header data stored in the `NetCDF_Meta` table in the `meta_info` column, right?
this was the schema we agreed on last monday:

```
global_meta    gid  filename  size  date of creation  date of insertion  date of last access
netcdf_meta    mid  json  gid
netcdf_data    rid  rast
netcdf_linker  gid  vid  rid
netcdf_vars    vid  name
```
however, because of the problems @njmattes and i just discussed above (the main issue being that `raster2pgsql` uses its own postgres session, different from the one we use within `ingest.py`, leading to potential inconsistencies in the tables / interleavings of transactions, and that i cannot easily get the `rid`s created by auto-increment back into python when running `raster2pgsql`), we decided to use this schema instead:
```
global_meta  gid  filename  size  date of creation  date of insertion  date of last access
netcdf_meta  mid  json  gid
netcdf_data  rid  mid  vid  rast
netcdf_vars  vid  name
```
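For concreteness, the revised schema might look like the following DDL. This is a simplified sqlite stand-in just to check the shape; the real Postgres tables would use `SERIAL` primary keys, a JSON column for the metadata, and the PostGIS `raster` type for `rast`:

```python
import sqlite3

# Simplified, illustrative DDL mirroring the agreed schema.
DDL = """
CREATE TABLE global_meta (
    gid INTEGER PRIMARY KEY,
    filename TEXT, size INTEGER,
    date_created TEXT, date_inserted TEXT, date_accessed TEXT);
CREATE TABLE netcdf_meta (
    mid INTEGER PRIMARY KEY,
    json TEXT,
    gid INTEGER REFERENCES global_meta(gid));
CREATE TABLE netcdf_vars (
    vid INTEGER PRIMARY KEY,
    name TEXT);
CREATE TABLE netcdf_data (
    rid INTEGER PRIMARY KEY,
    mid INTEGER REFERENCES netcdf_meta(mid),
    vid INTEGER REFERENCES netcdf_vars(vid),
    rast BLOB);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
tables = sorted(r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'"))
print(tables)
```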
@njmattes correct me if I'm wrong here with this schema. the way ingestion now works is that we ingest into `global_meta` and get the `gid` returned, then ingest into `netcdf_meta` and get the `mid` returned, then ingest into `netcdf_vars` and get the `vid`. finally, having both `mid` and `vid`, we ingest into `netcdf_data` using `raster2pgsql`, but first we modify `out.sql` by appending `mid` and `vid` manually before ingesting.
these individual ingestions will take multiple rounds between `ingest.py` and postgres, and i'm using raw SQL queries with SQLAlchemy or `psycopg2` directly for that.
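The raw-SQL round trips could look roughly like this with `psycopg2`. The column lists are abbreviated, the helper name is made up, and the connection is assumed to exist, so treat it as a sketch rather than the actual `ingest.py` code:

```python
# Hypothetical helper: thread the gid from global_meta into netcdf_meta
# via RETURNING, so the client never guesses auto-incremented ids.
INSERT_GLOBAL = ("INSERT INTO global_meta (filename, size) "
                 "VALUES (%s, %s) RETURNING gid")
INSERT_META = ("INSERT INTO netcdf_meta (json, gid) "
               "VALUES (%s, %s) RETURNING mid")

def ingest_headers(conn, filename, size, meta_json):
    """Insert file-level then NetCDF-level metadata, returning (gid, mid)."""
    with conn.cursor() as cur:
        cur.execute(INSERT_GLOBAL, (filename, size))
        gid = cur.fetchone()[0]          # id generated by the insert
        cur.execute(INSERT_META, (meta_json, gid))
        mid = cur.fetchone()[0]
    conn.commit()
    return gid, mid
```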
Ok. Just one doubt: what is the difference between `mid` and `vid`?
This is my understanding, slightly different than @legendOfZelda's above.
```
global_meta
  id            | int  | primary key
  filename      | str  | name of source file
  filesize      | int  | size of source file
  date_created  | date | date file was created
  date_inserted | date | date data was inserted into db
  date_accessed | date | date data was last accessed by user

netcdf_meta
  id       | int  | primary key
  gid      | int  | foreign key to global_meta
  metadata | json | NetCDF metadata as JSON blob

netcdf_data  // Should this be netcdf_data or global_data? Is all gridded data stored in the same format?
  id     | int | primary key
  gid    | int | foreign key to global_meta
  vid    | int | foreign key to netcdf_vars
  raster | bin | binary blob of all bands in tile

netcdf_vars  // Again, are these actually NetCDF or global?
  id   | int | primary key
  name | str | name of variable
```
i c, so the same variable can appear in a non-netcdf file, e.g. in a csv or another format. i should add here that SQLAlchemy complains if there are no primary keys for a table, so we always have to have one. however, we might be able to get around that by defining a unique index on multiple columns, i.e. a composite unique index, but I'm looking into that.
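In SQLAlchemy terms this would be a `UniqueConstraint` over several columns alongside the mandatory surrogate primary key. A quick sqlite sketch of the intended behaviour (table and column names follow the linker schema from this thread):

```python
import sqlite3

# Surrogate primary key keeps the ORM happy; the composite UNIQUE
# constraint enforces one row per (gid, vid, rid) combination.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE netcdf_linker (
    id INTEGER PRIMARY KEY,
    gid INTEGER, vid INTEGER, rid INTEGER,
    UNIQUE (gid, vid, rid))""")
conn.execute("INSERT INTO netcdf_linker (gid, vid, rid) VALUES (1, 2, 3)")
try:
    conn.execute("INSERT INTO netcdf_linker (gid, vid, rid) VALUES (1, 2, 3)")
    outcome = "duplicate accepted"
except sqlite3.IntegrityError:
    outcome = "duplicate rejected"
print(outcome)
```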
Yes, I think the only thing that's unique to NetCDF is the `netcdf_meta`.
the ESRI Shapefile standard can also have metadata. Perhaps storing the JSON in `global_meta` could be interesting.
Another thing: shouldn't we include a `filetype` column in the `global_meta` table? Or would we parse it from the `filename`?
If we're able to assume that any file that will contain metadata is able to store that metadata as a JSON blob, then yes, we could store it in the `global_meta` table, and store the filetype in a separate column.
yes, that makes sense: adding another column, `filetype`, to `global_meta`, and moving the JSON blob from `netcdf_meta` to `global_meta`, thus making `netcdf_meta` redundant. JSON should work for pretty much any data we're working with: most metadata is of key-value form, and JSON covers that as well as arrays and nested JSON, so we should be good.
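Moving the blob to `global_meta` just means serialising whatever header the format hands us. For example (with made-up values), a NetCDF-style header with key-value pairs, an array, and a nested object all round-trip through one JSON column:

```python
import json

# Illustrative (made-up) NetCDF-style header flattened to one JSON blob.
meta = {
    "Conventions": "CF-1.6",                        # key-value
    "dimensions": {"lat": 360, "lon": 720, "time": 1140},  # nested
    "variables": ["yield", "lat", "lon", "time"],   # array
}
blob = json.dumps(meta, sort_keys=True)  # what would sit in the column
restored = json.loads(blob)
print(restored["dimensions"]["lat"])
# -> 360
```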
@njmattes the `netcdf_data` table however must certainly contain not only `gid` but also `vid`, because a `rast` field is for a specific variable within a specific netcdf. how about:

```
netcdf_data  rid  gv_id  rast
vars_meta    gv_id  gid  vid
```

i know this looks cumbersome and makes one want to get rid of this additional `gv_id`, but we really need to know not only the `gid` but also the `vid` a `rast` field belongs to (`raster2pgsql` ingests one variable at a time, so each `rast` field is for a specific variable within a specific netcdf).
Looking at the GDAL supported raster formats, there are 14 formats that include metadata. I think we should apply your suggestion, @njmattes.
i agree with your schema @njmattes. for now i'm using `netcdf_vars` + `netcdf_data` instead of `global_...` until we see more need for calling it `global_...` because of other kinds of data formats.
@legendOfZelda When possible, could you grant me access to the database? When I tried to reach it through psql, my role didn't exist.
@ricardobarroslourenco I think this should work: `sudo createuser -sdr -U postgres rlourenco`
Looking at the server at this moment, it looks like the `var/www/ede` directory holds the code from this repo (`RDCEP/EDE`), and the `ede_test` database is initialized from that? Should we also initialize a database with the new layout we've been talking about, which is in the `legendOfZelda/ede_test` repo?
it's kinda confusing at the moment with the naming of the database:

- `RDCEP/EDE` uses `ede_test` right now, as you can see in `settings.py`
- `legendOfZelda/ede_test` uses `ede`, as you can see in `create_tables.py`

so it's kinda flipped. yes, i suggest we `git clone https://github.com/legendOfZelda/ede_test` into `var/www/ede` as well, use only that for development, and use `RDCEP/EDE` just to copy and paste (possibly) some flask / html / js code from. or even simpler, we can remove `RDCEP/EDE` and only work with `ede_test`.
Where's the virtualenv that we're using?
I think it is at: `/var/www/ede/EDE/venv`
Just an update: I've activated `pgAdmin` support. It is listening on port `5432` as usual. To connect, you need to set up a password for your db account.
is it mixed with garlic? looks very good
Garlic, shallots, almonds. With chipotle peppers, and cilantro.
i c, very good :+1:
I won't hazard a guess as to its pH.
Pretty nice dish :)
@njmattes @ricardobarroslourenco im resuming work this afternoon, have to get some other things done in the morning first. hope to get postgis up and running asap.
@ricardobarroslourenco: i installed `postgis` from source and `raster2pgsql` is now there. just a little question: when i tried to run `python create_tables.py`, it gave me an `ImportError: No module named ede_test.schema.models`. do you get the same error? when i set `PYTHONPATH`, it of course started to work, but our plan was to avoid setting the `PYTHONPATH`. how come `python create_tables.py` still throws such an import error even after our directory restructuring? i can probably work it out myself, but if you have a quick answer that would be great.
@legendOfZelda this is weird. Looking into the code, it seems that everything is ok. When you are in the `venv`, is there any preset `PYTHONPATH`?
When out of it, it is already set:

```
rlourenco@stockhausen:/$ python -c "import sys; print sys.path"
['', '/usr/lib/python2.7', '/usr/lib/python2.7/plat-x86_64-linux-gnu', '/usr/lib/python2.7/lib-tk', '/usr/lib/python2.7/lib-old', '/usr/lib/python2.7/lib-dynload', '/usr/local/lib/python2.7/dist-packages', '/usr/lib/python2.7/dist-packages']
```
@njmattes any hint on this?
`echo $PYTHONPATH` shows nothing, independent of whether i'm inside the virtual environment or outside of it. `python -c "import sys; print sys.path"` gives me

```
['', '/usr/lib/python2.7', '/usr/lib/python2.7/plat-x86_64-linux-gnu', '/usr/lib/python2.7/lib-tk', '/usr/lib/python2.7/lib-old', '/usr/lib/python2.7/lib-dynload', '/usr/local/lib/python2.7/dist-packages', '/usr/lib/python2.7/dist-packages']
```

when i'm outside of the virtual environment and

```
['', '/var/www/ede_test/venv/lib/python2.7', '/var/www/ede_test/venv/lib/python2.7/plat-x86_64-linux-gnu', '/var/www/ede_test/venv/lib/python2.7/lib-tk', '/var/www/ede_test/venv/lib/python2.7/lib-old', '/var/www/ede_test/venv/lib/python2.7/lib-dynload', '/usr/lib/python2.7', '/usr/lib/python2.7/plat-x86_64-linux-gnu', '/usr/lib/python2.7/lib-tk', '/var/www/ede_test/venv/local/lib/python2.7/site-packages', '/var/www/ede_test/venv/lib/python2.7/site-packages']
```

when i'm inside. ok, I'm ingesting the netcdfs from atlas now, plus a couple more from the root repo @abrizius gave me; should be done within some reasonable time (just depending on how long ingestion takes).
Dunno why it's not working as expected, but I've just added a line to `venv/bin/activate` that modifies the `PYTHONPATH` when the virtual environment is activated. Should work for now.
hm, that's what i didn't understand. i thought @ricardobarroslourenco's directory restructuring would also mean the `PYTHONPATH` wouldn't have to be set anymore... i know it's a small detail, but I'm still curious, and it probably pays off to learn it once and for all. anyway, I'm ingesting the netcdfs now and will let you know when that's done so we can move forward.
I don't know. Perhaps it is more a `venv` issue than a dependency/coding problem.
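One likely explanation for the `ImportError`: `python create_tables.py` puts the *script's directory* on `sys.path`, not the project root, so a package import like `ede_test.schema.models` only resolves once the root is added, which is exactly what setting `PYTHONPATH` does. A small reproduction with a made-up package name:

```python
import os
import sys
import tempfile

# Build an empty package in a temp dir that is NOT yet on sys.path.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "mypkg"))               # stand-in for ede_test
open(os.path.join(root, "mypkg", "__init__.py"), "w").close()

try:
    import mypkg                        # parent dir not on sys.path yet
    first = "importable"
except ImportError:
    first = "not importable"

sys.path.insert(0, root)                # what PYTHONPATH would do
import mypkg
print(first, "-> importable")
```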
@njmattes @ricardobarroslourenco I've ingested the 112 netcdfs from the atlas `full_global` directory plus the 109 netcdfs @abrizius pointed me to. let me know if there are any issues with the current schema we're using and/or whether my ingestion did something wrong.
the next step now is #9, which i already checked with the dummy raster in the postgis raster documentation (http://postgis.net/docs/manual-2.1/RT_reference.html), but feel free to convince yourself.
i'm going to implement temporal aggregation within postgis, and together with the queries in #9 we should have all the queries we need.
As we agreed, the tables we are now working with are:
https://github.com/legendOfZelda/ede_test/blob/master/schema/models.py
However, there's a little problem now. When we e.g. ingest into `netcdf_meta`, the `gid` must be the same as in the corresponding row in `global_meta`, and similarly for the variable and raster IDs in the other tables. There are basically two ways to autopropagate these IDs: using multiple message exchanges between the `ingest.py` client and the Postgres server, or using triggers etc. within Postgres.

If we use multiple message exchanges, I always need to get back the ID I just ingested so I can use it in the subsequent query. I.e. I would first ingest into `global_meta`, returning the `gid`, and with that ingest into `netcdf_meta`, etc. For making a SQL query return a field (here an ID), I'd have to use `psycopg2` directly AFAICS. That wouldn't be too much of a hassle, but the real problem is that we're using `raster2pgsql`, which is run via the `subprocess` module and thus has its own Postgres session, so I cannot make it return the raster IDs it just ingested in order to populate the `netcdf_linker` table. So the big issue is that we're running `raster2pgsql` as a separate executable and not fully through Python.

If we use the second approach instead, i.e. triggers, defaults etc. within Postgres, I would have to set, as the default for the `gid` in the `netcdf_meta` table, the `gid` that was last ingested into `global_meta`, to make it be autopropagated. But this approach has quite some disadvantages. First, I cannot fully confirm yet that all the autopropagations we need can be done this way. Second, even if we can, it makes it impossible later to ingest netcdfs in parallel (i.e. to run multiple instances of `ingest.py` on different netcdfs) or to append bands to an already existing raster.

Another approach is to run `raster2pgsql` on its own without ingesting, i.e. just producing the `.sql` file. I can then open that `.sql` file through Python, modify the query by making it return the raster IDs as well, and execute it in raw form. I would then get back the raster IDs within Python and could use them to fill up `netcdf_linker` with the correct IDs. That would be the last message exchange between client and server in the first approach, i.e. the order in which I'd fill the tables is: `global_meta`, `netcdf_meta`, `netcdf_vars`, `netcdf_data`, `netcdf_linker`. However, again, this approach won't work because `ingest.py` and `raster2pgsql` use different sessions.

Actually, there is a much better approach, which is to do it all within Python, and this coincides with what @ricardobarroslourenco said in #7. It turns out the `raster2pgsql` C executable used to be a Python script in a former version of PostGIS (https://trac.osgeo.org/postgis/browser/trunk/raster/scripts/python/raster2pgsql.py). It's probably not compatible anymore with the current PostGIS version, but I can make it compatible again, and then we can do it all within Python. How about this approach?

@raffmont @njmattes @ricardobarroslourenco I strongly suggest we use the last approach.
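Whichever route we take, the `.sql` rewrite for getting the raster IDs back could be as simple as appending `RETURNING rid` to each generated statement; each modified statement would then be executed from Python and its returned row fetched. A sketch (the statement shape is again illustrative, not the real `out.sql`):

```python
import re

def add_returning(sql_text):
    """Append RETURNING rid to each INSERT so the client gets the ids back."""
    return re.sub(r"(INSERT INTO[^;]+)\);", r"\1) RETURNING rid;", sql_text)

sample = ('INSERT INTO "netcdf_data" ("rast") VALUES (\'01\'::raster);\n'
          'INSERT INTO "netcdf_data" ("rast") VALUES (\'02\'::raster);')
out = add_returning(sample)
print(out.count("RETURNING rid"))
# -> 2
```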