RDCEP / EDE


replace raster2pgsql with your own version (if not too cumbersome) #7

Closed ghost closed 8 years ago

ghost commented 8 years ago

raster2pgsql is not flexible enough for our needs, for a few reasons.

I am now using the Python netCDF4 and SQLAlchemy libraries for a custom ingestion of a netCDF's actual data. This has the advantage that the entire ingestion, i.e. data plus metadata, is handled within Python, rather than the metadata in Python and the actual data through the raster2pgsql command-line tool. That way it's more centralized and easier to roll back if there's a failure, say when ingesting the data but not the metadata (in which case we have to undo the ingestion of the metadata).

Besides, doing this also brings me closer to answering @ricardobarroslourenco's question about what he will actually have to work with, i.e. whether it's a numpy array or something else.
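
As a rough sketch of what that looks like (the model below is a hypothetical stand-in, not the actual EDE schema; the point is only that metadata and data live in one transaction):

    from netCDF4 import Dataset
    from sqlalchemy import Column, Integer, String
    from sqlalchemy.ext.declarative import declarative_base
    from sqlalchemy.orm import sessionmaker

    Base = declarative_base()

    class NetCDFMetaSketch(Base):          # hypothetical stand-in for netcdf_meta
        __tablename__ = 'netcdf_meta_sketch'
        id = Column(Integer, primary_key=True)
        name = Column(String)
        n_lat = Column(Integer)
        n_lon = Column(Integer)

    def ingest_netcdf(path, engine):
        nc = Dataset(path, 'r')
        lats = nc.variables['lat'][:]      # plain numpy arrays
        lons = nc.variables['lon'][:]
        session = sessionmaker(bind=engine)()
        try:
            session.add(NetCDFMetaSketch(name=path, n_lat=len(lats), n_lon=len(lons)))
            # ... ingest the actual variable values in the same transaction ...
            session.commit()
        except Exception:
            session.rollback()             # metadata and data are undone together
            raise
        finally:
            session.close()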

njmattes commented 8 years ago

Bummer about raster2pgsql—always nice to be able to use existing tools, but it does sound like it'll be better in the long run to have our own ingester. And probably if we focus on Joshua's datasets to begin, it'll be pretty quick to spin it up.

ghost commented 8 years ago

Another point: raster2pgsql does extract the spatial resolution, e.g. (0.5, 0.5) degrees for (lat, lon) in some of the datasets I've tried, and stores it in a designated table called raster_columns in the fields scale_x and scale_y. But those fields are NULL (or some other value of their choosing) if the rasters in a column do not all have the same spatial resolution; the behavior of PostGIS's raster_columns table is described in more detail here: http://postgis.net/docs/manual-2.2/using_raster_dataman.html#RT_Raster_Columns. Since we are likely going to have rasters of different resolutions in the same rast column of netcdf_data, scale_x and scale_y become useless, and we would have to create our own table to keep track of the different scale_x and scale_y values anyway. Further down the road we might even encounter a single netCDF that does not have uniform latitude/longitude steps, i.e. not a regular (lat, lon) grid. But certainly in the near future we're going to have different netCDFs with different (lat, lon) resolutions, and already for that case, as described above, raster2pgsql's scale_x / scale_y handling is of no use.

So we need more control in determining the resolution and whether it's even well-defined, i.e. whether the lat and lon step sizes are uniform at all. That is another reason why we are better off using Python's netCDF4 library to analyze the lat and lon numpy arrays (yes, they are just numpy arrays, so they will be a charm to work with; see http://www.hydro.washington.edu/~jhamman/hydro-logic/blog/2013/10/12/plot-netcdf-data/).
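
A minimal sketch of that uniformity check (assuming the coordinate variables are named lat and lon, as in our files):

    import numpy as np
    from netCDF4 import Dataset

    nc = Dataset('papsim.nc4', 'r')
    dlat = np.diff(nc.variables['lat'][:])
    dlon = np.diff(nc.variables['lon'][:])

    # the grid is regular only if all steps along each axis are (nearly) identical
    regular = np.allclose(dlat, dlat[0]) and np.allclose(dlon, dlon[0])
    resolution = (abs(float(dlat[0])), abs(float(dlon[0]))) if regular else None
    print(regular, resolution)             # e.g. True (0.5, 0.5) for a half-degree grid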

ghost commented 8 years ago

Extracting the netCDF data as numpy arrays, i.e. separate arrays for lat, lon, time, and the variable(s), is easy and is described in http://www.hydro.washington.edu/~jhamman/hydro-logic/blog/2013/10/12/plot-netcdf-data/.

However, I'm currently working on ingesting these numpy arrays into Postgres with SQLAlchemy (instead of raster2pgsql), in particular using its RasterElement type, but that's not quite working yet because of https://github.com/geoalchemy/geoalchemy2/issues/87 (the plan is to wrap a numpy array in a RasterElement object and then just use SQLAlchemy's session.add() etc. to ingest it into netcdf_data).

ghost commented 8 years ago

Using Python directly instead of raster2pgsql turns out to be nasty, because when I read the raster field back from PostGIS I get a binary format in Python for which there is no parsing tool available yet. Also, I just managed to get raster2pgsql to work on one of @abrizius's files: the problem seemed to be that raster2pgsql can't handle multiple variables at once, so I first extracted the individual variables with the ncks command-line tool.
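
For reference, the per-variable extraction with ncks looks like this (file and output names are just examples; yield_whe is one of the variable names from the ncdump output further down in this thread):

ncks -v yield_whe papsim.nc4 papsim_yield_whe.nc4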

I also sent a mail to the postgis-users mailing list to make sure that the multiple variables were really the issue, and not some other convention raster2pgsql expects from the netCDF that would make us run into problems again later.

What remains is the problem with the ID field. We can either drop it, i.e. use the dataset name as the primary key and make sure it's unique, or I can write a little hack to make raster2pgsql also add the ID into netcdf_meta. The latter requires me to modify the SQL script output by raster2pgsql, injecting the ID as an additional column, and then run that modified script against Postgres, which is cumbersome given that the gain is small, at least for now. So I suggest using dataset_name as the primary key. We can always add fields to the dataset_name to make it unique, e.g. aet_whe_modelName_simulationParamstimestamp... etc. (the timestamp would of course make it really unique). What do you think about that, @njmattes?

njmattes commented 8 years ago

Yes, I've been fiddling with geoalchemy2.types.Raster for a bit, and it's pretty ugly.

ghost commented 8 years ago

But now that I think of it, we still have the same, more serious problem: when we read a raster from PostGIS it comes back in this ugly binary format we can't parse yet (as you can see when you read a raster column from Postgres and then look at the data field of the RasterElement object; see also https://github.com/geoalchemy/geoalchemy2/issues/87). In particular, it's not a nice numpy array at that point :/

I'm looking into GeoAlchemy / PostGIS's ST_Value function as an option. @njmattes @ricardobarroslourenco let me know what other alternatives you find.
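
A rough sketch of what the ST_Value route could look like through psycopg2 (connection details, the band, and the point coordinates are just examples):

    import psycopg2

    conn = psycopg2.connect(dbname='ede')  # placeholder connection details
    cur = conn.cursor()
    # value of band 1 (the first time step) at a given lon/lat,
    # taken from whichever tile covers that point
    cur.execute("""
        SELECT ST_Value(rast, 1, ST_SetSRID(ST_MakePoint(%(lon)s, %(lat)s), 4326))
        FROM netcdf_data
        WHERE ST_Intersects(rast, ST_SetSRID(ST_MakePoint(%(lon)s, %(lat)s), 4326))
    """, {'lon': -179.75, 'lat': 89.75})
    print(cur.fetchall())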

njmattes commented 8 years ago

Perhaps: http://gis.stackexchange.com/questions/130139/downloading-raster-data-into-python-from-postgis-using-psycopg2

Related I think: http://postgis.net/docs/RT_FAQ.html#idp101075424

ricardobarroslourenco commented 8 years ago

geoalchemy2.types.Raster is a really weird class (at least in their documentation). It is still not clear to me how it is defined and how it works, because in the geometry type definitions it reads more like the description of a vector class.

On raster2pgsql: if it turns out to be stable it is OK for initial tests, but in the long run it is better to have a Python package do the job, both for exception control and because future versions of the tool could change its syntax. In the current ingestion module of ATLAS we are using the netCDF4 library. Even though the file is a binary format, the nice thing is that with the reference (lat, lon) grid provided in the nc4 file you can reassemble the entire netCDF, or slice it however you want. I suggest you take a look at this file (https://github.com/RDCEP/atlas-viewer/blob/develop/atlas/nc4_to_mongodb.py), especially here (where the nc4 objects are defined):

        self.nc_file = nc_file
        self.nc_dataset = Dataset(self.nc_file, 'r')   # netCDF4.Dataset handle
        self._lon_var = None
        self._lat_var = None
        self._time_var = None
        self._sim_context = None
        # the variable of interest plus its coordinate axes, all as numpy arrays
        self._vals = self.nc_dataset.variables[self.sim_context][:, :, :]
        self._lats = self.nc_dataset.variables[self.lat_var][:]
        self._lons = self.nc_dataset.variables[self.lon_var][:]
        self._tims = self.nc_dataset.variables[self.time_var][:]

and here (how we iterated over the dimensions to construct pointwise GeoJSON; you could instead produce chunks along each dimension):

    def ingest(self, sectors=1, sector=0):

        client = MongoClient(uri) if not MONGO['local'] \
            else MongoClient('localhost', MONGO['port'])
        db = client['atlas']
        points = db[MONGO['collection']]

        start_time = datetime.datetime.now()
        print('*** Start Run ***\n{}\n\n'.format(start_time))

        lonlats = itertools.product(enumerate(self.lats), enumerate(self.lons))
        lonlats = np.array_split(np.array([x for x in lonlats]), sectors)[sector]

        try:
            for (lat_idx, lat), (lon_idx, lon) in lonlats:
                new_values = list()
                all_null = True
                try:
                    for i in range(len(self.tims)):
                        xx = self.num_or_null(self.vals[i, lat_idx, lon_idx])
                        if xx is not None:
                            all_null = False
                        new_values.append(xx)

                    if all_null:
                        continue

                    tile = GenerateDocument(
                        lon, lat, self.sim_context, new_values,
                        self.pixel_side_length[0], self.pixel_side_length[1],
                        self.nc_file).as_dict
                    result = points.insert_one(tile)

                except:
                    print('Unexpected error:', sys.exc_info()[0])
                    raise
                # print '*** Inserted {} Points ***'.format(len(new_points))
                # print result.inserted_ids
                # print '*** End Points ***'
                tile = {}
                new_values[:] = []

With this, coupled with a defined database model, it is possible to ingest using SQLAlchemy, or even psycopg2, which could be more performant. One possibility I see is parallelizing the iteration over a given dimension, for example (but this would depend on the database structure you define).
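
For instance, the sectors/sector arguments of ingest() above already support that kind of split; the same np.array_split pattern in isolation looks like this (a toy, runnable illustration with a half-degree grid):

    import itertools
    import numpy as np

    lats = np.arange(89.75, -90, -0.5)     # toy half-degree grid, as in our files
    lons = np.arange(-179.75, 180, 0.5)

    sectors = 4
    lonlats = itertools.product(enumerate(lats), enumerate(lons))
    chunks = np.array_split(np.array(list(lonlats)), sectors)

    # each chunk could be handed to its own worker, i.e. a separate process
    # calling ingest(sectors=4, sector=i)
    for i, chunk in enumerate(chunks):
        print(i, len(chunk))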

njmattes commented 8 years ago

I've no experience with raster2pgsql. Is there some obvious reason why it gives me this error: ERROR: Unable to read raster file: ... when I try to ingest one of our nc4 files with ede/utils/ingest_netcdf.py? Running PostGIS 2.2, GDAL 1.11, and Postgres 9.5.

ghost commented 8 years ago

I got raster2pgsql working on @abrizius's files as well. Here's the command I used on the papsim.nc4 and papsim_wfdei.nc4 example files (in https://github.com/legendOfZelda/ede_test), which are some of our nc4 files; all of them are very similar, so the command below should work for your nc4 too:

raster2pgsql -s 4326 -a -M -F -n dataset_name -t 10x10 papsim.nc4 netcdf_data

When I then tried to run raster2pgsql on the additional files @abrizius gave me, I first got this error: https://lists.osgeo.org/pipermail/postgis-users/2016-February/041208.html (see the reply for the solution, which I tried and which worked). The problem is that these netCDFs have multiple SUBDATASETS, so we need to specify which SUBDATASET to use when running raster2pgsql.
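
Concretely, the fix amounts to handing raster2pgsql a GDAL subdataset name instead of the bare file, something along the lines of (untested sketch; the file and variable names are placeholders):

raster2pgsql -s 4326 -a -M -F -n dataset_name -t 10x10 NETCDF:"some_file.nc4":some_variable netcdf_data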

@ricardobarroslourenco at least for now I suggest we don't go that path, because we really want to work with PostGIS's raster type; needless to say, we then get quite a lot of functionality for free within PostGIS/Postgres: http://postgis.net/docs/manual-2.1/RT_reference.html (including spatial/temporal selection and aggregation). Also, I see no reason why raster2pgsql's syntax or behavior should change substantially in the near future.

@njmattes make sure raster2pgsql itself works first, i.e. without ingesting into Postgres yet (just generate a .sql script, since that is all raster2pgsql does by itself).

ghost commented 8 years ago

@ricardobarroslourenco to be precise: I want to use the SQLAlchemy Raster type only to create the tables (through SQLAlchemy's create_all), but not when reading from Postgres. That is, I don't want to put the raster returned from PostGIS into a SQLAlchemy Raster type, because we can't work with it then (the Raster type stores the raster coming from Postgres in a binary format, in the .data field, and we can't parse that). To get the raster into Python memory I suggest we use PostGIS's ST_PixelAsCentroids (which I can confirm works), directly through psycopg2 rather than SQLAlchemy; it also gives us the lat/lon values, which lets us construct the GeoJSON or do whatever else with it. As for getting the raster from Postgres directly into a file, I'm testing @njmattes's link: http://postgis.net/docs/RT_FAQ.html#idp101075424
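
A minimal sketch of that read path (the documented ST_PixelAsCentroids pattern, issued through psycopg2; connection details and the missing WHERE clause are placeholders):

    import psycopg2

    conn = psycopg2.connect(dbname='ede')  # placeholder connection details
    cur = conn.cursor()
    # one row per pixel of band 1: centroid lon/lat plus the pixel value
    cur.execute("""
        SELECT ST_X(geom) AS lon, ST_Y(geom) AS lat, val
        FROM (SELECT (ST_PixelAsCentroids(rast, 1)).* FROM netcdf_data) AS pix
    """)
    for lon, lat, val in cur:
        pass                               # build GeoJSON, fill a numpy array, etc.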

@ricardobarroslourenco the SQLAlchemy (or rather GeoAlchemy) Raster type is distinct from the Geometry type. GeoAlchemy provides good functionality for the Geometry type but essentially none for the Raster type; that's the difference. The only class field the Raster type really has is the data field. As you can find out, it is of type string and contains the raster in some PostGIS-internal binary format, which I think is the same as what you see in the first code block on http://postgis.net/docs/manual-2.1/RT_reference.html, and we obviously can't work with that format. That's the problem with the Raster type: they basically haven't implemented any ORM functionality for it yet.

@ricardobarroslourenco @njmattes also, a clarification. Q: what really is PostGIS's rast column type, i.e. what does raster2pgsql actually store? A: each rast field in the netcdf_data table corresponds to:

- a fixed netCDF,
- a fixed subdataset within that netCDF (you can see the subdatasets in a netCDF with gdalinfo my_netcdf, see the gdalinfo docs; in @abrizius's files the subdatasets are simply the variables, whereas the datasets on ATLAS have no subdatasets, just a single proper variable), and
- a fixed tile (raster2pgsql does the tiling over (lat, lon) via the -t flag, so it has some intelligence to spot the lat and lon dimensions),

and for that combination it contains the values of the variable/subdataset for all the bands (the bands are the time frames, i.e. band n = year n).

njmattes commented 8 years ago

@legendOfZelda I'm not able to get raster2pgsql to generate SQL from the papsim nc4 files on any of the three computers I've tried. Always ERROR: Unable to read raster file: .... Is there some nuance to how Postgres or PostGIS are installed or configured?

ghost commented 8 years ago

What does ncdump -c papsim.nc4 give? (I suspect it's some unrecognized-format issue, perhaps because GDAL was not compiled with HDF5 support.)

Also, what raster2pgsql command did you use, and is ERROR: Unable to read raster file: ... the complete error message?

njmattes commented 8 years ago

ncdump -c z.nc4

netcdf z {
dimensions:
    lon = 720 ;
    lat = 360 ;
    time = UNLIMITED ; // (34 currently)
variables:
    double lon(lon) ;
        lon:units = "degrees_east" ;
        lon:long_name = "longitude" ;
    double lat(lat) ;
        lat:units = "degrees_north" ;
        lat:long_name = "latitude" ;
    double time(time) ;
        time:units = "growing seasons since 1979-01-01 00:00:00" ;
        time:long_name = "time" ;
    float yield_whe(time, lat, lon) ;
        yield_whe:_FillValue = 1.e+20f ;
        yield_whe:units = "t ha-1 yr-1" ;
        yield_whe:long_name = "yield" ;
data:
 lon = -179.75, -179.25, ... 179.75 ;

 lat = 89.75, 89.25, ... -89.75 ;

 time = 1, 2, ... 34 ;
}

njmattes commented 8 years ago

raster2pgsql -s 4326 -a -M -F -n dataset_name -t 10x10 z.nc4 netcdf_data > z.sql

ERROR: Unable to read raster file: z.nc4

ghost commented 8 years ago

I'm using PostGIS 2.2.1, GDAL 1.10.1, and Postgres 9.4.5, and it works for me. However, I don't think it's a version issue; more likely PostGIS was somehow not installed properly. I'd try a clean reinstall of PostGIS. I think I installed it with:

sudo apt-get install postgresql-9.4-postgis-2.1 pgadmin3 postgresql-contrib-9.4

see: https://trac.osgeo.org/postgis/wiki/UsersWikiPostGIS21UbuntuPGSQL93Apt

njmattes commented 8 years ago

I reinstalled on this machine just this morning, GDAL with Postgres support as well. I think I'll give up on this and focus on something else. As soon as Chameleon un-fucks itself, we're going to have to reinstall and configure the entire architecture on that machine anyway, so sinking more time into another local deployment is probably a waste.

Although I have no way of testing anything, it does seem like we're working in a good direction here.

ghost commented 8 years ago

OK, yep. Just in case you feel like trying again, here's a thread I found that you may have already come across: https://lists.osgeo.org/pipermail//postgis-users/2012-November/035869.html (the person gets the same error message you describe).

Also, for testing I'm mostly working with my stripped-down version of EDE: https://github.com/legendOfZelda/ede_test (it has no Flask, Celery, HTML, or JS stuff, only SQLAlchemy); the setup instructions are in its README.

ghost commented 8 years ago

Having an integer ID instead of the netCDF filename string as a column in the netcdf_data table is no problem, and I've solved it in what is IMO the least error-prone way, as long as I make sure that ingest.py rolls back if raster2pgsql fails (that's going to be crucial, otherwise we ingest the metadata but not the actual data, which messes up the state of our database). The following example shows how we're going to add the ID field:

CREATE FUNCTION get_my_id()
    RETURNS bigint AS $$ SELECT last_value FROM netcdf_meta_dataset_id_seq $$
    LANGUAGE SQL;

CREATE TABLE rasts_test (
    rid serial,
    vid integer not null default get_my_id(),
    rast raster);

(rasts_test here will replace netcdf_data, and the vid field replaces the current dataset_name; vid actually stands for the ID of a specific variable within a specific netCDF, so it does not point to a row in netcdf_meta but rather to a separate table I will construct additionally that only holds the variables and their IDs.)

This way, whenever we ingest into netcdf_data the right ID gets added, so that we can join a raster field with its corresponding variable via that common ID (each raster field in netcdf_data belongs to exactly one specific variable within one specific netCDF).
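
For illustration, the join then looks roughly like this (the variables table and its columns are placeholders for the table that still has to be added):

    import psycopg2

    conn = psycopg2.connect(dbname='ede')  # placeholder connection details
    cur = conn.cursor()
    # every raster tile joined back to the variable it belongs to via the shared vid
    cur.execute("""
        SELECT v.name, r.rid
        FROM rasts_test r
        JOIN variables v ON v.id = r.vid   -- "variables(id, name)" is hypothetical
    """)
    print(cur.fetchall())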

ghost commented 8 years ago

@ricardobarroslourenco even if we use raster2pgsql we can still do proper exception handling, namely as follows:

try:
    # ingest the metadata of the netcdf into postgres
    # run raster2pgsql using python's subprocess module (***)
    session.commit()
except SQLAlchemyError:
    print("SQLAlchemy error")
except CalledProcessError as e:
    print(e.cmd, e.returncode, e.output)
finally:
    session.rollback()  # rolls back the metadata ingestion if anything above failed

The next piece of code fleshes out the use of Python's subprocess module (see the line marked (***) above). It gives us maximal information about which process failed, i.e. whether it was raster2pgsql or psql when we run the combined pipe, and it also gives us the return code, stdout, and stderr of both commands, which then allows fine-grained exception handling along the lines of the code above:

from subprocess import Popen, PIPE, CalledProcessError

cmd1 = ["ls", "-#"]        # placeholder commands; "-#" is an invalid flag, so cmd1 fails
cmd2 = ["head", "-n", "3"]
p1 = Popen(cmd1, stdout=PIPE, stderr=PIPE)
p2 = Popen(cmd2, stdin=p1.stdout, stdout=PIPE, stderr=PIPE)
p2_out = p2.communicate()[0]   # wait for the downstream command first
p1_out = p1.communicate()[0]   # whatever of p1's stdout was not consumed by p2
if p1.returncode != 0:
    raise CalledProcessError(returncode=p1.returncode, cmd=' '.join(cmd1), output=p1_out)
if p2.returncode != 0:
    raise CalledProcessError(returncode=p2.returncode, cmd=' '.join(cmd2), output=p2_out)

It really pays off to have ingest.py do the error handling and rollback properly, because otherwise, as already mentioned (but I'll stress it again), we can easily end up with the metadata ingestion having succeeded but not the ingestion of the actual data, leaving our database in a badly inconsistent state.
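
For our case, the placeholder commands in the snippet above would be swapped for the actual pipeline, roughly like this (the database name is a placeholder):

    # same two-process pattern as above, but with raster2pgsql piped into psql
    cmd1 = ["raster2pgsql", "-s", "4326", "-a", "-M", "-F", "-n", "dataset_name",
            "-t", "10x10", "papsim.nc4", "netcdf_data"]
    cmd2 = ["psql", "-d", "ede"]           # "ede" is a placeholder database name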

ricardobarroslourenco commented 8 years ago

@legendOfZelda I'm looking into the PostGIS raster functions in a local environment to see whether we could use them. There is a way to define an empty raster and then ingest several layers (more precisely, bands) into it, quite similar to how GDAL does it: ST_MakeEmptyRaster followed by ST_AddBand for each layer.
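
As a rough illustration of that pattern (untested sketch; the 720x360 half-degree grid comes from the ncdump output above, and the connection details are placeholders):

    import psycopg2

    conn = psycopg2.connect(dbname='ede')  # placeholder connection details
    cur = conn.cursor()
    # empty 720x360 half-degree global raster, upper-left corner at (-180, 90),
    # with one float32 band added (repeat ST_AddBand once per time step / layer)
    cur.execute("""
        SELECT ST_AddBand(
            ST_MakeEmptyRaster(720, 360, -180, 90, 0.5, -0.5, 0, 0, 4326),
            '32BF'::text, 0)
    """)
    print(cur.fetchone())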

Using raster2pgsql is documented, but we would be calling it via a subprocess, which opens the possibility of a shell-injection exploit; that is not nice, especially if this ingestion could be triggered by a client over the network.

ghost commented 8 years ago

Yes, for temporal aggregation we would use ST_MakeEmptyRaster / ST_AddBand and, most importantly, ST_MapAlgebra (the callback-function version). Temporal aggregation is a little cumbersome to implement within PostGIS, but all the other operations we need are quite straightforward, see #9.

I'm not sure there's actually a shell-injection vulnerability: the shell argument of Popen is False by default, so the raster2pgsql command is run directly and not through the shell.