bkotyra / watershed_delineation_gpu

6 stars 2 forks source link

Testing and Validation Data #1

Open dankovacek opened 5 months ago

dankovacek commented 5 months ago

I'm curious if it's possible to test some sub-sample of the original data to try to replicating the results?

I have processed a small set of outlets on a (hydraulically conditioned) flow direction raster processed using Whitebox. I ran the single buffer (GPU) method on the dataset and am not able to validate results.

First, an offering -- I added a function to apply the affine transform from the original raster:

// GeoreferencingUtils.cpp
#include "GeoreferencingUtils.h"
#include <gdal_priv.h>
#include <iostream>

// Function to read georeferencing information from the input DEM
void readGeoreferencingInfo(const std::string& filename, double* geotransform, std::string& projection) {
    GDALDataset* dataset = (GDALDataset*)GDALOpen(filename.c_str(), GA_ReadOnly);
    if (dataset) {
        dataset->GetGeoTransform(geotransform);
        projection = dataset->GetProjectionRef();
        std::cout << "Read geotransform: ";
        for (int i = 0; i < 6; ++i) {
            std::cout << geotransform[i] << " ";
        }
        std::cout << std::endl;
        std::cout << "Read projection: " << projection << std::endl;
        GDALClose(dataset);
    } else {
        std::cerr << "Failed to open the input DEM: " << filename << std::endl;
    }
}

// Function to write georeferencing information to the output file
void writeGeoreferencingInfo(const std::string& filename, double* geotransform, const std::string& projection) {
    GDALDataset* dataset = (GDALDataset*)GDALOpen(filename.c_str(), GA_Update);
    if (dataset) {
        dataset->SetGeoTransform(geotransform);
        dataset->SetProjection(projection.c_str());
        std::cout << "Written geotransform: ";
        for (int i = 0; i < 6; ++i) {
            std::cout << geotransform[i] << " ";
        }
        std::cout << std::endl;
        std::cout << "Written projection: " << projection << std::endl;
        GDALClose(dataset);
    } else {
        std::cerr << "Failed to open the output file: " << filename << std::endl;
    }
}

I call this after the saveGDAL function at the end of main.cpp:

        std::cout << "saving results (" << resultsFilename << ")..." << std::endl;
        BasinIndexLoader::saveGdal(uniqueFilename, basinMatrix);

        // Write georeferencing information to the output file
        writeGeoreferencingInfo(uniqueFilename, geotransform, projection);

I would be happy to create a pull request if this is something you feel is worth incorporating.

I tested the code on a flow direction raster processed (and hydraulically conditioned) using Whiteboxtools, noting and adjusting for the different flow direction convention (integer direction representation is rotated one position clockwise).

In creating the outlet.txt file, I also adjusted the outlet positions to reflect 1-index as noted in your (helpful!) comments.

I tested a small set of points (green dots are outlets:

image

Looking closer at the largest sub-basin, the blue polygon was derived independently using WhiteboxTools. The black region underneath is the result of the single buffer method:

image

This is close enough to make me think I've made an error in indexing the outlet cell correctly (missing the small region at the top). Looking at another point shows the basin is missed:

image

Again suggesting the outlet point is slightly off. If I don't correct for the d8 pointer direction convention or the 1 vs. 0 indexing, the result is (slightly) different but similarly not capturing the basin:

image

If I change the methodology to 1 -- recursive / sequential (all else constant):

image

Methods 1, 3, and 5 yielded nearly identical results.

image

Method 2 ran for several minutes and I killed the process.

Any thoughts on where I should look to find my issue?

Thanks for posting this code -- I enjoyed your paper and had fun getting this up and running.

bkotyra commented 4 months ago

Hi Dan,

There are a few different topics here - I'll try to answer them separately.


Regarding test and validation data:

I prepared a small sample dataset for the measurement application and added it to the repository. It's purely illustrative (5x5 cells with two marked outlets), but I think it can be helpful in preparing data for the application and verifying that everything works as expected.

Of course I can provide you with the datasets I used for the measurements presented in the paper. But I'd rather not put them in this repository, as the files are quite large.


Regarding georeferencing info:

You're right, the application ignores georeferencing information from the input file. From the beginning, this project was focused solely on algorithmic efficiency - the source code provided in this repository is essentially designed to measure and compare the performance of different algorithms. While maintaining georeferencing would naturally be useful in actual hydrological analysis, it simply didn't seem relevant to this research.

So while I agree this change could be useful for some applications, I'd prefer to keep the code here exactly as it was when I used it to prepare the results for the publication.


Regarding issues with Whitebox data:

Unfortunately, without analyzing the dataset you used, I can only guess what the problem might be. However, I have a few comments that I hope will be helpful.

My implementations use a single-flow direction format similar to that described in the ArcGIS documentation. As you noted, this convention is rotated by one position (relative to the Whitebox format). I don't really know how Whitebox calculates flow direction (I don't have much experience with it), but I've had some surprises with exporting data from other tools in the past. I'd start searching for the problem by manually looking at the flow direction data around the locations where the discrepancies appear - especially since some of the cases you described seem to involve only a small number of cells.

In general, all five algorithms from the measurement application should produce exactly the same output as each other (for the same input data). If there are any discrepancies between the output rasters produced by different algorithms, it means that some fundamental assumption has been broken. For example, given incorrect input data (e.g. with looping flow paths), particular algorithms may "fail" in different ways.