VForWaTer / tool_vforwater_loader

Tool specification compliant tool to load data from the V-FOR-WaTer database
GNU General Public License v3.0

Test run on server #5

Closed: MarcusStrobl closed this issue 5 months ago

MarcusStrobl commented 7 months ago

I can start the process and some output is written, but it seems something went wrong with os.path within the process. Is this a known issue? Or am I maybe using an outdated version of tool_vforwater_loader? Container logs:

100%|██████████| 1/1 [00:00<00:00,  4.30it/s]
[INFO]: Starting to create a consistent DuckDB dataset at /out/dataset.db. Check out https://duckdb.org/docs/api/overview to learn more about DuckDB.
  0%|          | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/src/run.py", line 135, in <module>
    path = ingestor.load_files(file_mapping=file_mapping)
  File "/src/ingestor.py", line 174, in load_files
    data_path = Path(mapping['data_path'])
  File "/usr/local/lib/python3.10/pathlib.py", line 960, in __new__
    self = cls._from_parts(args)
  File "/usr/local/lib/python3.10/pathlib.py", line 594, in _from_parts
    drv, root, parts = self._parse_args(args)
  File "/usr/local/lib/python3.10/pathlib.py", line 578, in _parse_args
    a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not NoneType
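
For context: pathlib raises exactly this error whenever it is handed None, so the failing line most likely received a file mapping whose data_path was never set. A two-line reproduction:

from pathlib import Path

# Path(None) reproduces the exact error from the container logs:
# TypeError: expected str, bytes or os.PathLike object, not NoneType
Path(None)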

My Input is:

{
    "vforwater_loader": {
        "parameters": {
            "dataset_ids": [
                1100
            ],
            "start_date": "2015-01-01T00:00",
            "end_date": "2024-01-01T01:00",
            "reference_area": {
                "type": "Feature",
                "geometry": {
                    "type": "Polygon",
                    "coordinates": [[[7.705203808652068,49.372078876590024],
                            [8.664239222099754,49.372078876590024],
                            [8.664239222099754,49.94075365534633],
                            [7.705203808652068,49.94075365534633],
                            [7.705203808652068,49.372078876590024]]]
                },
                "properties": {
                    "name": "Map drawing",
                    "orgID": "selectArea716047"
                }
            }
        }
    }
}

Output is:

-rw-r--r-- 1 root root   0 Apr  8 09:33 errors.log
-rw-r--r-- 1 root root 940 Apr  8 09:33 processing.log
-rw-r--r-- 1 root root 183 Apr  8 09:33 reference_area.ascii
-rw-r--r-- 1 root root 414 Apr  8 09:33 reference_area.geojson

Content of processing.log:

This is the V-FOR-WaTer data loader report

The following information has been submitted to the tool:

START DATE:         2015-01-01 00:00:00
END DATE:           2024-01-01 01:00:00
REFERENCE AREA:     True
INTEGRATION:        all
KEEP DATA FILES:    True

DATASET IDS:
1100

DATABASE CONNECTION: True
DATABASE URI:        Engine(postgresql://postgres@localhost:5432/metacatalog-dev)

AGGREGATION SETTINGS
--------------------
PRECISION:          day
RESOLUTION:         5000x5000
TARGET CRS:         EPSG:3857

Processing logs:
----------------
[2024-04-08 07:33:39,884] - [DEBUG] - [START ThreadPoolExecutor - Pool to load and clip data source files.]
[2024-04-08 07:33:40,120] - [DEBUG] - [STOP ThreadPoolExecutor - Pool finished all tasks and shutdown.]
[2024-04-08 07:33:40,121] - [INFO] - [Starting to create a consistent DuckDB dataset at /out/dataset.db. Check out https://duckdb.org/docs/api/overview to learn more about DuckDB.]
mmaelicke commented 7 months ago

This is an unintuitive error message that in most cases occurs when there is nothing to aggregate into the internal duckdb database. The output should include a folder datasets; if that is not there, the clip-and-select step was not able to collect any data. In the logs, the thread pool starts and directly shuts down again, which supports this. Can you confirm that the dataset with id 1100 covers the requested area and intersects with the time span?
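
A quick way to check the spatial part would be something like this (a sketch with shapely; the bounding box for dataset 1100 below is a placeholder, not its real extent):

from shapely.geometry import shape, box

# the reference area exactly as submitted to the tool
reference_area = shape({
    "type": "Polygon",
    "coordinates": [[[7.705203808652068, 49.372078876590024],
                     [8.664239222099754, 49.372078876590024],
                     [8.664239222099754, 49.94075365534633],
                     [7.705203808652068, 49.94075365534633],
                     [7.705203808652068, 49.372078876590024]]]
})

# placeholder extent (minx, miny, maxx, maxy) - substitute the real dataset bounds
dataset_bbox = box(5.0, 47.0, 16.0, 56.0)

print(reference_area.intersects(dataset_bbox))  # True if the area overlaps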

MarcusStrobl commented 7 months ago

The datasets folder is not created. The dataset with id 1100 is the DEM, and the selected area is within the bounding box of the DEM. According to the abstract of the DEM, I expect the time span intersects as well, though I did not pay much attention to it, as I did not expect time to be relevant for a DEM.

I tried it again with dataset id 1102 (precipitation), with the selected area within the bounding box of dataset 1102 and the selected time span 15.12.2020 - 20.12.2020 (according to our db, data is available from 1931 to 2023).

I think we missed the mount to the data folder. That is added now, but still the same error message.
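
To rule the mount out next time, a quick sanity check inside the container could look like this (a sketch; that the data volume is mounted at /data is our assumption):

import os

data_mount = '/data'  # assumed mount point of the data volume
# a missing or empty mount would explain why no files are found
if not os.path.isdir(data_mount) or not os.listdir(data_mount):
    print(f'{data_mount} is missing or empty')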

Could this be a problem with the database access?

mmaelicke commented 7 months ago

The data loader does not yet support tifs; I have not managed to implement this so far. Sorry. I would have expected a NotImplementedError to show up in the container logs, though.

For id 1102, I would expect the datasets folder to be created, as that data source is read from an internal database table. However, merging currently does not work for the exported parquet files, as reported in #2. I still plan to fix that bug after I finish my poster for EGU...

MarcusStrobl commented 7 months ago

We made some progress, but now we are stuck again. The datasets folder is still not created. Is the following output helpful for giving us a hint on how to continue?

Container logs:

100%|██████████| 1/1 [00:00<00:00, 11.91it/s]
[INFO]: Starting to create a consistent DuckDB dataset at /out/dataset.duckdb. Check out https://duckdb.org/docs/api/overview to learn more about DuckDB.
[INFO]: Joerg's modified version: ingestor.load_files
[INFO]: params: dataset_ids=[1102] start_date=datetime.datetime(2007, 1, 1, 0, 0) end_date=datetime.datetime(2015, 1, 1, 0, 0) integration=<Integrations.ALL: 'all'> keep_data_files=True database_name='dataset.duckdb' precision='day' resolution=5000 cell_touches=True base_path='/out' netcdf_backend=<NetCDFBackends.XARRAY: 'xarray'>
[INFO]: file_mapping: [{'entry': <metacatalog.models.entry.Entry object at 0x7f8da0d186a0>, 'data_path': None}]
 0%|          | 0/1 [00:00<?, ?it/s]
[INFO]: progress bar - tqdm(file_mapping):   0%|          | 0/1 [00:00<?, ?it/s]
 0%|          | 0/1 [00:00<?, ?it/s][INFO]: in loop: mapping: {'entry': <metacatalog.models.entry.Entry object at 0x7f8da0d186a0>, 'data_path': None}
[INFO]: entry: <ID=1102 HYRAS-DE-PRE - Raste [precipitation] >
 0%|          | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
 File "/src/run.py", line 135, in <module>
   path = ingestor.load_files(file_mapping=file_mapping)
 File "/src/ingestor.py", line 184, in load_files
   data_path = Path(mapping['data_path'])
 File "/usr/local/lib/python3.10/pathlib.py", line 960, in __new__
   self = cls._from_parts(args)
 File "/usr/local/lib/python3.10/pathlib.py", line 594, in _from_parts
   drv, root, parts = self._parse_args(args)
 File "/usr/local/lib/python3.10/pathlib.py", line 578, in _parse_args
   a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not NoneType

In processing.log we now have a little more information:

Processing logs:
----------------
[2024-04-12 13:20:27,729] - [DEBUG] - [START ThreadPoolExecutor - Pool to load and clip data source files.]
[2024-04-12 13:20:27,817] - [DEBUG] - [STOP ThreadPoolExecutor - Pool finished all tasks and shutdown.]
[2024-04-12 13:20:27,817] - [INFO] - [Starting to create a consistent DuckDB dataset at /out/dataset.duckdb. Check out https://duckdb.org/docs/api/overview to learn more about DuckDB.]
[2024-04-12 13:20:27,817] - [INFO] - [Joerg's modified version: ingestor.load_files]
[2024-04-12 13:20:27,817] - [INFO] - [params: dataset_ids=[1102] start_date=datetime.datetime(2007, 1, 1, 0, 0) end_date=datetime.datetime(2015, 1, 1, 0, 0) integration=<Integrations.ALL: 'all'> keep_data_files=True database_name='dataset.duckdb' precision='day' resolution=5000 cell_touches=True base_path='/out' netcdf_backend=<NetCDFBackends.XARRAY: 'xarray'>]
[2024-04-12 13:20:27,817] - [INFO] - [file_mapping: [{'entry': <metacatalog.models.entry.Entry object at 0x7f8da0d186a0>, 'data_path': None}]]
[2024-04-12 13:20:27,817] - [INFO] - [progress bar - tqdm(file_mapping):   0%|          | 0/1 [00:00<?, ?it/s]]
[2024-04-12 13:20:27,818] - [INFO] - [in loop: mapping: {'entry': <metacatalog.models.entry.Entry object at 0x7f8da0d186a0>, 'data_path': None}]
[2024-04-12 13:20:27,820] - [INFO] - [entry: <ID=1102 HYRAS-DE-PRE - Raste [precipitation] >]

We would now expect the problem to be related to 'data_path': None, but we are not sure where this value comes from.
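
If it helps, a defensive guard in load_files would at least make the failure explicit (a sketch; the skip-and-warn behaviour is our assumption, not necessarily the intended logic):

import logging
from pathlib import Path

logger = logging.getLogger(__name__)

def load_files(file_mapping: list) -> None:
    for mapping in file_mapping:
        # a mapping arriving with data_path=None is what currently crashes
        # Path(); skip it with a warning instead of raising a TypeError
        if mapping.get('data_path') is None:
            logger.warning('No data_path resolved for entry %s - skipping', mapping.get('entry'))
            continue
        data_path = Path(mapping['data_path'])
        # ... continue with the actual file ingestion for data_path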

MarcusStrobl commented 7 months ago

More progress, but new errors. Now we have an output folder that looks like this:

-rw-r--r-- 1 root root 29542330368 Apr 16 17:40 dataset.duckdb
drwxr-xr-x 2 root root           6 Apr 16 17:40 datasets
-rw-r--r-- 1 root root         777 Apr 16 17:27 errors.log
-rw-r--r-- 1 root root       41257 Apr 16 17:40 processing.log
-rw-r--r-- 1 root root       71982 Apr 16 17:26 reference_area.ascii
-rw-r--r-- 1 root root       96230 Apr 16 17:26 reference_area.geojson

However, the datasets folder is empty. Note the huge dataset.duckdb (about 29.5 GB).

processing.log:

Version: jm_demo 0.3

The following information has been submitted to the tool:

START DATE:         1950-01-01 12:00:00+01:00
END DATE:           2020-12-31 12:00:00+01:00
REFERENCE AREA:     True
INTEGRATION:        all
KEEP DATA FILES:    True

DATASET IDS:
1106

DATABASE CONNECTION: True
DATABASE URI:        Engine(postgresql://postgres@10.88.0.1:5432/metacatalog-dev)

AGGREGATION SETTINGS
--------------------
PRECISION:          day
RESOLUTION:         5000x5000
TARGET CRS:         EPSG:3857

Processing logs:
----------------
[2024-04-16 15:26:56,787] - [DEBUG] - [START ThreadPoolExecutor - Pool to load and clip data source files.]
[2024-04-16 15:26:56,886] - [WARNING] - [data_path set to /data/qt7760/hyras/TemperatureMean]
[2024-04-16 15:26:56,886] - [DEBUG] - [STOP ThreadPoolExecutor - Pool finished all tasks and shutdown.]
[2024-04-16 15:26:56,886] - [INFO] - [Starting to create a consistent DuckDB dataset at /out/dataset.duckdb. Check out https://duckdb.org/docs/api/overview to learn more about DuckDB.]
[2024-04-16 15:26:56,886] - [INFO] - [Joerg's modified version: ingestor.load_files]
[2024-04-16 15:26:56,886] - [INFO] - [params: dataset_ids=[1106] start_date=datetime.datetime(1950, 1, 1, 12, 0, tzinfo=tzoffset(None, 3600)) end_date=datetime.datetime(2020, 12, 31, 12, 0, tzinfo=tzoffset(None, 3600)) integration=<Integrations.ALL: 'all'> keep_data_files=True database_name='dataset.duckdb' precision='day' resolution=5000 cell_touches=True base_path='/out' netcdf_backend=<NetCDFBackends.XARRAY: 'xarray'>]
[2024-04-16 15:26:56,886] - [INFO] - [file_mapping: [{'entry': <metacatalog.models.entry.Entry object at 0x7fa80f7b4610>, 'data_path': '/data/qt7760/hyras/TemperatureMean'}]]
[2024-04-16 15:26:56,887] - [INFO] - [progress bar - tqdm(file_mapping):   0%|          | 0/1 [00:00<?, ?it/s]]
[2024-04-16 15:26:56,887] - [INFO] - [in loop: mapping: {'entry': <metacatalog.models.entry.Entry object at 0x7fa80f7b4610>, 'data_path': '/data/qt7760/hyras/TemperatureMean'}]
[2024-04-16 15:26:56,890] - [INFO] - [entry: <ID=1106 HYRAS-DE-TAS - Raste [air temperature] >]
[2024-04-16 15:26:56,890] - [DEBUG] - [data_path: /data/qt7760/hyras/TemperatureMean]
[2024-04-16 15:26:56,967] - [INFO] - [Loading preprocessed source <ID=1106> to duckdb database </out/dataset.duckdb> for data integration...]
[2024-04-16 15:26:59,125] - [INFO] - [duckdb /out/dataset.duckdb -c "CREATE TABLE air_temperature_1106 ( time TIMESTAMP,  lon DOUBLE, lat DOUBLE,  tas DOUBLE);"]
[2024-04-16 15:27:15,087] - [INFO] - [python - dfs = [data.to_dask_dataframe()[['time', 'lon', 'lat', 'tas']].partitions[i].compute() for i in range(1)]]
[2024-04-16 15:27:15,149] - [ERROR] - [ERRORED on loading file </data/qt7760/hyras/TemperatureMean/tas_hyras_5_1993_v5-0_de.nc>]
Traceback (most recent call last):
  File "/src/ingestor.py", line 198, in load_files
    table_name = _switch_source_loader(entry, fname)
  File "/src/ingestor.py", line 244, in _switch_source_loader
    return load_xarray_to_duckdb(entry, ds)
  File "/src/ingestor.py", line 288, in load_xarray_to_duckdb
    db.execute(sql)
duckdb.InvalidInputException: Invalid Input Error: Required module 'pandas.core.arrays.arrow.dtype' failed to import, due to the following Python exception:
ModuleNotFoundError: No module named 'pandas.core.arrays.arrow.dtype'
[2024-04-16 15:27:15,189] - [INFO] - [Loading preprocessed source <ID=1106> to duckdb database </out/dataset.duckdb> for data integration...]
[2024-04-16 15:27:19,789] - [INFO] - [python - dfs = [data.to_dask_dataframe()[['time', 'lon', 'lat', 'tas']].partitions[i].compute() for i in range(1)]]
[2024-04-16 15:27:25,047] - [INFO] - [duckdb - FOREACH df in dfs - INSERT INTO air_temperature_1106 SELECT   time as time , lon AS lon, lat AS lat, tas  FROM df;]
[2024-04-16 15:27:25,535] - [INFO] - [took 10.35 seconds]
[2024-04-16 15:27:25,574] - [INFO] - [Loading preprocessed source <ID=1106> to duckdb database </out/dataset.duckdb> for data integration...]
[2024-04-16 15:27:30,153] - [INFO] - [python - dfs = [data.to_dask_dataframe()[['time', 'lon', 'lat', 'tas']].partitions[i].compute() for i in range(1)]]
[2024-04-16 15:27:35,994] - [INFO] - [duckdb - FOREACH df in dfs - INSERT INTO air_temperature_1106 SELECT   time as time , lon AS lon, lat AS lat, tas  FROM df;]
[2024-04-16 15:27:36,487] - [INFO] - [took 10.91 seconds]
[2024-04-16 15:27:36,523] - [INFO] - [Loading preprocessed source <ID=1106> to duckdb database </out/dataset.duckdb> for data integration...]
[2024-04-16 15:27:42,683] - [INFO] - [python - dfs = [data.to_dask_dataframe()[['time', 'lon', 'lat', 'tas']].partitions[i].com
...
<this goes on for a while>
...
[2024-04-16 15:40:12,292] - [INFO] - [Loading preprocessed source <ID=1106> to duckdb database </out/dataset.duckdb> for data integration...]
[2024-04-16 15:40:17,609] - [INFO] - [python - dfs = [data.to_dask_dataframe()[['time', 'lon', 'lat', 'tas']].partitions[i].compute() for i in range(1)]]
[2024-04-16 15:40:24,778] - [INFO] - [duckdb - FOREACH df in dfs - INSERT INTO air_temperature_1106 SELECT   time as time , lon AS lon, lat AS lat, tas  FROM df;]
[2024-04-16 15:40:25,382] - [INFO] - [took 13.09 seconds]
[2024-04-16 15:40:25,421] - [INFO] - [Loading preprocessed source <ID=1106> to duckdb database </out/dataset.duckdb> for data integration...]
[2024-04-16 15:40:31,257] - [INFO] - [python - dfs = [data.to_dask_dataframe()[['time', 'lon', 'lat', 'tas']].partitions[i].compute() for i in range(1)]]
[2024-04-16 15:40:38,442] - [INFO] - [duckdb - FOREACH df in dfs - INSERT INTO air_temperature_1106 SELECT   time as time , lon AS lon, lat AS lat, tas  FROM df;]
[2024-04-16 15:40:39,119] - [INFO] - [took 13.70 seconds]
[2024-04-16 15:40:39,125] - [INFO] - [duckdb - CREATE MACRO air_temperature_1106_temporal_aggregate(precision) AS TABLE SELECT date_trunc(precision, time) AS time, AVG(tas) AS mean, STDDEV(tas) AS std, KURTOSIS(tas) AS kurtosis, SKEWNESS(tas) AS skewness, MEDIAN(tas) AS median, MIN(tas) AS min, MAX(tas) AS max, SUM(tas) AS sum, COUNT(tas) AS count, quantile_disc(tas, 0.25) as quartile_25, quantile_disc(tas, 0.75) as quartile_75, entropy(tas) as entropy, histogram(tas) as histogram FROM air_temperature_1106 GROUP BY date_trunc(precision, time);]
[2024-04-16 15:40:40,164] - [INFO] - [duckdb - CREATE MACRO air_temperature_1106_spatial_aggregate(resolution) AS TABLE WITH t as (SELECT  ST_Transform(ST_Point(lon, lat), 'epsg:4326', 'epsg:3857') as geom, tas FROM air_temperature_1106) SELECT ROUND(ST_Y(geom) / resolution)::int * resolution AS y, ROUND(ST_X(geom) / resolution)::int * resolution AS x, AVG(tas) AS mean, STDDEV(tas) AS std, KURTOSIS(tas) AS kurtosis, SKEWNESS(tas) AS skewness, MEDIAN(tas) AS median, MIN(tas) AS min, MAX(tas) AS max, SUM(tas) AS sum, COUNT(tas) AS count, quantile_disc(tas, 0.25) as quartile_25, quantile_disc(tas, 0.75) as quartile_75, entropy(tas) as entropy, histogram(tas) as histogram FROM t GROUP BY x, y;]
[2024-04-16 15:40:41,230] - [INFO] - [duckdb - CREATE MACRO air_temperature_1106_spatiotemporal_aggregate(resolution, precision) AS TABLE WITH t as (SELECT date_trunc(precision, time) AS time,  ST_Transform(ST_Point(lon, lat), 'epsg:4326', 'epsg:3857') as geom, tas FROM air_temperature_1106) SELECT time, ROUND(ST_Y(geom) / resolution)::int * resolution AS y, ROUND(ST_X(geom) / resolution)::int * resolution AS x, AVG(tas) AS mean, STDDEV(tas) AS std, KURTOSIS(tas) AS kurtosis, SKEWNESS(tas) AS skewness, MEDIAN(tas) AS median, MIN(tas) AS min, MAX(tas) AS max, SUM(tas) AS sum, COUNT(tas) AS count, quantile_disc(tas, 0.25) as quartile_25, quantile_disc(tas, 0.75) as quartile_75, entropy(tas) as entropy, histogram(tas) as histogram FROM t GROUP BY time, x, y;]
[2024-04-16 15:40:42,350] - [DEBUG] - [calling load_metadata_to_duckdb()...]
[2024-04-16 15:40:44,754] - [DEBUG] - [Database /out/dataset.duckdb does not contain a table 'metadata'. Creating it now...]
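
(For reference, the macros created above can be queried like tables once the database is complete; a minimal sketch using the duckdb Python API:)

import duckdb

con = duckdb.connect('/out/dataset.duckdb', read_only=True)
# table macros created by the ingestor behave like parameterised views
daily = con.execute(
    "SELECT * FROM air_temperature_1106_temporal_aggregate('day') ORDER BY time"
).fetchdf()
print(daily.head())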

errors.log:

[2024-04-16 15:26:56,886] - [WARNING] - [data_path set to /data/qt7760/hyras/TemperatureMean]
[2024-04-16 15:27:15,149] - [ERROR] - [ERRORED on loading file </data/qt7760/hyras/TemperatureMean/tas_hyras_5_1993_v5-0_de.nc>]
Traceback (most recent call last):
  File "/src/ingestor.py", line 198, in load_files
    table_name = _switch_source_loader(entry, fname)
  File "/src/ingestor.py", line 244, in _switch_source_loader
    return load_xarray_to_duckdb(entry, ds)
  File "/src/ingestor.py", line 288, in load_xarray_to_duckdb
    db.execute(sql)
duckdb.InvalidInputException: Invalid Input Error: Required module 'pandas.core.arrays.arrow.dtype' failed to import, due to the following Python exception:
ModuleNotFoundError: No module named 'pandas.core.arrays.arrow.dtype'
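
From searching around, this import error seems to point at a duckdb/pandas version mismatch: pandas moved its internal arrow dtype module in the 2.0 release, and older duckdb versions still import it from the old location. We are not sure whether that applies here; a quick check of the installed versions (sketch):

import duckdb
import pandas

# if pandas >= 2.0 is combined with a duckdb release that predates the fix
# for the moved ArrowDtype module, upgrading duckdb (or pinning pandas < 2)
# should resolve the import error
print('duckdb:', duckdb.__version__)
print('pandas:', pandas.__version__)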

Does anyone have an idea how to solve this?