Closed: MarcusStrobl closed this issue 5 months ago.
This is an unintuitive error message that in most cases happens when there is nothing to aggregate into the internal DuckDB database. The output should include a datasets folder. If that is not there, the clip and select step was not able to collect any data.
In the logs, the thread pool starts and directly shuts down again, which would support this. Can you confirm that the dataset with id 1100 covers the requested area and intersects with the requested time span?
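If it helps, here is a minimal sketch of how you could double-check that on your side (using shapely as an example library; the bounding box and time span values below are placeholders, not the actual extents from the catalog):

```python
from datetime import datetime
from shapely.geometry import box

# Placeholder values -- replace with the extent reported for dataset 1100
# and with the area / time span you requested in the tool.
dataset_bbox = box(5.0, 47.0, 15.0, 55.0)     # dataset extent as lon/lat bounding box
requested_bbox = box(8.0, 48.5, 9.0, 49.5)    # requested clip area

dataset_span = (datetime(1950, 1, 1), datetime(2023, 12, 31))
requested_span = (datetime(2020, 12, 15), datetime(2020, 12, 20))

spatial_overlap = dataset_bbox.intersects(requested_bbox)
temporal_overlap = (requested_span[0] <= dataset_span[1]
                    and requested_span[1] >= dataset_span[0])

print(f"spatial overlap: {spatial_overlap}, temporal overlap: {temporal_overlap}")
```

If either check fails, the clip and select step will not find anything to aggregate.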
The datasets folder is not created.
Dataset with id 1100 is the DEM. The selected area was within the bounding box of the DEM. According to the abstract of the DEM, I expect the time span intersects as well, though I did not pay much attention to it, as I did not expect the time to be relevant for the DEM.
I tried it again with dataset id 1102 (precipitation), with the selected area within the bounding box of dataset 1102 and the selected time span 15.12.2020 - 20.12.2020. (According to our db, data is available from 1931 to 2023.)
I think we had missed the mount of the data folder. That is added now, but we still get the same error message.
Could this be a problem with the database access?
The data loader does not yet support tifs. I have not managed to implement that so far, sorry. I would have expected the NotImplementedError to show up in the container logs, though.
For id 1102, I would expect the datasets folder to be created, as that data source is read from an internal database table. However, merging currently does not work for the exported parquet files, as reported in #2. I still plan to fix that bug after I have finished my poster for EGU...
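To illustrate why I would expect the NotImplementedError in the logs: the loader decides per file suffix which reader to use, roughly like the sketch below (simplified, with a hypothetical helper name, not the exact code of _switch_source_loader in ingestor.py). Anything without a reader, such as a tif, raises there:

```python
from pathlib import Path

def pick_loader(fname: str) -> str:
    """Sketch of the suffix-based dispatch (not the actual _switch_source_loader)."""
    suffix = Path(fname).suffix.lower()
    if suffix in (".nc", ".nc4"):
        return "xarray"      # netCDF files go to the xarray/duckdb loader
    if suffix == ".parquet":
        return "parquet"     # exported parquet files; merging is currently broken, see #2
    # raster formats like tif have no loader yet
    raise NotImplementedError(f"No loader implemented for '{suffix}' files")

print(pick_loader("tas_hyras_5_1993_v5-0_de.nc"))  # -> xarray
```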
We made some progress, but now we are stuck again. The datasets folder is still not created. Is the following helpful for giving us a hint on how to continue?
Container logs:
100%|██████████| 1/1 [00:00<00:00, 11.91it/s]
[INFO]: Starting to create a consistent DuckDB dataset at /out/dataset.duckdb. Check out https://duckdb.org/docs/api/overview to learn more about DuckDB.
[INFO]: Joerg's modified version: ingestor.load_files
[INFO]: params: dataset_ids=[1102] start_date=datetime.datetime(2007, 1, 1, 0, 0) end_date=datetime.datetime(2015, 1, 1, 0, 0) integration=<Integrations.ALL: 'all'> keep_data_files=True database_name='dataset.duckdb' precision='day' resolution=5000 cell_touches=True base_path='/out' netcdf_backend=<NetCDFBackends.XARRAY: 'xarray'>
[INFO]: file_mapping: [{'entry': <metacatalog.models.entry.Entry object at 0x7f8da0d186a0>, 'data_path': None}]
0%| | 0/1 [00:00<?, ?it/s]
[INFO]: progress bar - tqdm(file_mapping): 0%| | 0/1 [00:00<?, ?it/s]
0%| | 0/1 [00:00<?, ?it/s][INFO]: in loop: mapping: {'entry': <metacatalog.models.entry.Entry object at 0x7f8da0d186a0>, 'data_path': None}
[INFO]: entry: <ID=1102 HYRAS-DE-PRE - Raste [precipitation] >
0%| | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/src/run.py", line 135, in <module>
    path = ingestor.load_files(file_mapping=file_mapping)
  File "/src/ingestor.py", line 184, in load_files
    data_path = Path(mapping['data_path'])
  File "/usr/local/lib/python3.10/pathlib.py", line 960, in __new__
    self = cls._from_parts(args)
  File "/usr/local/lib/python3.10/pathlib.py", line 594, in _from_parts
    drv, root, parts = self._parse_args(args)
  File "/usr/local/lib/python3.10/pathlib.py", line 578, in _parse_args
    a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not NoneType
In processing.log we now have a little more information:
Processing logs:
----------------
[2024-04-12 13:20:27,729] - [DEBUG] - [START ThreadPoolExecutor - Pool to load and clip data source files.]
[2024-04-12 13:20:27,817] - [DEBUG] - [STOP ThreadPoolExecutor - Pool finished all tasks and shutdown.]
[2024-04-12 13:20:27,817] - [INFO] - [Starting to create a consistent DuckDB dataset at /out/dataset.duckdb. Check out https://duckdb.org/docs/api/overview to learn more about DuckDB.]
[2024-04-12 13:20:27,817] - [INFO] - [Joerg's modified version: ingestor.load_files]
[2024-04-12 13:20:27,817] - [INFO] - [params: dataset_ids=[1102] start_date=datetime.datetime(2007, 1, 1, 0, 0) end_date=datetime.datetime(2015, 1, 1, 0, 0) integration=<Integrations.ALL: 'all'> keep_data_files=True database_name='dataset.duckdb' precision='day' resolution=5000 cell_touches=True base_path='/out' netcdf_backend=<NetCDFBackends.XARRAY: 'xarray'>]
[2024-04-12 13:20:27,817] - [INFO] - [file_mapping: [{'entry': <metacatalog.models.entry.Entry object at 0x7f8da0d186a0>, 'data_path': None}]]
[2024-04-12 13:20:27,817] - [INFO] - [progress bar - tqdm(file_mapping): 0%| | 0/1 [00:00<?, ?it/s]]
[2024-04-12 13:20:27,818] - [INFO] - [in loop: mapping: {'entry': <metacatalog.models.entry.Entry object at 0x7f8da0d186a0>, 'data_path': None}]
[2024-04-12 13:20:27,820] - [INFO] - [entry: <ID=1102 HYRAS-DE-PRE - Raste [precipitation] >]
We would now expect the problem to be related to 'data_path': None, but we are not sure where this comes from.
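A defensive check along these lines (just a sketch with a hypothetical helper, not the actual ingestor code) would at least turn the hard crash in Path() into a readable warning and let the remaining entries continue:

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)

def iter_valid_mappings(file_mapping: list[dict]):
    """Yield only mappings with a resolved data_path; warn about the rest.

    Sketch of a guard that could run before Path(mapping['data_path'])
    in ingestor.load_files.
    """
    for mapping in file_mapping:
        if mapping.get("data_path") is None:
            logger.warning("No data_path resolved for entry %s; skipping.", mapping.get("entry"))
            continue
        yield {**mapping, "data_path": Path(mapping["data_path"])}
```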
More progress, but new errors. Now we have an output folder that looks like this:
-rw-r--r-- 1 root root 29542330368 Apr 16 17:40 dataset.duckdb
drwxr-xr-x 2 root root 6 Apr 16 17:40 datasets
-rw-r--r-- 1 root root 777 Apr 16 17:27 errors.log
-rw-r--r-- 1 root root 41257 Apr 16 17:40 processing.log
-rw-r--r-- 1 root root 71982 Apr 16 17:26 reference_area.ascii
-rw-r--r-- 1 root root 96230 Apr 16 17:26 reference_area.geojson
However, the datasets folder is empty. Also note the huge dataset.duckdb file.
processing.log:
Version: jm_demo 0.3
The following information has been submitted to the tool:
START DATE: 1950-01-01 12:00:00+01:00
END DATE: 2020-12-31 12:00:00+01:00
REFERENCE AREA: True
INTEGRATION: all
KEEP DATA FILES: True
DATASET IDS:
1106
DATABASE CONNECTION: True
DATABASE URI: Engine(postgresql://postgres@10.88.0.1:5432/metacatalog-dev)
AGGREGATION SETTINGS
--------------------
PRECISION: day
RESOLUTION: 5000x5000
TARGET CRS: EPSG:3857
Processing logs:
----------------
[2024-04-16 15:26:56,787] - [DEBUG] - [START ThreadPoolExecutor - Pool to load and clip data source files.]
[2024-04-16 15:26:56,886] - [WARNING] - [data_path set to /data/qt7760/hyras/TemperatureMean]
[2024-04-16 15:26:56,886] - [DEBUG] - [STOP ThreadPoolExecutor - Pool finished all tasks and shutdown.]
[2024-04-16 15:26:56,886] - [INFO] - [Starting to create a consistent DuckDB dataset at /out/dataset.duckdb. Check out https://duckdb.org/docs/api/overview to learn more about DuckDB.]
[2024-04-16 15:26:56,886] - [INFO] - [Joerg's modified version: ingestor.load_files]
[2024-04-16 15:26:56,886] - [INFO] - [params: dataset_ids=[1106] start_date=datetime.datetime(1950, 1, 1, 12, 0, tzinfo=tzoffset(None, 3600)) end_date=datetime.datetime(2020, 12, 31, 12, 0, tzinfo=tzoffset(None, 3600)) integration=<Integrations.ALL: 'all'> keep_data_files=True database_name='dataset.duckdb' precision='day' resolution=5000 cell_touches=True base_path='/out' netcdf_backend=<NetCDFBackends.XARRAY: 'xarray'>]
[2024-04-16 15:26:56,886] - [INFO] - [file_mapping: [{'entry': <metacatalog.models.entry.Entry object at 0x7fa80f7b4610>, 'data_path': '/data/qt7760/hyras/TemperatureMean'}]]
[2024-04-16 15:26:56,887] - [INFO] - [progress bar - tqdm(file_mapping): 0%| | 0/1 [00:00<?, ?it/s]]
[2024-04-16 15:26:56,887] - [INFO] - [in loop: mapping: {'entry': <metacatalog.models.entry.Entry object at 0x7fa80f7b4610>, 'data_path': '/data/qt7760/hyras/TemperatureMean'}]
[2024-04-16 15:26:56,890] - [INFO] - [entry: <ID=1106 HYRAS-DE-TAS - Raste [air temperature] >]
[2024-04-16 15:26:56,890] - [DEBUG] - [data_path: /data/qt7760/hyras/TemperatureMean]
[2024-04-16 15:26:56,967] - [INFO] - [Loading preprocessed source <ID=1106> to duckdb database </out/dataset.duckdb> for data integration...]
[2024-04-16 15:26:59,125] - [INFO] - [duckdb /out/dataset.duckdb -c "CREATE TABLE air_temperature_1106 ( time TIMESTAMP, lon DOUBLE, lat DOUBLE, tas DOUBLE);"]
[2024-04-16 15:27:15,087] - [INFO] - [python - dfs = [data.to_dask_dataframe()[['time', 'lon', 'lat', 'tas']].partitions[i].compute() for i in range(1)]]
[2024-04-16 15:27:15,149] - [ERROR] - [ERRORED on loading file </data/qt7760/hyras/TemperatureMean/tas_hyras_5_1993_v5-0_de.nc>]
Traceback (most recent call last):
  File "/src/ingestor.py", line 198, in load_files
    table_name = _switch_source_loader(entry, fname)
  File "/src/ingestor.py", line 244, in _switch_source_loader
    return load_xarray_to_duckdb(entry, ds)
  File "/src/ingestor.py", line 288, in load_xarray_to_duckdb
    db.execute(sql)
duckdb.InvalidInputException: Invalid Input Error: Required module 'pandas.core.arrays.arrow.dtype' failed to import, due to the following Python exception:
ModuleNotFoundError: No module named 'pandas.core.arrays.arrow.dtype'
[2024-04-16 15:27:15,189] - [INFO] - [Loading preprocessed source <ID=1106> to duckdb database </out/dataset.duckdb> for data integration...]
[2024-04-16 15:27:19,789] - [INFO] - [python - dfs = [data.to_dask_dataframe()[['time', 'lon', 'lat', 'tas']].partitions[i].compute() for i in range(1)]]
[2024-04-16 15:27:25,047] - [INFO] - [duckdb - FOREACH df in dfs - INSERT INTO air_temperature_1106 SELECT time as time , lon AS lon, lat AS lat, tas FROM df;]
[2024-04-16 15:27:25,535] - [INFO] - [took 10.35 seconds]
[2024-04-16 15:27:25,574] - [INFO] - [Loading preprocessed source <ID=1106> to duckdb database </out/dataset.duckdb> for data integration...]
[2024-04-16 15:27:30,153] - [INFO] - [python - dfs = [data.to_dask_dataframe()[['time', 'lon', 'lat', 'tas']].partitions[i].compute() for i in range(1)]]
[2024-04-16 15:27:35,994] - [INFO] - [duckdb - FOREACH df in dfs - INSERT INTO air_temperature_1106 SELECT time as time , lon AS lon, lat AS lat, tas FROM df;]
[2024-04-16 15:27:36,487] - [INFO] - [took 10.91 seconds]
[2024-04-16 15:27:36,523] - [INFO] - [Loading preprocessed source <ID=1106> to duckdb database </out/dataset.duckdb> for data integration...]
[2024-04-16 15:27:42,683] - [INFO] - [python - dfs = [data.to_dask_dataframe()[['time', 'lon', 'lat', 'tas']].partitions[i].com
...
<this goes on for a while>
...
[2024-04-16 15:40:12,292] - [INFO] - [Loading preprocessed source <ID=1106> to duckdb database </out/dataset.duckdb> for data integration...]
[2024-04-16 15:40:17,609] - [INFO] - [python - dfs = [data.to_dask_dataframe()[['time', 'lon', 'lat', 'tas']].partitions[i].compute() for i in range(1)]]
[2024-04-16 15:40:24,778] - [INFO] - [duckdb - FOREACH df in dfs - INSERT INTO air_temperature_1106 SELECT time as time , lon AS lon, lat AS lat, tas FROM df;]
[2024-04-16 15:40:25,382] - [INFO] - [took 13.09 seconds]
[2024-04-16 15:40:25,421] - [INFO] - [Loading preprocessed source <ID=1106> to duckdb database </out/dataset.duckdb> for data integration...]
[2024-04-16 15:40:31,257] - [INFO] - [python - dfs = [data.to_dask_dataframe()[['time', 'lon', 'lat', 'tas']].partitions[i].compute() for i in range(1)]]
[2024-04-16 15:40:38,442] - [INFO] - [duckdb - FOREACH df in dfs - INSERT INTO air_temperature_1106 SELECT time as time , lon AS lon, lat AS lat, tas FROM df;]
[2024-04-16 15:40:39,119] - [INFO] - [took 13.70 seconds]
[2024-04-16 15:40:39,125] - [INFO] - [duckdb - CREATE MACRO air_temperature_1106_temporal_aggregate(precision) AS TABLE SELECT date_trunc(precision, time) AS time, AVG(tas) AS mean, STDDEV(tas) AS std, KURTOSIS(tas) AS kurtosis, SKEWNESS(tas) AS skewness, MEDIAN(tas) AS median, MIN(tas) AS min, MAX(tas) AS max, SUM(tas) AS sum, COUNT(tas) AS count, quantile_disc(tas, 0.25) as quartile_25, quantile_disc(tas, 0.75) as quartile_75, entropy(tas) as entropy, histogram(tas) as histogram FROM air_temperature_1106 GROUP BY date_trunc(precision, time);]
[2024-04-16 15:40:40,164] - [INFO] - [duckdb - CREATE MACRO air_temperature_1106_spatial_aggregate(resolution) AS TABLE WITH t as (SELECT ST_Transform(ST_Point(lon, lat), 'epsg:4326', 'epsg:3857') as geom, tas FROM air_temperature_1106) SELECT ROUND(ST_Y(geom) / resolution)::int * resolution AS y, ROUND(ST_X(geom) / resolution)::int * resolution AS x, AVG(tas) AS mean, STDDEV(tas) AS std, KURTOSIS(tas) AS kurtosis, SKEWNESS(tas) AS skewness, MEDIAN(tas) AS median, MIN(tas) AS min, MAX(tas) AS max, SUM(tas) AS sum, COUNT(tas) AS count, quantile_disc(tas, 0.25) as quartile_25, quantile_disc(tas, 0.75) as quartile_75, entropy(tas) as entropy, histogram(tas) as histogram FROM t GROUP BY x, y;]
[2024-04-16 15:40:41,230] - [INFO] - [duckdb - CREATE MACRO air_temperature_1106_spatiotemporal_aggregate(resolution, precision) AS TABLE WITH t as (SELECT date_trunc(precision, time) AS time, ST_Transform(ST_Point(lon, lat), 'epsg:4326', 'epsg:3857') as geom, tas FROM air_temperature_1106) SELECT time, ROUND(ST_Y(geom) / resolution)::int * resolution AS y, ROUND(ST_X(geom) / resolution)::int * resolution AS x, AVG(tas) AS mean, STDDEV(tas) AS std, KURTOSIS(tas) AS kurtosis, SKEWNESS(tas) AS skewness, MEDIAN(tas) AS median, MIN(tas) AS min, MAX(tas) AS max, SUM(tas) AS sum, COUNT(tas) AS count, quantile_disc(tas, 0.25) as quartile_25, quantile_disc(tas, 0.75) as quartile_75, entropy(tas) as entropy, histogram(tas) as histogram FROM t GROUP BY time, x, y;]
[2024-04-16 15:40:42,350] - [DEBUG] - [calling load_metadata_to_duckdb()...]
[2024-04-16 15:40:44,754] - [DEBUG] - [Database /out/dataset.duckdb does not contain a table 'metadata'. Creating it now...]
errors.log:
[2024-04-16 15:26:56,886] - [WARNING] - [data_path set to /data/qt7760/hyras/TemperatureMean]
[2024-04-16 15:27:15,149] - [ERROR] - [ERRORED on loading file </data/qt7760/hyras/TemperatureMean/tas_hyras_5_1993_v5-0_de.nc>]
Traceback (most recent call last):
  File "/src/ingestor.py", line 198, in load_files
    table_name = _switch_source_loader(entry, fname)
  File "/src/ingestor.py", line 244, in _switch_source_loader
    return load_xarray_to_duckdb(entry, ds)
  File "/src/ingestor.py", line 288, in load_xarray_to_duckdb
    db.execute(sql)
duckdb.InvalidInputException: Invalid Input Error: Required module 'pandas.core.arrays.arrow.dtype' failed to import, due to the following Python exception:
ModuleNotFoundError: No module named 'pandas.core.arrays.arrow.dtype'
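The traceback suggests that the duckdb installed in the container expects a pandas module that the installed pandas does not provide. For reference, a quick way to check the installed versions inside the container (just a diagnostic sketch):

```python
import importlib
import duckdb
import pandas

print("duckdb:", duckdb.__version__)
print("pandas:", pandas.__version__)

# The module that duckdb tried to import according to the traceback:
try:
    importlib.import_module("pandas.core.arrays.arrow.dtype")
    print("pandas.core.arrays.arrow.dtype is importable")
except ModuleNotFoundError:
    print("pandas.core.arrays.arrow.dtype is missing -> likely a duckdb/pandas version mismatch")
```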
Does anyone have an idea how to solve this?
I can start the process and some output is written, but it seems something went wrong with os.path within the process. Is this a known issue? Or am I maybe using an outdated version of tool_vforwater_loader?
Container logs:
My Input is:
Output is:
Content of processing.log: