FHNW-IVGI / Geoharvester

NDGI Project Geoharvester

Many providers are empty #113

Closed p1d1d1 closed 2 months ago

p1d1d1 commented 2 months ago

BUND, GE, GL, GR, JU, SO, SZ, TI, VD, UR, ZG, ZH, ST_SH, ST_BE are all empty. Please fix this with high priority.

p1d1d1 commented 2 months ago

@FStriewski @314a can you please provide feedback here?

FStriewski commented 2 months ago

@eliaferrari We are using the manually processed merged_data.pkl from temp_data - could it be that it was created with an incomplete source.csv?

eliaferrari commented 2 months ago

> @eliaferrari We are using the manually processed merged_data.pkl from temp_data - could it be that it was created with an incomplete source.csv?

@FStriewski the manually processed dataframe contains all the data, Bund, GE, GL, GR, JU, SO, SZ, TI, VD, UR, ZG, ZH included (https://github.com/FHNW-IVGI/Geoharvester/blob/main/scraper/temp_data/merged_data.pkl). What is missing are the ST_* data, I'll check what the problem is.

You have to replace the old dataframe with this one, otherwise it won't be read from this temporary path with the current code. As an alternative, we can temporarily change the path here: https://github.com/FHNW-IVGI/Geoharvester/blob/2f6926e9e45644f7a52d35255b4093c05ad6f79a/server/app/main.py#L78
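A minimal sketch of the path switch described above, assuming the server could try several locations for the dataframe instead of a single hard-coded one. The helper `first_existing` and the second candidate path are hypothetical; only `scraper/temp_data/merged_data.pkl` appears in this thread, and this is not the code at the linked line of `main.py`.

```python
from pathlib import Path
from typing import Iterable, Optional


def first_existing(candidates: Iterable[Path]) -> Optional[Path]:
    """Return the first candidate that exists as a file, or None if none do."""
    for path in candidates:
        if path.is_file():
            return path
    return None


# Assumed search order: the manually processed file first, then a regular data folder.
# The second path is a placeholder for illustration, not a confirmed repository path.
CANDIDATES = [
    Path("scraper/temp_data/merged_data.pkl"),
    Path("scraper/data/merged_data.pkl"),
]

if __name__ == "__main__":
    chosen = first_existing(CANDIDATES)
    print(chosen if chosen else "no merged_data.pkl found")
```

With a lookup like this, dropping a manually processed file into the temporary folder would be picked up without editing the path by hand.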

FStriewski commented 2 months ago

@eliaferrari You're right, I had copied the merged_data.pkl to the data folder on production but the change wasn't picked up. I've adjusted the path so that we use the temp_data folder for now - also in case we need further manual preprocessing.

Running the scraper now. If you need to test your commits, you can trigger the scraper yourself from this page (https://github.com/FHNW-IVGI/Geoharvester/actions/workflows/run_scraper.yml). Click the "Run workflow" dropdown, check that "main" is selected, and click the "Run workflow" button. You can click on the title of a row to see the pipeline and logs.
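For anyone who prefers a script over the button, here is a hedged sketch that builds the same trigger as a GitHub REST API call (the "create a workflow dispatch event" endpoint). It assumes the workflow accepts manual dispatch, which the "Run workflow" button suggests, and that you have a personal access token with permission to run workflows. The function name `build_dispatch_request` is made up for illustration.

```python
import json
from urllib.request import Request

API_ROOT = "https://api.github.com"


def build_dispatch_request(owner: str, repo: str, workflow_file: str,
                           ref: str, token: str) -> Request:
    """Build (but do not send) a POST request that dispatches a workflow run."""
    url = (f"{API_ROOT}/repos/{owner}/{repo}"
           f"/actions/workflows/{workflow_file}/dispatches")
    # The branch to run on goes into the JSON body as "ref".
    body = json.dumps({"ref": ref}).encode("utf-8")
    headers = {
        "Accept": "application/vnd.github+json",
        "Authorization": f"Bearer {token}",
    }
    return Request(url, data=body, headers=headers, method="POST")
```

Passing the returned object to `urllib.request.urlopen` would send it, for example `build_dispatch_request("FHNW-IVGI", "Geoharvester", "run_scraper.yml", "main", token)`. The run then shows up in the same Actions list as a manually triggered one.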

@p1d1d1 Providers should be restored now.

p1d1d1 commented 2 months ago

@FStriewski ST_ZH, ST_BE, ASIT and SOSM are empty

FStriewski commented 2 months ago

@eliaferrari they are indeed missing in the merged_data.pkl but are present in the sources.csv. Was the latest sources.csv file used for the manual run?
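A quick way to run this kind of check is to compare the provider names in the sources file against those in the merged dataframe. The sketch below uses only the standard library and assumes the CSV has a header row with a provider column. The column name `Provider` and both helper functions are assumptions for illustration, not taken from the repository.

```python
import csv
from typing import Iterable, List, Set, TextIO


def providers_in_csv(handle: TextIO, column: str = "Provider") -> Set[str]:
    """Collect the distinct, non-empty provider names from a CSV with a header row."""
    reader = csv.DictReader(handle)
    return {
        name
        for row in reader
        if (name := (row.get(column) or "").strip())
    }


def missing_providers(expected: Iterable[str], actual: Iterable[str]) -> List[str]:
    """Return providers that are expected but absent, sorted for stable output."""
    return sorted(set(expected) - set(actual))


# Example wiring (file path from the thread, column name assumed):
# with open("scraper/sources.csv", newline="", encoding="utf-8") as fh:
#     expected = providers_in_csv(fh)
# The `actual` side would come from the provider column of merged_data.pkl,
# for instance via pandas.read_pickle(...) and .unique() on that column.
```

Any name returned by `missing_providers` is listed in the sources but has no rows in the merged data, which is the situation reported for ST_ZH, ST_BE, ASIT and SOSM.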

eliaferrari commented 2 months ago

Indeed, the one that I used does not contain ST_ZH, ST_BE, ASIT and SOSM data: https://github.com/FHNW-IVGI/Geoharvester/blob/main/scraper/data/geoservices_CH.csv @FStriewski do you know where I can find a newer version?

FStriewski commented 2 months ago

This one is the newest version (and the path that is used for the pipeline scraper): https://github.com/FHNW-IVGI/Geoharvester/blob/main/scraper/sources.csv

eliaferrari commented 2 months ago

sources.csv contains all the GetCapabilities links. Is the metadata scraped from those links saved somewhere else? The geoservices_CH.csv file is 9 months old.

FStriewski commented 2 months ago

Providers (ASIT, SOSM, ST_ZH etc.) are back online. ST_BE was missing in sources and wasn't scraped. I've added it back in and it should be available after the next run.