Closed: p1d1d1 closed this 2 months ago
@FStriewski @314a can you please provide feedback here?
@eliaferrari We are using the manually processed merged_data.pkl from temp_data - could it be that it was created with an incomplete source.csv?
@FStriewski the manually processed dataframe contains all the data, Bund, GE, GL, GR, JU, SO, SZ, TI, VD, UR, ZG, ZH included (https://github.com/FHNW-IVGI/Geoharvester/blob/main/scraper/temp_data/merged_data.pkl). What is missing are the ST_* data, I'll check what the problem is.
You have to replace the old dataframe with this one, otherwise it won't be read from this temporary path with the current code. As an alternative, we can temporarily change the path here: https://github.com/FHNW-IVGI/Geoharvester/blob/2f6926e9e45644f7a52d35255b4093c05ad6f79a/server/app/main.py#L78
@eliaferrari You're right, I had copied the merged_data.pkl to the data folder on production but the change wasn't picked up. I've adjusted the path so that we use the temp_data folder for now, which also helps in case we need further manual preprocessing.
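A minimal sketch of how such a path switch could be made less fragile, by picking the first dataframe file that actually exists. The helper name `first_existing` and the candidate paths are hypothetical and are not taken from main.py:

```python
from pathlib import Path


def first_existing(candidates):
    """Return the first candidate that is an existing file, or None."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return path
    return None


# Hypothetical candidate locations, in order of preference.
CANDIDATES = [
    "scraper/temp_data/merged_data.pkl",
    "scraper/data/merged_data.pkl",
]

if __name__ == "__main__":
    chosen = first_existing(CANDIDATES)
    print(chosen if chosen else "no merged_data.pkl found")
```

With something like this, the server would fall back to the regular data folder once the temp_data file is removed, without another code change.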
Running the scraper now. If you need to test your commits, you can trigger the scraper yourself from this page: https://github.com/FHNW-IVGI/Geoharvester/actions/workflows/run_scraper.yml Click the "Run workflow" dropdown, check that "main" is selected, and click the "Run workflow" button. You can click on the title of a row to see the pipeline and logs.
@p1d1d1 Providers should be restored now.
@FStriewski ST_ZH, ST_BE, ASIT and SOSM are empty.
@eliaferrari they are indeed missing in the merged_data.pkl but are present in the sources.csv. Was the latest sources.csv file used for the manual run?
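For reference, a check like this can be scripted. The sketch below assumes the sources file has a column named `provider`, which is a guess at the layout rather than the confirmed schema of sources.csv. The sample data is made up:

```python
import csv
import io


def providers_in_csv(csv_text, column="provider"):
    """Return the set of provider names found in the given CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[column].strip() for row in reader if row.get(column)}


def missing_providers(expected, scraped):
    """Providers listed in the sources but absent from the scraped data."""
    return sorted(set(expected) - set(scraped))


# Made-up sample; the column name "provider" is an assumption.
SAMPLE_SOURCES = (
    "provider,url\n"
    "GE,https://example.org/ge\n"
    "ST_ZH,https://example.org/st_zh\n"
    "ASIT,https://example.org/asit\n"
)

if __name__ == "__main__":
    scraped = {"GE"}  # e.g. the unique providers found in the merged dataframe
    print(missing_providers(providers_in_csv(SAMPLE_SOURCES), scraped))
```

Running this against the real files would list exactly the providers that are in the sources but never made it into the dataframe.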
Indeed, the one that I used does not contain the ST_ZH, ST_BE, ASIT and SOSM data: https://github.com/FHNW-IVGI/Geoharvester/blob/main/scraper/data/geoservices_CH.csv @FStriewski do you know where I can find a newer version?
This one is the newest version (and the path that is used by the pipeline scraper): https://github.com/FHNW-IVGI/Geoharvester/blob/main/scraper/sources.csv
sources.csv contains all the GetCapabilities links. Are the scraped metadata from those links saved somewhere else? The geoservices_CH.csv file is 9 months old.
Providers (ASIT, SOSM, ST_ZH etc.) are back online. ST_BE was missing in sources and wasn't scraped. I've added it back in, and it should be available after the next run.
BUND, GE, GL, GR, JU, SO, SZ, TI, VD, UR, ZG, ZH, ST_SH, ST_BE are all empty. Please fix this with high priority.