ec-doris / kohesio-frontend

Web interface for Kohesio project
https://kohesio.ec.europa.eu

Automatic daily CSV/XLSX exports on test #548

Closed madewild closed 2 years ago

madewild commented 2 years ago

Currently all the exports are sitting on an EBS volume:

https://data.linkedopendata.eu/kohesio/projects/ https://dev.data.linkedopendata.eu/kohesio/projects/

It would be better to put them on S3, but we want to keep a neutral URL, not one containing "aws" or "s3". For instance, the latest export for Austria should always be at https://data.linkedopendata.eu/kohesio/projects/latest_AT.csv (or maybe later it will be something like https://data.kohesio.ec.europa.eu/projects/latest_AT.csv).

So we need a script or library that serves the files under that URL in a way that is transparent to users. I found https://github.com/rufuspollock/s3-bucket-listing with a demo on http://data.openspending.org/ : it works well for folders (e.g. http://data.openspending.org/datasets/abc/data/) but not for files, which still expose the S3 URL (http://data.openspending.org.s3-eu-west-1.amazonaws.com/datasets/abc/data/tmp.csv).
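A minimal sketch of the translation such a proxy would have to perform, so that the public URL stays neutral and only the proxy knows the S3 location. The bucket name `kohesio-exports`, the region `eu-west-1` and the `/kohesio/` prefix handling are assumptions for illustration; none of them is specified in this thread.

```python
from urllib.parse import urlsplit

# Hypothetical values, for illustration only.
BUCKET = "kohesio-exports"
REGION = "eu-west-1"
PUBLIC_PREFIX = "/kohesio/"


def to_s3_url(public_url: str) -> str:
    """Translate a neutral public URL into the S3 object URL behind it."""
    path = urlsplit(public_url).path
    if not path.startswith(PUBLIC_PREFIX):
        raise ValueError(f"not an export URL: {public_url}")
    # Everything after the public prefix is used as the object key.
    key = path[len(PUBLIC_PREFIX):]
    return f"https://{BUCKET}.s3.{REGION}.amazonaws.com/{key}"
```

With this mapping, users only ever see the first URL; the second one is fetched by the proxy on their behalf.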

@svili could you look into this?

madewild commented 2 years ago

@svili what's the status of this inside the cluster?

svili commented 2 years ago

It was only a POC so far, but if @raphdom says ok, I'll add the nginx-s3-proxy to the cluster next week.

madewild commented 2 years ago

Let's move this to 1.1

madewild commented 2 years ago

So when it's ready it should be exposed to https://kohesio.ec.europa.eu/data (+ other ENVs)

madewild commented 2 years ago

@svili @raphdom let's not forget this one! ;)

raphdom commented 2 years ago

oki doki

madewild commented 2 years ago

On the page https://kohesio.test.ec.europa.eu/services the link "all the data" already points to https://kohesio.test.ec.europa.eu/data but currently it redirects to https://kohesio.test.ec.europa.eu/404

raphdom commented 2 years ago

now https://kohesio.test.ec.europa.eu/data is working fine with the new nginx-s3-proxy app. Let's keep this here to move this to acc and prod

madewild commented 2 years ago

What is this "Screenshot 2022-02-18 at 07.39.41.png" ? ;)

madewild commented 2 years ago

And more exports are empty on https://kohesio.test.ec.europa.eu/data/projects/ so it's not working...

svili commented 2 years ago

The gateway is working, there's just bad data in the bucket, will fix it.

raphdom commented 2 years ago

On acceptance now. The workflow to release on acc copies the data from staging, and @svili is also copying the latest image from the staging ECR.

madewild commented 2 years ago

@svili what's the status of this? I still see many empty files on https://kohesio.test.ec.europa.eu/data/

svili commented 2 years ago

I can always upload the newest files from the wikibase machine, that's not a problem. I created a Python script for creating and uploading the dumps and tested it yesterday. The problem was that the bigger exports like Italy time out (the timeout is 1 hour per country), and I don't know yet if it's because the SPARQL query is slow or just the network connection. I'm testing it now. If it's the latter, that's not a problem because the exports will run on the same cluster as QAnswer.

I fixed some bugs in the meantime, like some of the countries not having language codes specified when dumping hence the dumps were probably wrong (for Lithuania and Greece specifically). (?)
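A small sketch of how a missing language code could be made to fail loudly instead of silently producing a wrong dump. The mapping and the function name are hypothetical, and the country keys used by the real exporter may differ.

```python
# Hypothetical mapping from country code to export language code.
LANGUAGE_BY_COUNTRY = {"AT": "de", "IT": "it", "LT": "lt", "GR": "el"}


def language_for(country: str) -> str:
    """Return the language code for a country, or raise if none is configured."""
    try:
        return LANGUAGE_BY_COUNTRY[country.upper()]
    except KeyError:
        # Refuse to dump rather than fall back to a default language.
        raise ValueError(f"no language code configured for {country!r}") from None
```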

If you want, I can upload the data in the meantime, so it's there at least?

madewild commented 2 years ago

No not necessary, but this is the only part that is missing for the release now... just want to be sure that we are on track to make it fully operational by next Thursday. Don't hesitate to check with @D063520 and @DiaZork if there is anything suspect about the contents of the exports.

madewild commented 2 years ago

Ah and we have another bug... If you go to https://kohesio.test.ec.europa.eu/services and click "all the data" you get a 404 because the routerLink does not work (https://kohesio.test.ec.europa.eu/data/ is not really inside the Angular app). If you right click and open in a new tab then it works. @raphdom what is the right way to make it work? we cannot hardcode the full URL since it will change from env to env...

svili commented 2 years ago

just use href instead of [routerLink] I think.

raphdom commented 2 years ago

yeah, that's it

raphdom commented 2 years ago

use href="/data"

madewild commented 2 years ago

Mmmh does not work on my local... but I pushed anyway to try on dev.

madewild commented 2 years ago

I confirm it does NOT work with href on https://kohesio.development.ec.europa.eu/services

svili commented 2 years ago

I don't think there is an nginx-gateway set up for dev, is there?

svili commented 2 years ago

The /data path is not part of the site: the load balancer (or whatever it's called right now in Kubernetes) reroutes that traffic to the nginx gateway, so that's why it's not working on dev and local.
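The routing described here can be sketched as a path-prefix decision. This is only an illustration: the function and the backend names are hypothetical, and the actual Kubernetes routing rules are not shown in this thread.

```python
def pick_backend(path: str, has_gateway: bool) -> str:
    """Return which service would answer a request for the given path."""
    is_data_path = path == "/data" or path.startswith("/data/")
    if has_gateway and is_data_path:
        # Environments with the gateway deployed hand /data to the S3 proxy.
        return "nginx-s3-proxy"
    # Everything else, including /data without a gateway, reaches the
    # Angular app, which has no such route and shows its 404 page.
    return "angular-frontend"
```

On dev and local, `has_gateway` is effectively false, which is why a plain `href="/data"` still ends up on the 404 page there.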

madewild commented 2 years ago

ah damn sorry, so we first need to deploy to test, fair enough

madewild commented 2 years ago

Working on https://kohesio.test.ec.europa.eu/services now ;)

madewild commented 2 years ago

Including fixing Italy issue

madewild commented 2 years ago

@svili do you see a solution for Italy?

D063520 commented 2 years ago

I think I can propose a solution: export Italy program by program. This is not always possible in other countries since the program might not be available, but I think it is for Italy. The idea would be:

1) find all programs
2) for each program, extract the projects
3) join the CSV files in bash

Would that be fine for you? We would avoid additional infrastructure. I can take a look at the queries we need and you can put it together.
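Step 3 could look like the sketch below, shown in Python rather than bash since the exporter is a Python script. It assumes every per-program file starts with the same header row; the function name is hypothetical.

```python
import csv
from pathlib import Path
from typing import Iterable


def join_csv_files(parts: Iterable[Path], target: Path) -> int:
    """Concatenate CSV files into one, keeping a single header row.

    Returns the number of data rows written.
    """
    rows_written = 0
    header = None
    with target.open("w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for part in parts:
            with part.open(newline="", encoding="utf-8") as src:
                reader = csv.reader(src)
                part_header = next(reader, None)
                if part_header is None:
                    continue  # completely empty file, nothing to copy
                if header is None:
                    header = part_header
                    writer.writerow(header)
                elif part_header != header:
                    raise ValueError(f"header mismatch in {part}")
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

Using the `csv` module instead of plain line concatenation keeps quoted fields that contain line breaks intact.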

svili commented 2 years ago

@D063520 Yes please, if you could modify the query for me that'd be way faster than if I tried that. Here is the current query used in the code.

D063520 commented 2 years ago

@svili can you give me access to the repo

svili commented 2 years ago

Added you.

D063520 commented 2 years ago

I checked, there is no project without program

Screenshot 2022-05-16 at 17 20 45

Ok, so only for Italy you first ask:

PREFIX wd: <https://linkedopendata.eu/entity/>
PREFIX wdt: <https://linkedopendata.eu/prop/direct/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT distinct ?p WHERE {
  ?s wdt:P35 wd:Q9934 .
  ?s wdt:P32 wd:Q15 .
  ?s wdt:P1368 ?p
}

This will give you all programs. Then in the query you add at the very beginning the triple:

?link <https://linkedopendata.eu/prop/direct/P1368> <program>

and you run the query once for each program. The query after the WHERE clause should look like:

PREFIX wd: <http://www.wikidata.org/entity/>
select (?link as ?Operation_Unique_Identifier)  where  
    {
    ?link <https://linkedopendata.eu/prop/direct/P35> <https://linkedopendata.eu/entity/Q9934> .
    ?link <https://linkedopendata.eu/prop/direct/P32> <https://linkedopendata.eu/entity/Q15> .
    ?link <https://linkedopendata.eu/prop/direct/P1368> <program> . 
} group by ?link

Let me know if you run into trouble.
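As a rough sketch, the per-program query could be generated like this. The template only contains the identifier column from the query above, the real export query has more to it, and the function name and the validation rule are assumptions.

```python
import re

ENTITY_PREFIX = "https://linkedopendata.eu/entity/"

# The program triple goes first, followed by the two existing filters.
QUERY_TEMPLATE = """\
select (?link as ?Operation_Unique_Identifier) where {{
    ?link <https://linkedopendata.eu/prop/direct/P1368> <{program}> .
    ?link <https://linkedopendata.eu/prop/direct/P35> <https://linkedopendata.eu/entity/Q9934> .
    ?link <https://linkedopendata.eu/prop/direct/P32> <https://linkedopendata.eu/entity/Q15> .
}} group by ?link
"""


def build_program_query(program_uri: str) -> str:
    """Return the export query restricted to a single program."""
    if not program_uri.startswith(ENTITY_PREFIX):
        raise ValueError(f"unexpected program URI: {program_uri}")
    # Accept only plain entity IDs so nothing else is pasted into the query.
    if not re.fullmatch(r"Q[0-9]+", program_uri[len(ENTITY_PREFIX):]):
        raise ValueError(f"unexpected program URI: {program_uri}")
    return QUERY_TEMPLATE.format(program=program_uri)
```

The exporter would call this once for each `?p` value returned by the first query and write each result to its own CSV file.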

svili commented 2 years ago

I added this, the next export should include Italy.

D063520 commented 2 years ago

How long does it take? Just to understand ....

svili commented 2 years ago

Didn't try all of it. I filtered on one program with ~5000 Italian projects and it took less than a minute.

svili commented 2 years ago

It worked. It took 1 hour for the whole of Italy. https://kohesio.test.ec.europa.eu/data/projects/ (Don't mind the extra files: if the exporter didn't run for some reason, it couldn't delete the files from a week earlier. This can be solved easily, but this is only staging anyway.)
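One way to make the cleanup independent of whether the exporter ran is to delete by age rather than by date arithmetic. The sketch below works on a local folder for simplicity; for the bucket, the same comparison would apply to each object's last-modified time. The function name and the dated file names are hypothetical; only the `latest_` prefix comes from this thread.

```python
import time
from pathlib import Path
from typing import List, Optional

MAX_AGE_SECONDS = 7 * 24 * 3600  # keep one week of dated exports


def prune_old_exports(folder: Path, now: Optional[float] = None) -> List[Path]:
    """Delete dated export files older than one week and return them."""
    now = time.time() if now is None else now
    removed = []
    for path in sorted(folder.iterdir()):
        if not path.is_file() or path.name.startswith("latest_"):
            continue  # never touch sub-folders or the latest_* files
        if now - path.stat().st_mtime > MAX_AGE_SECONDS:
            path.unlink()
            removed.append(path)
    return removed
```

Because every run removes everything past the age limit, a skipped run no longer leaves stale files behind permanently.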

madewild commented 2 years ago

So all is fine now, can we close?

svili commented 2 years ago

Done:

What's not done:

raphdom commented 2 years ago

let's create another issue with the improvements for next release? for now seems to be good, @madewild ?