Closed madewild closed 2 years ago
@svili what's the status of this inside the cluster?
It was only a POC so far, but if @raphdom says ok, I'll add the nginx-s3-proxy to the cluster next week.
Let's move this to 1.1
So when it's ready it should be exposed to https://kohesio.ec.europa.eu/data (+ other ENVs)
@svili @raphdom let's not forget this one! ;)
oki doki
On the page https://kohesio.test.ec.europa.eu/services the link "all the data" already points to https://kohesio.test.ec.europa.eu/data but currently it redirects to https://kohesio.test.ec.europa.eu/404
now https://kohesio.test.ec.europa.eu/data is working fine with the new nginx-s3-proxy app. Let's keep this here to move this to acc and prod
What is this "Screenshot 2022-02-18 at 07.39.41.png" ? ;)
And more exports are empty on https://kohesio.test.ec.europa.eu/data/projects/ so it's not working...
The gateway is working, there's just bad data in the bucket, will fix it.
On acceptance now. The workflow to release on acc copies the data from staging, and @svili also copies the latest image from the staging ECR.
@svili what's the status of this? I still see many empty files on https://kohesio.test.ec.europa.eu/data/
I can always upload the newest files from the wikibase machine, that's not a problem. I created a Python script for the dump creation and uploading and tested it yesterday. The problem was that the bigger countries like Italy time out (the timeout is 1 hour per country export), and I don't know yet if it's because the SPARQL query is slow or just the network connection; I'm testing it now. If it's the latter, that's not a problem because they will run on the same cluster as the qanswer.
I fixed some bugs in the meantime, like some of the countries not having language codes specified when dumping hence the dumps were probably wrong (for Lithuania and Greece specifically). (?)
If you want, I can upload the data in the meantime, so it's there at least?
No not necessary, but this is the only part that is missing for the release now... just want to be sure that we are on track to make it fully operational by next Thursday. Don't hesitate to check with @D063520 and @DiaZork if there is anything suspect about the contents of the exports.
Ah and we have another bug... If you go to https://kohesio.test.ec.europa.eu/services and click "all the data" you get a 404 because the routerLink does not work (https://kohesio.test.ec.europa.eu/data/ is not really inside the Angular app). If you right click and open in a new tab then it works. @raphdom what is the right way to make it work? we cannot hardcode the full URL since it will change from env to env...
just use href instead of [routerLink] I think.
yeah, that's it
use href="/data"
Mmmh does not work on my local... but I pushed anyway to try on dev.
I confirm it does NOT work with href on https://kohesio.development.ec.europa.eu/services
I don't think there is an nginx-gateway set up for dev is there?
/data is not part of the site; the load balancer (or whatever it's called right now in Kubernetes) reroutes the traffic to the nginx gateway, so that's why it's not working on dev and local.
ah damn sorry, so we first need to deploy to test, fair enough
Working on https://kohesio.test.ec.europa.eu/services now ;)
Including fixing Italy issue
@svili do you see a solution for Italy?
I think I can propose a solution: the idea would be to export Italy program by program. This is not always possible in other countries since the program might not be available, but I think for Italy it is. The steps would be: 1) find all programs, 2) for each program, extract the projects, 3) join the CSV files in bash. Would that be fine for you? We would avoid additional infrastructure. I can take a look at the queries we need and you can put it together.
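A minimal sketch of step 3, assuming each per-program export is a CSV with the same header row. The comment above suggests bash; Python is used here only because the exporter script mentioned earlier is in Python. The function name `join_csv_texts` is hypothetical and not taken from the repo.

```python
import csv
import io
from typing import Iterable, List, Optional


def join_csv_texts(parts: Iterable[str]) -> str:
    """Concatenate CSV documents that share a header, writing the header once."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    header: Optional[List[str]] = None
    for text in parts:
        rows = list(csv.reader(io.StringIO(text)))
        if not rows:
            continue  # a program with an empty export contributes nothing
        if header is None:
            header = rows[0]
            writer.writerow(header)
        elif rows[0] != header:
            # Assumption: all per-program exports come from the same query,
            # so a differing header signals a broken export.
            raise ValueError("CSV parts have different headers")
        writer.writerows(rows[1:])
    return out.getvalue()
```

Going through the `csv` module instead of plain line concatenation keeps quoted fields that contain newlines intact.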
@D063520 Yes please, if you could modify the query for me that'd be way faster than if I tried that. Here is the current query used in the code.
@svili can you give me access to the repo
Added you.
I checked, there is no project without program
OK, so for Italy only, you first run:
PREFIX wd: <https://linkedopendata.eu/entity/>
PREFIX wdt: <https://linkedopendata.eu/prop/direct/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?p WHERE {
  ?s wdt:P35 wd:Q9934 .
  ?s wdt:P32 wd:Q15 .
  ?s wdt:P1368 ?p .
}
This will give you all programs. Then, in the query, you add at the very beginning the triple:
?link <https://linkedopendata.eu/prop/direct/P1368> <program> .
and you run the query once for each of the programs:
PREFIX wd: <http://www.wikidata.org/entity/>
SELECT (?link AS ?Operation_Unique_Identifier) WHERE {
  ?link <https://linkedopendata.eu/prop/direct/P35> <https://linkedopendata.eu/entity/Q9934> .
  ?link <https://linkedopendata.eu/prop/direct/P32> <https://linkedopendata.eu/entity/Q15> .
  ?link <https://linkedopendata.eu/prop/direct/P1368> <program> .
} GROUP BY ?link
Let me know if you run into troubles
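For illustration, the per-program step could be sketched in Python roughly as follows. `BASE_QUERY` is a simplified stand-in for the real export query (which is not shown in this thread), and the helper names are hypothetical. The sketch assumes the first `{` in the query opens the main WHERE block.

```python
from typing import List

PROGRAM_PROP = "https://linkedopendata.eu/prop/direct/P1368"

# Simplified stand-in; the real export query selects many more columns.
BASE_QUERY = """SELECT (?link AS ?Operation_Unique_Identifier) WHERE {
  ?link <https://linkedopendata.eu/prop/direct/P35> <https://linkedopendata.eu/entity/Q9934> .
  ?link <https://linkedopendata.eu/prop/direct/P32> <https://linkedopendata.eu/entity/Q15> .
} GROUP BY ?link"""


def restrict_to_program(query: str, program_iri: str) -> str:
    """Insert the program triple right after the first '{' of the query."""
    # Reject characters that are not allowed inside a SPARQL IRI reference.
    if any(ch in program_iri for ch in '<>" {}|\\^`'):
        raise ValueError(f"not a plain IRI: {program_iri!r}")
    head, brace, tail = query.partition("{")
    if not brace:
        raise ValueError("query has no '{'")
    triple = f"\n  ?link <{PROGRAM_PROP}> <{program_iri}> ."
    return head + brace + triple + tail


def queries_for_programs(query: str, program_iris: List[str]) -> List[str]:
    """Build one restricted query per program IRI returned by the first query."""
    return [restrict_to_program(query, iri) for iri in program_iris]
```

Each resulting query would then be sent to the SPARQL endpoint separately, so no single request has to cover all Italian projects at once.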
I added this; the next export should include Italy.
How long does it take? Just to understand.
Didn't try all of it. I filtered on 1 program with ~5000 Italian projects and it took less than a minute.
It worked. It took 1 hour for the whole of Italy. https://kohesio.test.ec.europa.eu/data/projects/ (Don't mind the excess files: if the exporter didn't run for some reason, it couldn't delete the files from the week before. This can be solved easily, but this is only staging anyway.)
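One way the leftover-files issue might be handled is to delete everything older than a cutoff, instead of only the files from exactly one week before. This is only a sketch: the 7-day retention and the rule that `latest_*` files are never removed are assumptions, and `stale_exports` is a hypothetical helper.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def stale_exports(
    mtimes: Dict[str, datetime], now: datetime, keep_days: int = 7
) -> List[str]:
    """Return names of exports older than keep_days, skipping latest_* files."""
    cutoff = now - timedelta(days=keep_days)
    return sorted(
        name
        for name, mtime in mtimes.items()
        if mtime < cutoff and not name.startswith("latest_")
    )
```

Because the selection depends only on file age, a skipped run does not leave files behind permanently; the next run picks them up.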
So all is fine now, can we close?
Done:
What's not done:
let's create another issue with the improvements for next release? for now seems to be good, @madewild ?
Currently all the exports are sitting on an EBS volume:
https://data.linkedopendata.eu/kohesio/projects/ https://dev.data.linkedopendata.eu/kohesio/projects/
It would be better to put them on S3, but we want to keep a neutral URL, not one containing "aws" or "s3". For instance, the latest export of Austria should always be at https://data.linkedopendata.eu/kohesio/projects/latest_AT.csv (or maybe later it will be something like https://data.kohesio.ec.europa.eu/projects/latest_AT.csv)
So we need a script or library to simulate the url in a way transparent to the users. I found https://github.com/rufuspollock/s3-bucket-listing with a demo on http://data.openspending.org/ : good for folders (e.g. http://data.openspending.org/datasets/abc/data/) but not for files (http://data.openspending.org.s3-eu-west-1.amazonaws.com/datasets/abc/data/tmp.csv).
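To make the requirement concrete, here is a rough sketch of the mapping such a proxy would have to perform for files: the path of the neutral URL becomes the object key in the bucket. The bucket name and region are hypothetical placeholders, and the sketch assumes the bucket mirrors the public path one to one.

```python
from urllib.parse import quote, unquote, urlsplit


def s3_object_url(public_url: str, bucket: str, region: str) -> str:
    """Map a neutral public file URL to a virtual-hosted-style S3 object URL."""
    key = unquote(urlsplit(public_url).path).lstrip("/")
    if not key or key.endswith("/"):
        # Folder listings have no single object behind them and would need
        # a separate handler (for example a generated index page).
        raise ValueError("not a file URL")
    return f"https://{bucket}.s3.{region}.amazonaws.com/{quote(key)}"


# Hypothetical bucket and region, for illustration only.
upstream = s3_object_url(
    "https://data.linkedopendata.eu/kohesio/projects/latest_AT.csv",
    bucket="example-bucket",
    region="eu-west-1",
)
```

Users would only ever see the neutral URL; the proxy fetches `upstream` and streams the response back.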
@svili could you look into this?