dbcls / LinkedData-Agora


IMGT-KG #179

Closed by yayamamo 1 month ago

gsanou commented 10 months ago

Hello, I'm one of the developers behind IMGT-KG. During this summer, our entry point (a Fuseki server) was overloaded several times by queries from "Umaka-Crawler/1.0.0 by DBCLS (umakadata@dbcls.jp)". We have banned these requests based on their IPs. We suspect that this is one of your services. It would have been nice to be contacted before this was put in place.

While investigating this, we looked at the DBCLS website and found the YummyData portal, where IMGT-KG is listed among the entry points and ranked.

We also found this IMGT-KG issue in the Linked Data Agora project, where we decided to open a discussion to improve the situation.

yayamamo commented 10 months ago

Hi, thank you for informing us of the issue, and we're sorry for placing a heavy burden on your server. We issue a series of queries in order to share the status of SPARQL endpoints in the life science domain with data consumers and providers, hoping to facilitate mutual understanding and the use of RDF data. It is not our intention to put excessive load on your service, so we would like to stop our crawler from accessing your endpoint. In addition, we would be happy if there were an alternative way of obtaining the service status to share.

gsanou commented 10 months ago

Hi, last time we saw that you tried to run a "construct query"; we did not see the other queries, probably because they happened after the ban. Can you give us your series of queries so that we can try them? Also, the construct query probably crashes because we have around 80M triples handled by Fuseki; you can try this query for that purpose:

CONSTRUCT { ?s ?p ?o. } 
{ SERVICE <https://www.imgt.org/fuseki/ImgtKg/sparql>
 { SELECT ?s ?p ?o { ?s ?p ?o } LIMIT 1 } }
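A bounded probe like this can also be sent straight to the endpoint from a script. The following is only a minimal sketch using the standard SPARQL protocol (a GET request with a `query` parameter); the function names, the `example-probe/0.1` User-Agent string, and the 10-second timeout are illustrative assumptions and do not describe how Umaka-Crawler works.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint URL taken from the SERVICE clause of the query above.
ENDPOINT = "https://www.imgt.org/fuseki/ImgtKg/sparql"

# Same idea as the query above, without SERVICE: construct a single triple
# instead of asking for the whole (roughly 80M triple) graph.
PROBE_QUERY = (
    "CONSTRUCT { ?s ?p ?o } "
    "WHERE { { SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 1 } }"
)


def build_probe_request(endpoint, query, accept="text/turtle",
                        user_agent="example-probe/0.1 (hypothetical)"):
    """Build a SPARQL protocol GET request without sending it."""
    url = endpoint + "?" + urlencode({"query": query})
    # An identifying User-Agent lets the endpoint operator recognise the client.
    return Request(url, headers={"Accept": accept, "User-Agent": user_agent})


def run_probe(request, timeout=10.0):
    """Send the request; return the HTTP status and the Content-Type header."""
    with urlopen(request, timeout=timeout) as response:
        return response.status, response.headers.get("Content-Type", "")


if __name__ == "__main__":
    print(run_probe(build_probe_request(ENDPOINT, PROBE_QUERY)))
```

Keeping the `LIMIT 1` inside the subquery means the server only has to materialise one row, and the explicit timeout keeps a slow endpoint from tying up the client.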

Regards

yayamamo commented 9 months ago

Sorry for our late reply. As for the queries, could you have a look at this page? https://yummydata.org/endpoint/161?date=2023-05-12 This is a log describing what our crawler did to your endpoint on that day. In the second table, each row (such as Availability) has a document icon, which you can click to see the actual request and response.

gsanou commented 9 months ago

Thank you for your updates about our endpoint.

However, we will look into some negative/false results (for instance, Support for Turtle Data Format) that seem wrong to us, considering that the endpoint runs on Apache Fuseki. Also, we saw that the "Class Structure" is empty, whereas a link to it is available from the Site Description (https://imgt.org/imgt-kg/kgmodel.html).

yayamamo commented 2 months ago

@gsanou Hi, sorry for our late action. We have now fixed our crawler based on your suggestions.