Open Olivier4477 opened 3 years ago

Hello,

I am new to Blazegraph and want to use government open data in an app. However, the data (.rdf) is very big (3.30 GB). For example, I load it with:

curl -X POST -H "Content-Type: application/rdf+xml" --data-binary @flux.rdf "http://localhost:9999/blazegraph/namespace/kb/sparql"

Blazegraph takes about two hours to load the data. While writing this, I ran:

curl -X POST http://localhost:9999/blazegraph/namespace/kb/sparql --data-urlencode 'update=DROP ALL'

and obviously the drop also takes very long.

Given that the data (.rdf) is updated every day, how can I keep Blazegraph up to date? Is it possible to update Blazegraph without deleting everything (DROP ALL)? How can I speed up loading and updating the data?

Thank you, and have a good day.

The easiest approach is to run two instances (ideally on two machines). Load into one in the background, cut over once it is loaded, then delete the journal on the other instance and start your next load there.
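The two-instance cutover described above can be sketched as a small script. Everything here is an assumption for illustration: the second port (9998), the file names, and the cutover mechanism are not part of Blazegraph itself.

```shell
#!/bin/sh
# Sketch of the two-instance ("blue/green") daily reload.
# Assumes two Blazegraph instances: port 9999 (active) and 9998 (standby).
# Ports, endpoint paths, and file names are placeholders.

ACTIVE=9999

standby_of() {
  # The instance NOT currently serving traffic.
  if [ "$1" -eq 9999 ]; then echo 9998; else echo 9999; fi
}

reload() {
  target=$(standby_of "$ACTIVE")
  # The slow bulk load hits only the idle instance; queries keep
  # going to the active one in the meantime.
  curl -X POST -H "Content-Type: application/rdf+xml" \
       --data-binary @flux.rdf \
       "http://localhost:${target}/blazegraph/namespace/kb/sparql"
  # Cut over: point the application at the freshly loaded instance.
  # The old instance's journal can then be deleted before the next load.
  ACTIVE=$target
}
```

Here `reload` is only defined, not run; how the application learns the new port (reverse proxy, config reload, etc.) is left open.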
Thank you for your reply.

But I already need a machine with at least 8 GB of RAM for Blazegraph to work. If I have to use a second one, that is not the same budget. Is that really the only solution?

Would it not be possible, for example with

curl -X POST -H "Content-Type: application/rdf+xml" --data-binary @flux.rdf "http://localhost:9999/blazegraph/namespace/kb/sparql"

to specify a table name (for example the current date), load the update at midnight under that new table name, and delete the tables from the previous days? Or is there another possibility?

I really want to use the data the government provides me, but it is RDF/SPARQL...

Thank you so much.
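For reference, the closest analog to a per-day "table" in Blazegraph is a namespace, which the NanoSparqlServer multi-tenancy REST API can create and delete independently. The sketch below assumes that API layout and uses placeholder names; note also that deleting a namespace may not immediately reclaim disk space in the journal.

```shell
#!/bin/sh
# Sketch of a per-day namespace rotation using Blazegraph's multi-tenancy
# REST API (endpoint layout assumed; namespace names are placeholders).

NS="kb_$(date +%Y%m%d)"    # e.g. kb_20210610

# Minimal properties describing the new namespace:
cat > ns.properties <<EOF
com.bigdata.rdf.sail.namespace=${NS}
EOF

# Create today's namespace, load into it, then drop yesterday's.
# Commented out so this sketch has no side effects on a live server:
# curl -X POST -H 'Content-Type: text/plain' --data-binary @ns.properties \
#      "http://localhost:9999/blazegraph/namespace"
# curl -X POST -H 'Content-Type: application/rdf+xml' --data-binary @flux.rdf \
#      "http://localhost:9999/blazegraph/namespace/${NS}/sparql"
# curl -X DELETE "http://localhost:9999/blazegraph/namespace/kb_yesterday"
```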
Run two instances on the same machine then.

There is no trivial way to identify all of the allocations in the storage layer associated with one loaded triple or quad store such that they can be trivially dropped. It is possible to use lower-level APIs to drop indices, but you might not free the allocations immediately if you do that; it depends on how the RWStore is configured.

On the other hand, as long as the machine can handle both workloads (load and query), you can just use two instances. You can also use the DataLoader for loading into the second one. That way the full database is always responding at the same URL and port, with only a short downtime when you kill that process and restart it over the other database.
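The DataLoader mentioned above is a bulk-load utility that ships inside blazegraph.jar. A possible invocation looks like the following; the properties file and data file names are placeholders, and the exact options depend on the Blazegraph version.

```shell
#!/bin/sh
# Bulk load with the DataLoader utility from blazegraph.jar.
# RWStore.properties configures the target journal; both file names
# here are placeholders.
MAIN=com.bigdata.rdf.store.DataLoader

# Commented out so the sketch does not require the jar to be present:
# java -cp blazegraph.jar "$MAIN" -namespace kb RWStore.properties flux.rdf
```

The DataLoader writes directly into a journal file, so it should target the standby instance's journal, not the one currently being queried.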
OK, I think I understood your logic, but to put it into practice I will need help.

To explain: I use a docker-compose file like this (the image is the one provided in the government documentation for using the data):

```yaml
version: '3.1'
services:
  blazegraph:
    image: conjecto/blazegraph:2.1.5
    restart: always
    ports:
      - 9999:9999
    environment:
      JAVA_OPTS: "-Xms2g -Xmx3g"
    volumes:
      - ./dataset:/docker-entrypoint-initdb.d
  datatourisme:
    build: docker
    ports:
      - "8080:80"
    restart: always
    depends_on:
      - blazegraph
```

So for the moment I run docker-compose up, then load the data like this:

curl -X POST -H "Content-Type: application/rdf+xml" --data-binary @flux.rdf "http://localhost:9999/blazegraph/namespace/kb/sparql"

(the data file must be stored in dataset/kb/data). Then, if I want to reload, I must run docker-compose rm blazegraph, then docker system prune, then relaunch Blazegraph. This is how I proceed now.

Before this solution, I used Apache Jena for SPARQL; it took 5 hours to load the data (on my computer with 32 GB of RAM).

Not a Docker expert. You'll need to get someone else's advice on that.
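Two instances on one machine, as suggested earlier, could be expressed in the same compose file roughly like this. The service names, the second host port, and the separate data volumes are assumptions for illustration; memory settings would need to fit the machine.

```yaml
version: '3.1'
services:
  blazegraph-a:
    image: conjecto/blazegraph:2.1.5
    restart: always
    ports:
      - "9999:9999"   # active instance
    environment:
      JAVA_OPTS: "-Xms2g -Xmx3g"
    volumes:
      - ./dataset-a:/docker-entrypoint-initdb.d
  blazegraph-b:
    image: conjecto/blazegraph:2.1.5
    restart: always
    ports:
      - "9998:9999"   # standby instance for background loading
    environment:
      JAVA_OPTS: "-Xms2g -Xmx3g"
    volumes:
      - ./dataset-b:/docker-entrypoint-initdb.d
```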
OK, but ... how would you have done it? Use blazegraph.jar directly?

In any case, thank you very much, and I hope someone else can take over to help me. Thank you so much!