blazegraph / database

Blazegraph High Performance Graph Database
GNU General Public License v2.0

I'm new to blazegraph, could you clarify? #203

Open Olivier4477 opened 3 years ago

Olivier4477 commented 3 years ago

Hello,

I am just discovering Blazegraph. I want to use government data for an app.

However, the data (.rdf) is very big (3.30 GB). For example, if I run:

curl -X POST -H "Content-Type: application/rdf+xml" --data-binary @flux.rdf "http://localhost:9999/blazegraph/namespace/kb/sparql"

Blazegraph takes about 2 hours to load the data. While writing this message, I ran:

curl -X POST http://localhost:9999/blazegraph/namespace/kb/sparql --data-urlencode 'update=DROP ALL'

Obviously the drop also takes a very long time.

Given that the data (.rdf) is updated every day, how can I update Blazegraph? Is it possible to update it without deleting everything (DROP ALL)?

How can I speed up loading and updating the data?

Thank you.

Have a good day.

thompsonbry commented 3 years ago

The easiest is to run two instances (ideally on two machines). Load into one in the background, cut over once loaded, then delete the journal on the other instance and start your next load there.
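
The rotation described above can be sketched as a small piece of bookkeeping. This is only an illustration under assumptions: the directory layout (`slot-a`, `slot-b`), the `active` marker file, and the journal file name `blazegraph.jnl` are hypothetical choices, not something Blazegraph requires. It tracks which copy is live and which is free for the next load.

```shell
#!/bin/sh
# Sketch of the "load into one, cut over, wipe the other" rotation.
# Hypothetical layout: two data directories (slot-a, slot-b) plus a small
# text file recording which one is currently serving queries.
STATE_DIR="${STATE_DIR:-./blazegraph-slots}"

# Print the slot that is currently live ("a" if nothing is recorded yet).
active_slot() {
    if [ -f "$STATE_DIR/active" ]; then
        cat "$STATE_DIR/active"
    else
        echo a
    fi
}

# Print the slot that is free for the next background load.
standby_slot() {
    if [ "$(active_slot)" = "a" ]; then echo b; else echo a; fi
}

# Record the cut-over: the freshly loaded standby becomes the live slot.
cut_over() {
    new="$(standby_slot)"
    mkdir -p "$STATE_DIR"
    echo "$new" > "$STATE_DIR/active"
}

# Delete the journal of the slot that is no longer live,
# so the next load starts from an empty store.
wipe_standby() {
    rm -f "$STATE_DIR/slot-$(standby_slot)/blazegraph.jnl"
}
```

A daily job would load the new file into the slot printed by `standby_slot`, call `cut_over` once the load has finished and the instances have been switched, and then call `wipe_standby` before the next load. Deleting the journal file avoids the slow `DROP ALL`.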

Olivier4477 commented 3 years ago

Thank you for your reply.

But I already need a machine with at least 8 GB for Blazegraph to work. If I have to use a second one, it is not the same budget.

Is it really the only solution?

Is it not possible, for example with:

curl -X POST -H "Content-Type: application/rdf+xml" --data-binary @flux.rdf "http://localhost:9999/blazegraph/namespace/kb/sparql"

to specify a table name (for example the current date), load the update at midnight under the new table name, and delete the table from the day before?

Or is there another possibility?

I really want to use the data the government provides me, but it is RDF / SPARQL.

Thank you so much.

thompsonbry commented 3 years ago

Run two instances on the same machine then.

There is no trivial way to identify all of the allocations in the storage layer associated with one loaded triple or quad store such that they may be trivially dropped.

It is possible to use lower-level APIs to drop indices, but you might not be freeing up the allocations immediately if you do that. This depends on how the RWStore is set up.

On the other hand, as long as the machine can handle the two workloads (load and query) you can just use two instances.

You can also use the DataLoader for loading into the second one. This way you can always have the full database responding at the same URL and port with a short downtime when you kill that process and restart it over the other database.
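
A minimal sketch of that DataLoader-then-restart sequence follows. Several things here are assumptions to check against your Blazegraph version: the file paths, the `-Xmx3g` heap size (borrowed from the reporter's setup), and the exact command-line forms. `com.bigdata.rdf.store.DataLoader`, `-Dbigdata.propertyFile`, and `-Djetty.port` are written as I understand them from the Blazegraph documentation. Each properties file is assumed to name its own journal file (the `com.bigdata.journal.AbstractJournal.file` setting), and the journal being loaded must not be the one the running server has open.

```shell
#!/bin/sh
# Hypothetical defaults; adjust to your installation.
JAR="${JAR:-blazegraph.jar}"
NAMESPACE="${NAMESPACE:-kb}"
PORT="${PORT:-9999}"

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Bulk-load an RDF file into the journal named in the given properties file.
# No HTTP server is involved: the DataLoader writes to the journal directly.
# usage: offline_load <properties-file> <rdf-file>
offline_load() {
    run java -Xmx3g -cp "$JAR" com.bigdata.rdf.store.DataLoader \
        -namespace "$NAMESPACE" "$1" "$2"
}

# Start the server over the journal named in the given properties file,
# on the same port as before so clients keep using the same URL.
# usage: serve <properties-file>
serve() {
    run java -server -Xmx3g -Dbigdata.propertyFile="$1" \
        -Djetty.port="$PORT" -jar "$JAR"
}
```

With `DRY_RUN=1` the functions only print the commands, which is a cheap way to review them first. A nightly job would call `offline_load slot-b/RWStore.properties flux.rdf` while the old server keeps answering queries, then stop that server and call `serve slot-b/RWStore.properties`. The only downtime is the restart.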

Olivier4477 commented 3 years ago

OK, I think I understand your logic, but I will need help to put it into practice.

To explain, I use a docker-compose file like this:

    version: '3.1'

    services:

      blazegraph:
        image: conjecto/blazegraph:2.1.5
        restart: always
        ports:
          - 9999:9999
        environment:
          JAVA_OPTS: "-Xms2g -Xmx3g"
        volumes:
          - ./dataset:/docker-entrypoint-initdb.d

      datatourisme:
        build: docker
        ports:
          - "8080:80"
        restart: always
        depends_on:
          - blazegraph

This image is provided in the government documentation for using the data.

So for the moment I run docker-compose up, then I load the data like this:

curl -X POST -H "Content-Type: application/rdf+xml" --data-binary @flux.rdf "http://localhost:9999/blazegraph/namespace/kb/sparql"

(the data file must be stored in dataset/kb/data).

Then, if I want to reload, I must run docker-compose rm blazegraph, then docker system prune, then relaunch Blazegraph.

This is how I proceed now.

Before this solution, I used Apache Jena (Java) for SPARQL; it took 5 hours to load the data (on my computer with 32 GB of RAM).

thompsonbry commented 3 years ago

Not a Docker expert. You'll need to get someone else's advice on that.

Olivier4477 commented 3 years ago

OK, but how would you have done it? By using blazegraph.jar directly?

In any case, thank you very much. I hope someone else can take over and help me.

Thank you so much!