LOD-Laundromat / lodlaundry.github.io

http://lodlaundromat.org

Allow TPF querying of quads #75

Closed RubenVerborgh closed 9 years ago

RubenVerborgh commented 9 years ago

It seems that datasets in quad format cannot yet be queried as Triple Pattern Fragments / HDT. The easy way to enable this would be:

  1. Convert quads to triples by dropping the fourth element
  2. Convert resulting dataset into HDT

While step 1 might be up for debate, it's better than nothing. Furthermore, a substantial number of quad datasets use the same graph URI for (almost) all of their statements, so dropping it does not lose much information. A sketch of the conversion is below.
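To make the intent concrete, here is a minimal sketch of those two steps, assuming rdflib for the quad-to-triple conversion; this is only an illustration, not the Laundromat's actual pipeline code, and the file names are placeholders.

```python
# Minimal sketch (assuming rdflib): strip the graph component from an
# N-Quads file so the result can be handed to an HDT converter.
from rdflib import Dataset, Graph

ds = Dataset()
ds.parse("input.nq", format="nquads")        # load the quad dataset

triples = Graph()
for s, p, o, _graph in ds.quads((None, None, None, None)):
    triples.add((s, p, o))                   # step 1: drop the fourth element

triples.serialize(destination="output.nt", format="nt")
# step 2 happens outside Python: convert output.nt to HDT,
# e.g. with hdt-cpp's rdf2hdt tool.
```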

LaurensRietveld commented 9 years ago

This should already happen. See for example this quad file http://download.lodlaundromat.org/8c4b544fd011889a8273ea5b70c55377, which is accessible as TPF at http://ldf.lodlaundromat.org/8c4b544fd011889a8273ea5b70c55377 and as HDT at http://download.lodlaundromat.org/8c4b544fd011889a8273ea5b70c55377?type=hdt.
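As a quick way to verify this (an illustrative check assuming the URL patterns above and the `requests` library, not an official Laundromat tool), one can probe both endpoints:

```python
# Hedged check that a cleaned dataset is reachable both as TPF and as HDT.
import requests

doc = "8c4b544fd011889a8273ea5b70c55377"
tpf = requests.head(f"http://ldf.lodlaundromat.org/{doc}", allow_redirects=True)
hdt = requests.head(f"http://download.lodlaundromat.org/{doc}?type=hdt", allow_redirects=True)
print("TPF:", tpf.status_code)   # 200 would indicate the TPF endpoint is up
print("HDT:", hdt.status_code)   # 200 would indicate the HDT file is available
```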

Do you have an example where this fails?

RubenVerborgh commented 9 years ago

I tried it with http://download.bio2rdf.org/release/3/drugbank/drugbank.nq.gz and a couple more. It should be available at http://ldf.lodlaundromat.org/d83770299490c295aaa292418e34c26c, but it doesn't work.

LaurensRietveld commented 9 years ago

Some background: the problem was a glitch in our pipeline that caused the final step (updating/refreshing the LDF API) to fail. The example I gave was processed before I introduced this bug (around Friday), whereas the file you gave was only processed recently (this morning).

The issue is now resolved, and the server is catching up to add these datasets to the LDF API; they should be there in an hour or so.

Anyway, thanks for the report ;)

RubenVerborgh commented 9 years ago

Great, thanks!

RubenVerborgh commented 9 years ago

Some other quad datasets I checked are still "pending", e.g. http://download.bio2rdf.org/release/3/ncbigene/gene2accession.nq.gz.

Is that expected behavior, or are they blocked?

LaurensRietveld commented 9 years ago

First, the basket listing showed this dataset as 'pending' when its status should have been 'unpacked'. I've resolved this in issue #76.

The basket now shows 460 entries in the queue (including the dataset you mentioned). This is expected behaviour. Some background on why this happens: our pipeline consists of unpacking and cleaning the data. To keep memory and disk use under control, we have separate processes and queues for these jobs (i.e., it is not a serial pipeline), and we use different queues for small and large datasets, with fewer threads allocated to the large ones. This both bounds memory use and lets submissions of small datasets be cleaned quickly.

For some of the datasets still in the queue, unpacking went faster than cleaning, so a cleaning backlog has built up. Moreover, most of these datasets are quite large, and since we only allocate a limited number of threads to medium and large datasets, that queue drains slowly. A rough sketch of the queue layout follows below.
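The following is a rough illustration of that layout, not the actual LOD Laundromat code: unpacking and cleaning are separate stages, and the cleaning stage keeps separate pools for small and large datasets so a few huge files cannot starve quick submissions. All names, sizes, and the cutoff below are assumptions made for the example.

```python
# Illustrative sketch of separate unpack/clean stages with size-based cleaning pools.
from concurrent.futures import ThreadPoolExecutor

LARGE_CUTOFF = 100 * 1024 * 1024                   # hypothetical "large dataset" threshold, bytes

unpack_pool = ThreadPoolExecutor(max_workers=4)    # unpacking stage
clean_small = ThreadPoolExecutor(max_workers=6)    # many threads: small submissions finish fast
clean_large = ThreadPoolExecutor(max_workers=2)    # few threads: bounds memory and disk use

def clean(dataset):                                # placeholder for the cleaning step
    print(f"cleaning {dataset['name']}")

def unpack(dataset):                               # placeholder for the unpacking step
    print(f"unpacking {dataset['name']}")
    # hand the dataset to the cleaning pool that matches its size
    pool = clean_large if dataset["size"] > LARGE_CUTOFF else clean_small
    pool.submit(clean, dataset)

for d in [{"name": "small.nq.gz", "size": 10_000},
          {"name": "gene2accession.nq.gz", "size": 5_000_000_000}]:
    unpack_pool.submit(unpack, d)
```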

RubenVerborgh commented 9 years ago

Thanks!