eclipse-rdf4j / rdf4j

Eclipse RDF4J: scalable RDF for Java
https://rdf4j.org/
BSD 3-Clause "New" or "Revised" License

Support for HDT #232

Open barthanssens opened 8 years ago

barthanssens commented 8 years ago

It would be nice to have a Rio reader/writer for HDT.

ansell commented 8 years ago

Some related background for a failed Google Summer of Code project hosted by Apache Marmotta that I was involved in can be found at:

https://issues.apache.org/jira/browse/MARMOTTA-593

The main sticking point was that Apache Marmotta and Eclipse RDF4J cannot reuse any of the core RDF/HDT library because it is released as LGPL. Looking at it again, they provide an "API" module that is Apache-licensed, but I haven't looked through it closely enough to know whether it would actually be useful without the core module, or whether it would limit the way HDT could be implemented in RDF4J:

https://github.com/rdfhdt/hdt-java

abrokenjester commented 8 years ago

Just in case people are not aware: RDF4J already has support for a binary RDF format, developed in-house. Documentation can be found here: http://rdf4j.org/doc/rdf4j-binary-rdf-format/.

Compared to HDT it lacks some important features (indexing/immediate search-browse in particular), but it does offer a compact transfer/storage format.

barthanssens commented 8 years ago

Well, I'll give it a try and see how far I get (based upon the spec itself to avoid license issues). It can take a while though ;-)

ansell commented 8 years ago

The other issue that we had for Marmotta was that the spec wasn't up to date with the current implementation, so you couldn't rely on it for interpreting any files generated by hdt-java. That situation may have improved since then if the spec has been updated recently.

ansell commented 8 years ago

@jeenbroekstra The HDT-Java format indexes are being utilised for in-place searching/filtering by a few different groups in bioinformatics, so they are proving to be valuable at least in that area.

barthanssens commented 8 years ago

@ansell I happen to know the Belgian guys working on hdt-java/hdt-cpp, so I'll ask them whether the spec on hdt.org has been updated accordingly (it is mentioned that the HDT spec was updated after it was submitted to the W3C, so hopefully the website is now documenting the latest version).

abrokenjester commented 8 years ago

Great that you want to try and pick this up, Bart! Before you dig in, please have a look at the contributor guidelines, in particular the points on how to sign the Eclipse CLA and how to pick the right branch, etc. It's easier if you get this sorted before you start committing fixes :) Let me know if you need any help with any of it.

barthanssens commented 8 years ago

Thanks. FYI, I'm working on the reader part (at least the easiest bits so far), and I've contacted one of the maintainers of the HDT spec. The good news is that they are going to make it available on GitHub, so it should be easier to contribute and keep the documentation up to date.

barthanssens commented 8 years ago

List of (mostly small) issues with the spec: https://github.com/rdfhdt/rdfhdt.org/issues.

barthanssens commented 7 years ago

Not much progress at the moment, quite a few questions on the RDFHDT list remain unanswered :-/

barthanssens commented 4 years ago

A first (experimental / non-optimized) version of the HDT parser seems to work on a specific set of files, so perhaps this could go into 3.2.0 as an experimental feature.
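For anyone following along: the compactness an HDT parser has to unpack comes from dictionary encoding, i.e. HDT stores a dictionary mapping each distinct term to an integer ID plus a triples section containing only ID triples. A minimal self-contained sketch of that encoding idea (class and method names are illustrative, not from the actual hdt-java API):

```java
import java.util.*;

// Illustrative sketch of HDT-style dictionary encoding:
// each distinct term gets a 1-based integer ID, and triples
// are stored as ID triples instead of repeated strings.
public class DictDemo {
    static Map<String, Integer> dict = new LinkedHashMap<>();
    static List<String> terms = new ArrayList<>();

    // Return the ID for a term, assigning the next free ID if new.
    static int id(String term) {
        return dict.computeIfAbsent(term, t -> {
            terms.add(t);
            return terms.size();
        });
    }

    public static void main(String[] args) {
        int[] t1 = { id("ex:alice"), id("foaf:knows"), id("ex:bob") };
        int[] t2 = { id("ex:bob"), id("foaf:knows"), id("ex:alice") };
        // Shared terms reuse IDs; that reuse is where the compression comes from.
        System.out.println(Arrays.toString(t1));      // [1, 2, 3]
        System.out.println(Arrays.toString(t2));      // [3, 2, 1]
        // Decoding is an index lookup: IDs are 1-based positions in the term list.
        System.out.println(terms.get(t1[0] - 1));     // ex:alice
    }
}
```

The real format goes much further (separate subject/predicate/object dictionaries, front-coded string storage, bitmap-compressed triple trees), but the ID-triple layer above is the part a reader reconstructs first.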

barthanssens commented 4 years ago

Initial support for parser is merged, so I can get started with testing it with larger files :-)

abrokenjester commented 4 years ago

What is the current status of this? I know we merged an initial implementation of a parser, but that the writer gave us trouble. This has been sitting in "blocked" for a while now. Should we close with a "won't fix" for now? Or put it back in the backlog?

barthanssens commented 4 years ago

Yeah, the writer wasn't good performance-wise, so I have to rethink it (and perhaps check how the HDT C++ tool generates HDT).
I don't have much time to work on it, unfortunately... so backlog it is...