Open barthanssens opened 8 years ago
Some related background for a failed Google Summer of Code project hosted by Apache Marmotta that I was involved in can be found at:
https://issues.apache.org/jira/browse/MARMOTTA-593
The main sticking point was that Apache Marmotta and Eclipse RDF4J cannot reuse any of the core RDF/HDT library because it is released under the LGPL. Looking at it again, they provide an "API" module that is Apache licensed, but I haven't looked through it to know whether it would actually be useful without the core module, or whether it would limit the way HDT could be implemented in RDF4J:
Just in case people are not aware: RDF4J does already have support for a binary RDF format, developed in-house. Documentation can be found here: http://rdf4j.org/doc/rdf4j-binary-rdf-format/
Compared to HDT it lacks some important features (indexing/immediate search-browse in particular), but it does offer a compact transfer/storage format.
Well, I'll give it a try and see how far I get (based upon the spec itself to avoid license issues). It can take a while though ;-)
The other issue that we had for Marmotta was that the spec wasn't up to date with the current implementation, so you couldn't rely on it for interpreting any files generated by hdt-java. That situation may have improved since then if the spec has been updated recently.
@jeenbroekstra The HDT-Java format indexes are being utilised for in-place searching/filtering by a few different groups in bioinformatics so they are proving to be valuable at least for that area.
@ansell I happen to know the Belgian guys working on hdt-java/hdt-cpp, so I'll ask them if the spec on hdt.org has been updated accordingly (it is mentioned that the HDT spec was updated after submitting it to W3C, so hopefully the website is now documenting the latest version).
Great that you want to try and pick this up, Bart! Before you dig in, please have a look at the contributor guidelines, in particular the points on how to sign the Eclipse CLA and how to pick the right branch etc. It's easier if you get this sorted before you start committing fixes :) Let me know if you need any help with any of it.
Thanks. FYI, I'm working on the reader-part (at least the easiest bits so far), and I've contacted one of the maintainers of the HDT spec. The good news is that they are going to make it available on github, so it should be easier to contribute / keep the documentation up to date.
List of (mostly small) issues with the spec: https://github.com/rdfhdt/rdfhdt.org/issues
Not much progress at the moment, quite a few questions on the RDFHDT list remain unanswered :-/
First (experimental / non-optimized) version of the HDT parser seems to work on a specific set of files, so perhaps this could go into 3.2.0 as an experimental feature.
Initial support for the parser is merged, so I can get started with testing it with larger files :-)
What is the current status of this? I know we merged an initial implementation of a parser, but that the writer gave us trouble. This has been sitting in "blocked" for a while now. Should we close with a "won't fix" for now? Or put it back in the backlog?
Yeah, the writer wasn't good performance-wise, so I have to rethink it (and perhaps check how the HDT C++ program is generating HDT).
I don't have much time to work on it, unfortunately ... so backlog it is...
Would be nice to have a Rio reader/writer for HDT.