Closed rivettp closed 2 years ago
There's another problem (which, as far as I can tell, has not been reported as an issue) that Vladimir mentioned in a comment: riot does not work on this output because of a lack of Unicode C compliance. Since the obvious way to address this issue is to use riot, you can't fix it without first fixing the encoding issue.
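For anyone who wants to check a file before handing it to riot, here is a minimal sketch in Python. It assumes that "Unicode C compliance" means Unicode Normalization Form C (NFC), which is my reading and not something confirmed in this thread. The function names and the sample strings are made up for illustration.

```python
import unicodedata


def non_nfc_lines(lines):
    """Yield (line_number, line) for every line that is not in NFC.

    Assumption: the encoding problem is about Normalization Form C.
    """
    for number, line in enumerate(lines, start=1):
        if not unicodedata.is_normalized("NFC", line):
            yield number, line


def to_nfc(text):
    """Return the NFC-normalized form of the given text."""
    return unicodedata.normalize("NFC", text)


# Hypothetical sample: the first string uses "u" + combining diaeresis,
# the second uses the precomposed character.
sample = ["Mu\u0308nchen GmbH", "M\u00fcnchen GmbH"]
for number, line in non_nfc_lines(sample):
    print(f"line {number} is not NFC; normalized: {to_nfc(line)!r}")
```

`unicodedata.is_normalized` needs Python 3.8 or later. For the multi-gigabyte files you would stream the file line by line instead of loading it all at once.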
While testing this, I discovered that the addresses in the original GLEIF data have values in the "Jurisdiction" field for some countries (Colombia and Germany in particular) that do not correspond to the subdivision codes in LCC. In the case of Germany, LCC has the states (there are just 16 of them), but GLEIF uses municipalities here. This results in a bunch of mismatch warnings; I'm not sure what ra-to-rdf does in these cases.
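To illustrate the kind of mismatch I mean, here is a rough Python sketch. It assumes the subdivision codes are ISO 3166-2 style codes for the 16 German states; the record layout, the field name "jurisdiction" and the sample values are hypothetical and are not taken from the GLEIF files or from ra-to-rdf.

```python
# ISO 3166-2 codes for the 16 German states.
GERMAN_STATE_CODES = {
    "DE-BW", "DE-BY", "DE-BE", "DE-BB", "DE-HB", "DE-HH", "DE-HE", "DE-MV",
    "DE-NI", "DE-NW", "DE-RP", "DE-SL", "DE-SN", "DE-ST", "DE-SH", "DE-TH",
}


def find_mismatches(records, known_codes):
    """Return the records whose jurisdiction is not a known subdivision code."""
    return [r for r in records if r["jurisdiction"] not in known_codes]


# Hypothetical records: one state-level code, one municipality-style value.
records = [
    {"id": "example-1", "jurisdiction": "DE-BY"},
    {"id": "example-2", "jurisdiction": "Muenchen"},
]

for record in find_mismatches(records, GERMAN_STATE_CODES):
    print(f"{record['id']}: no matching subdivision for {record['jurisdiction']!r}")
```

A check like this only reports the mismatches; deciding how to map a municipality to its state is a separate question.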
This probably isn't new, so all the versions probably have the same issue.
There is another alternative to riot that I told @bryonjacob and Dash about. Quoting my earlier message (Pete Rivett, 11:54 AM): I found a great alternative, rdfpro (https://rdfpro.fbk.eu/), which worked right away with L1Data, though it seems to add more formatting, so the output is not quite as small. Even so, L1Data shrinks from 9.3 GB to 5.2 GB. It is a very nice tool with lots of transformation options, and it is easy to download and install: just a single zip. Just run rdfpro -V @read L1Data.rdf @write L1Data.ttl (you can omit -V, which is just for tracing). It averages 145k triples per second. Source is at https://github.com/dkmfbk/rdfpro
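If we end up scripting the conversion, something like the following Python wrapper might do. It is only a sketch: it uses nothing beyond the rdfpro arguments quoted above (-V, @read, @write), and the function names are my own. It assumes rdfpro is on the PATH.

```python
import shutil
import subprocess


def build_rdfpro_command(source, target, trace=False):
    """Build the argument list for an rdfpro read-then-write conversion."""
    command = ["rdfpro"]
    if trace:
        command.append("-V")  # tracing only; safe to omit
    command += ["@read", source, "@write", target]
    return command


def convert(source, target, trace=False):
    """Run rdfpro and raise an error if it is missing or the run fails."""
    if shutil.which("rdfpro") is None:
        raise FileNotFoundError("rdfpro not found on PATH")
    subprocess.run(build_rdfpro_command(source, target, trace), check=True)


# Show the command that would be run for the L1 file.
print(" ".join(build_rdfpro_command("L1Data.rdf", "L1Data.ttl", trace=True)))
```

Calling convert("L1Data.rdf", "L1Data.ttl") would then run the same command as the one-liner above, without tracing.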
Even if we use a different tool than riot, we'll still want to fix the encodings (which I have already done), because other people might want to download the files and use them with Jena. But I'll look into rdfpro; if it runs faster than riot, that's a good reason to use it for the large files.
This will reduce size and make them more readable and consistent. We have experimented with command line utilities that can do this with reasonable performance.