elastic / stream2es

Stream data into ES (Wikipedia, Twitter, stdin, or other ESes)

[Wikidata] Json dumps import. #43

Closed SergeC closed 8 years ago

SergeC commented 9 years ago

Hi. Can you please add support for importing the JSON dumps from https://dumps.wikimedia.org/other/wikidata/ ? Or, if it is possible to load them via stdin, please provide some instructions.

drewr commented 9 years ago

I haven't retrieved one of those archives before. What JSON format do they use? If it's jsonlines, then it would just be a matter of curl -s https://dumps.wikimedia.org/other/wikidata/20150504.json.gz | stream2es stdin --log debug. If they put all the content in a single data structure you'll need a streaming JSON parser, for which there isn't a stream2es stream yet.
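For reference, "jsonlines" means one complete JSON document per line. A minimal sketch of that shape (the Q1/Q2 records are made-up samples, not taken from a real dump):

```shell
# Two standalone JSON documents, one per line -- the jsonlines shape
# that `stream2es stdin` can consume directly. A single top-level
# array, by contrast, needs a streaming parser or preprocessing.
printf '%s\n' '{"id":"Q1","type":"item"}' '{"id":"Q2","type":"item"}'
```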

SergeC commented 9 years ago

The format is an array of line-separated JSON objects, like [ {}, {}, ].

The problems are the enclosing [] and the comma at the end of each line.

The output is:

curl -s https://dumps.wikimedia.org/other/wikidata/20150504.json.gz | ./stream2es stdin --log debug
2015-05-05T13:46:36.332-0400 DEBUG stream stdin to http://localhost:9200/foo/t
Exception in thread "stream dispatcher" com.fasterxml.jackson.core.JsonParseException: Illegal character ((CTRL-CHAR, code 31)): only regular white space (\r, \n, \t) is allowed between tokens at [Source: java.io.StringReader@3bec1113; line: 1, column: 2]
    at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1378)
    at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:599)
    at com.fasterxml.jackson.core.base.ParserMinimalBase._throwInvalidSpace(ParserMinimalBase.java:545)
    at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._skipWSOrEnd(ReaderBasedJsonParser.java:1674)
    at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:561)
    at cheshire.parse$parse.invoke(parse.clj:58)
    at cheshire.core$parse_string.doInvoke(core.clj:87)
    at clojure.lang.RestFn.invoke(RestFn.java:423)
    at stream2es.stream.stdin$fn3550.invoke(stdin.clj:54)
    at stream2es.stream$fn2767$G27622774.invoke(stream.clj:12)
    at stream2es.main$make_object_processor$fn4056.invoke(main.clj:221)
    at stream2es.main$start_doc_stream$disp4030.invoke(main.clj:206)
    at clojure.lang.AFn.run(AFn.java:24)
    at java.lang.Thread.run(Thread.java:745)
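Worth noting: character code 31 is 0x1f, the first byte of the gzip magic number (0x1f 0x8b), so the parser appears to have received the compressed bytes directly. A quick sanity check, plus the decompression step the pipeline would need (this fixes the parse error shown here, though the array brackets and trailing commas would still remain to deal with):

```shell
# CTRL-CHAR code 31 is 0x1f, the first byte of any gzip stream's magic
# number (0x1f 0x8b). Inspect the first byte of a gzip-compressed blob:
printf 'hello' | gzip -c | od -An -tu1 | awk '{print $1; exit}'   # prints 31
# A zcat step between curl and stream2es would decompress first, e.g.:
#   curl -s https://dumps.wikimedia.org/other/wikidata/20150504.json.gz \
#     | zcat | ./stream2es stdin --log debug
```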

drewr commented 9 years ago

That would involve a streaming JSON parser. What's the problem with the integrated wiki stream?

diadistis commented 9 years ago

A bit too late maybe, but I'll post it anyway for anyone who might find this one-liner for importing wikidata useful:

curl -s http://dumps.wikimedia.org/other/wikidata/20150810.json.gz | zcat | sed 's/,$//' | sed '/^\(\[\|\]\)$/d' | stream2es stdin --target http://xxx:9200/index/target
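As a sanity check of that filter, here is what the two seds do to a tiny inline sample standing in for the real dump (note the `\|` alternation is a GNU sed extension):

```shell
# Miniature stand-in for the wikidata dump: a JSON array with one
# element per line. The first sed strips the trailing comma from each
# element; the second deletes the lines that hold only "[" or "]",
# leaving one standalone JSON document per line (jsonlines).
printf '[\n{"id":"Q1"},\n{"id":"Q2"}\n]\n' \
  | sed 's/,$//' \
  | sed '/^\(\[\|\]\)$/d'
```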

parasitid commented 8 years ago

You could use the jq tool to process these lines instead of the seds, with the --stream option to handle huge files.

curl -s http://dumps.wikimedia.org/other/wikidata/20150810.json.gz | zcat | jq -nc -r --stream 'fromstream(1|truncate_stream(inputs)) | ...
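A complete filter of that shape might look like the following; this is a sketch tested against a small inline array standing in for the dump, not against a full wikidata archive:

```shell
# Streaming parse of a top-level JSON array: --stream turns the input
# into [path, leaf] events without loading the whole document,
# truncate_stream(1) drops the outer array index from each path, and
# fromstream reassembles the individual elements, one per output line.
printf '[{"id":"Q1"},{"id":"Q2"}]' \
  | jq -cn --stream 'fromstream(1|truncate_stream(inputs))'
```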