diegoceccarelli / json-wikipedia

Json Wikipedia, contains code to convert the Wikipedia xml dump into a json/avro dump
Apache License 2.0

Add link types and metadata #27

Closed diegoceccarelli closed 6 years ago

diegoceccarelli commented 6 years ago

I ran a test on my server. It took around 10 hours to convert a full dump, and the dump is now 17 GB compressed (vs. 15 GB with the normal links). I think it's fine.

tgalery commented 6 years ago

I'm just wondering if we should squash some of the commits (I might try to backport some of these changes to the idio repo). I'm happy to squash some of my commits and @rod3go's. If you're OK with that, should I push straight to this branch?

diegoceccarelli commented 6 years ago

I think that when you merge, you can optionally ask to squash the commits into a single one. I would do that.

diegoceccarelli commented 6 years ago

@tgalery I fixed the code indentation (it was indeed weird) and squashed everything into a single commit.

tgalery commented 6 years ago

@diegoceccarelli looks good to me. I'm happy for it to be merged.

diegoceccarelli commented 6 years ago

Go for it! :)
