american-art / acm

Amon Carter Museum of American Art

Why is the converted data (acm/acm-objects/) split into 7 files? #16

Closed: kateblanch closed this issue 7 years ago

kateblanch commented 7 years ago

It would be helpful to understand the logic behind splitting the data into 7 files. Going forward, we're going to need to know on what parameter the splits were made and what purpose they serve. It's hard to manage 7 files.

kateblanch commented 7 years ago

https://github.com/american-art/acm/blob/master/acm_split_file.py

I see the above file was created to do the split, but if someone could translate the "why" that'd be great :)

workergnome commented 7 years ago

They just pull 1000 items, write them to a file, and then open the next file to write the next 1000 items. I would bet that Karma has a limitation on the size of XML file it can load, and this is an easy way to work around that limitation.
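
For reference, here is a minimal sketch of that chunk-and-write pattern. This is not the repository's acm_split_file.py; the tag names, output naming, and chunk size here are illustrative placeholders only:

```python
# Minimal sketch of the "write every 1000 items to a new file" approach.
# NOT the repo's acm_split_file.py: the tag name ("objects"), output
# naming, and CHUNK_SIZE are illustrative placeholders.
import xml.etree.ElementTree as ET

CHUNK_SIZE = 1000  # records per output file

def split_xml(source_path, out_prefix, root_tag="objects"):
    tree = ET.parse(source_path)           # load the single large file once
    records = list(tree.getroot())         # all top-level record elements
    for i in range(0, len(records), CHUNK_SIZE):
        chunk_root = ET.Element(root_tag)  # fresh root for each smaller file
        chunk_root.extend(records[i:i + CHUNK_SIZE])
        out_path = f"{out_prefix}{i // CHUNK_SIZE + 1}.xml"
        ET.ElementTree(chunk_root).write(out_path, encoding="utf-8",
                                         xml_declaration=True)

# e.g. split_xml("acm-objects.xml", "acm-objects-")  ->  acm-objects-1.xml, ...
```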

caknoblock commented 7 years ago

David is correct: Karma cannot process huge XML files. The data processing runs in Spark, and because XML is not a streaming file format, the entire file must be loaded into memory. The full XML file is too large for the system to handle, so we wrote a script that automatically divides it into smaller files. New data can still be published in the original single-file format.
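
As a quick sanity check (a hypothetical sketch, not part of the repo; the glob pattern and record tag name are assumptions), one can confirm that the smaller files together still account for every record:

```python
# Hypothetical check that the split files together contain the expected
# number of records; the glob pattern and record tag name are assumptions.
import glob
import xml.etree.ElementTree as ET

def count_records(pattern, record_tag="object"):
    total = 0
    for path in sorted(glob.glob(pattern)):
        root = ET.parse(path).getroot()    # each split file is small enough
        n = len(root.findall(record_tag))  # direct children with this tag
        print(f"{path}: {n}")
        total += n
    return total

# e.g. count_records("acm-objects/*.xml")
```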


{"api_version":"1.0","publisher":{"api_key":"05dde50f1d1a384dd78767c55493e4bb","name":"GitHub"},"entity":{"external_key":"github/american-art/acm","title":"american-art/acm","subtitle":"GitHub repository","main_image_url":"https://cloud.githubusercontent.com/assets/143418/17495839/a5054eac-5d88-11e6-95fc-7290892c7bb5.png","avatar_image_url":"https://cloud.githubusercontent.com/assets/143418/15842166/7c72db34-2c0b-11e6-9aed-b52498112777.png","action":{"name":"Open in GitHub","url":"https://github.com/american-art/acm"}},"updates":{"snippets":[{"icon":"PERSON","message":"@workergnome in #16: They just pull 1000 items, write it to a file, and then open the next file to write the next 1000 items. I would bet that Karma has a limitation in the size of XML file that it can load, and this is an easy way to work around that limitation.\r\n\r\n"}],"action":{"name":"View Issue","url":"https://github.com/american-art/acm/issues/16#issuecomment-311228383"}}}