atlarge-research / AIP

An instrument to combine, unify, and correct (scientific) article meta-data.

PERFORMANCE: Make AIP runnable through Dask or another platform to parallelize the parsing #5

Open lfdversluis opened 4 years ago

lfdversluis commented 4 years ago

Each sub-set of data and each data source can be processed in parallel. Dask can be used to parallelize this.
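As a rough illustration of source-level parallelism, the sketch below runs one parser per data source through the standard-library `concurrent.futures` interface rather than Dask itself. The parser functions `parse_source_a` and `parse_source_b` and the helper `run_sources` are hypothetical placeholders, not part of AIP.

```python
from concurrent.futures import Executor, ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List


def run_sources(
    parsers: Dict[str, Callable[[], List[dict]]],
    executor: Executor,
) -> Dict[str, List[dict]]:
    """Run one independent parser per data source and collect results by name."""
    # Submit every parser up front so the executor can overlap them.
    futures = {executor.submit(parse): name for name, parse in parsers.items()}
    results: Dict[str, List[dict]] = {}
    for future in as_completed(futures):
        # result() re-raises any exception raised inside the parser.
        results[futures[future]] = future.result()
    return results


# Hypothetical stand-ins for the real per-source parsers.
def parse_source_a() -> List[dict]:
    return [{"source": "a", "title": "Paper 1"}]


def parse_source_b() -> List[dict]:
    return [{"source": "b", "title": "Paper 2"}]


# Threads keep the demo simple; CPU-bound parsing would need processes
# (or a Dask cluster) to run truly in parallel.
with ThreadPoolExecutor(max_workers=2) as pool:
    source_results = run_sources({"a": parse_source_a, "b": parse_source_b}, pool)
```

Because `run_sources` only depends on the `Executor` interface, a `ProcessPoolExecutor` could be passed instead, and a Dask-based variant would follow the same submit-and-gather shape.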

lfdversluis commented 4 years ago

https://joblib.readthedocs.io/en/latest/ seems promising.

lfdversluis commented 4 years ago

It might be interesting to investigate whether the XML file and the JSON files of Semantic Scholar / AMiner can be parallelized at the item level. joblib, linked above, makes file-level parallelization possible, but the JSON files are structured so that each line is one standalone JSON object. Parsing these lines in parallel may be even faster.
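A minimal sketch of the line-level idea, assuming the input is one JSON object per line. It uses the standard-library `concurrent.futures` interface as a stand-in for joblib or Dask, and the helper names `chunked`, `parse_chunk`, and `parse_json_lines` are made up for illustration. Lines are batched into chunks so that the per-task overhead does not dominate the parsing itself.

```python
import json
from concurrent.futures import Executor, ThreadPoolExecutor
from itertools import islice
from typing import Iterable, Iterator, List


def chunked(lines: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` lines."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def parse_chunk(chunk: List[str]) -> List[dict]:
    """Parse a batch of JSON Lines records, skipping blank lines."""
    return [json.loads(line) for line in chunk if line.strip()]


def parse_json_lines(
    lines: Iterable[str],
    executor: Executor,
    chunk_size: int = 1000,
) -> List[dict]:
    """Parse one-object-per-line JSON via an executor, preserving input order."""
    records: List[dict] = []
    # Executor.map yields results in the order the chunks were submitted.
    for parsed in executor.map(parse_chunk, chunked(lines, chunk_size)):
        records.extend(parsed)
    return records


# Small in-memory demo; a real run would pass an open file object instead.
demo_lines = ['{"id": 1, "title": "A"}\n', "\n", '{"id": 2, "title": "B"}\n']
with ThreadPoolExecutor(max_workers=2) as pool:
    demo_records = parse_json_lines(demo_lines, pool, chunk_size=1)
```

The thread pool here only demonstrates the structure. JSON parsing is CPU-bound, so an actual speedup would most likely require passing a `ProcessPoolExecutor` (created under an `if __name__ == "__main__":` guard) and tuning `chunk_size`. Whether that beats file-level parallelism is exactly what would need to be measured.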

lfdversluis commented 4 years ago

Setting up some benchmarks + regression tests might be a nice idea as well.
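One possible shape for such a check, sketched under the assumption that a serial parser serves as the trusted baseline: the regression part asserts that the parallel output matches the serial output exactly, and the benchmark part records wall-clock time for both. All function names below are hypothetical, and the synthetic records merely stand in for real article meta-data.

```python
import json
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple


def parse_serial(lines: List[str]) -> List[dict]:
    """Baseline: parse every line in order on a single thread."""
    return [json.loads(line) for line in lines]


def parse_parallel(lines: List[str], workers: int = 4) -> List[dict]:
    """Candidate: parse lines through a pool; map() preserves input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(json.loads, lines))


def timed(fn: Callable[[List[str]], List[dict]], lines: List[str]) -> Tuple[List[dict], float]:
    """Return the function's result together with its wall-clock duration in seconds."""
    start = time.perf_counter()
    result = fn(lines)
    return result, time.perf_counter() - start


def check_regression(lines: List[str]) -> Dict[str, float]:
    """Fail if the parallel parser diverges from the baseline; report both timings."""
    expected, serial_s = timed(parse_serial, lines)
    actual, parallel_s = timed(parse_parallel, lines)
    assert actual == expected, "parallel parser diverged from serial baseline"
    return {"serial_s": serial_s, "parallel_s": parallel_s}


# Synthetic input standing in for a real data file.
sample_lines = [json.dumps({"id": i, "title": f"Paper {i}"}) for i in range(1000)]
timings = check_regression(sample_lines)
```

A single timing like this is noisy, so a real benchmark would repeat the measurement several times on representative files and compare the results across commits.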