Open lfdversluis opened 4 years ago
joblib (https://joblib.readthedocs.io/en/latest/) seems promising for this.
It might be interesting to investigate whether the XML file and the JSON files of Semantic Scholar / AMiner can be parallelized at the item level. With joblib (linked above), file-level parallelization becomes possible, but the JSON files are structured so that each line is one standalone JSON object. Parsing those lines in parallel may be even faster.
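A minimal sketch of the line-level idea, to make it concrete. It uses only the standard library (`concurrent.futures`) rather than joblib, and the names `parse_chunk`, `chunked`, and `parse_json_lines` are hypothetical, not taken from this repository. Lines are grouped into chunks so that each task carries more than one tiny `json.loads` call. The default executor here is a thread pool so the sketch runs anywhere; threads will probably not speed up pure JSON parsing in CPython because of the GIL, so a process-based executor or joblib is what would actually need to be benchmarked.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def parse_chunk(lines):
    """Parse a list of JSON Lines strings, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def parse_json_lines(lines, chunk_size=1000, max_workers=4,
                     executor_cls=ThreadPoolExecutor):
    """Parse JSON Lines chunk by chunk on an executor; input order is preserved."""
    chunks = list(chunked(lines, chunk_size))
    with executor_cls(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the submitted chunks.
        parsed_chunks = pool.map(parse_chunk, chunks)
        return [record for chunk in parsed_chunks for record in chunk]


# Tiny in-memory example; real input would be the lines of one data file.
sample = ['{"id": 1, "title": "A"}', '', '{"id": 2, "title": "B"}']
records = parse_json_lines(sample, chunk_size=2, max_workers=2)
```

Passing `executor_cls=ProcessPoolExecutor` should work with the same code, since `parse_chunk` is a top-level function, but the call then needs to sit under an `if __name__ == "__main__":` guard on platforms that spawn worker processes. The right `chunk_size` is something to measure, not guess.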
Setting up some benchmarks + regression tests might be a nice idea as well.
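One possible shape for this, as a rough sketch only: a regression check that a candidate parser returns exactly what a simple reference parser returns, plus a small timing helper. All names here (`best_time`, `check_same_output`, `parse_serial`, `parse_with_map`) are hypothetical, and `parse_with_map` is just a stand-in for whatever parallel implementation ends up being tested.

```python
import json
import time


def best_time(fn, data, repeats=3):
    """Return the fastest wall-clock time in seconds of fn(data) over `repeats` runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(data)
        timings.append(time.perf_counter() - start)
    return min(timings)


def check_same_output(reference_fn, candidate_fn, data):
    """Regression check: the candidate must return exactly what the reference returns."""
    expected = reference_fn(data)
    actual = candidate_fn(data)
    if actual != expected:
        raise AssertionError("candidate output differs from reference output")
    return True


def parse_serial(lines):
    """Reference implementation: parse one JSON object per line."""
    return [json.loads(line) for line in lines]


def parse_with_map(lines):
    """Stand-in candidate; replace with the parallel parser under test."""
    return list(map(json.loads, lines))


# Synthetic input; a real benchmark would use a slice of the actual data files.
lines = [json.dumps({"id": i}) for i in range(1000)]
same = check_same_output(parse_serial, parse_with_map, lines)
serial_seconds = best_time(parse_serial, lines)
candidate_seconds = best_time(parse_with_map, lines)
print(f"serial: {serial_seconds:.6f}s, candidate: {candidate_seconds:.6f}s")
```

Taking the minimum over a few repeats reduces noise from other processes; for anything more serious, a dedicated benchmarking tool would likely be a better fit.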
Each sub-set of data and each data source can be processed in parallel. Dask can be used to parallelize this.