MI-DPLA / combine

Combine /kämˌbīn/ - Metadata Aggregator Platform
MIT License

Static harvests prohibitively slow with spark .wholeTextFiles #143

Closed ghukill closed 6 years ago

ghukill commented 6 years ago

Running a static harvest over 200k+ records reveals that this approach for reading files does not scale well:

# read directory of static files
static_rdd = spark.sparkContext.wholeTextFiles(
    'file://%s' % kwargs['static_payload'],
    minPartitions=settings.SPARK_REPARTITION
)

Some cursory research shows that Spark does not handle a large number of small files well, at least not without some tuning.

Investigate alternatives.
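One common mitigation for the small-files problem (a hypothetical pre-processing step, not necessarily what Combine ended up doing) is to concatenate many small record files into fewer, larger batch files before handing the directory to Spark, since wholeTextFiles pays per-file overhead on 200k+ inputs. A minimal stdlib-only sketch; function and file names are illustrative:

```python
import os

def batch_small_files(src_dir, dest_dir, target_bytes=64 * 1024 * 1024):
    """Concatenate many small record files into fewer large batch files.

    Hypothetical helper: fewer, larger inputs avoid the per-file
    overhead that makes wholeTextFiles slow on huge directories.
    Records are newline-separated so Spark can later split on them.
    Returns the number of batch files written.
    """
    os.makedirs(dest_dir, exist_ok=True)
    batch_idx, current_size, out = 0, 0, None
    for name in sorted(os.listdir(src_dir)):
        path = os.path.join(src_dir, name)
        if not os.path.isfile(path):
            continue
        # open a new batch file when none is open or the current one is full
        if out is None or current_size >= target_bytes:
            if out:
                out.close()
            out = open(os.path.join(dest_dir, 'batch_%05d.txt' % batch_idx), 'w')
            batch_idx += 1
            current_size = 0
        with open(path) as f:
            data = f.read()
        out.write(data.rstrip('\n') + '\n')
        current_size += len(data)
    if out:
        out.close()
    return batch_idx
```

The batched output could then be read with a line-oriented reader (e.g. sc.textFile), which parallelizes over blocks rather than over individual files.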

ghukill commented 6 years ago

In addition to reworking static files, it might be worth simultaneously thinking about other ways of harvesting/importing data, as that might have bearing on the design.

For example, what about a more open-ended, programmable endpoint? The ability to connect to a server or script that can operate in any fashion, but returns data in a way that Spark can handle efficiently? We know that OAI-PMH is a popular source, but Combine might not be the place to attempt to handle the myriad of formats, structures, and protocols.

ghukill commented 6 years ago

This is currently being addressed with a rewrite of static harvesting, using Spark-XML. Closing.
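The resolution mentions Spark-XML, which splits a document into rows on a configurable record element (rowTag). A minimal sketch of the idea, assuming each metadata record is a `<record>` element (the element name and helper are illustrative, not taken from Combine's code): individual record strings are wrapped into a single envelope document that a rowTag-based reader can then split in parallel.

```python
import xml.etree.ElementTree as ET

def wrap_records(record_strings, root_tag='records'):
    """Combine individual XML record strings into one envelope document.

    Hypothetical helper: a rowTag-based reader such as spark-xml
    (with rowTag set to the record element) can then split the
    envelope into rows, avoiding one-file-per-record overhead.
    """
    root = ET.Element(root_tag)
    for s in record_strings:
        root.append(ET.fromstring(s))
    return ET.tostring(root, encoding='unicode')

# Reading the envelope with spark-xml might then look like
# (sketch, not verified against Combine's actual code):
# df = spark.read.format('xml').option('rowTag', 'record').load(path)
```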