ghukill closed this issue 6 years ago
In addition to reworking static files, it might be worth simultaneously thinking about other ways of harvesting/importing data, as that might have bearing.
For example, what about a more open-ended, programmable endpoint: the ability to connect to a server or script that can operate in any fashion but returns data in a form that Spark can handle efficiently? We know that OAI-PMH is a popular source, but Combine might not be the place to attempt to handle the myriad formats, structures, and protocols.
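To make the idea a bit more concrete, here is a rough sketch of what such a contract could look like, assuming the external script emits one JSON object per line (JSON Lines), a layout Spark's JSON reader can consume. The `harvest` function, the `record_id` and `document` field names, and the sample records are all hypothetical and only illustrate the shape of the output, not anything Combine currently does.

```python
import io
import json
from typing import Dict, Iterable, Iterator, TextIO


def harvest() -> Iterator[Dict[str, str]]:
    """Hypothetical harvester: it may fetch records however it likes,
    as long as it yields one dict per record."""
    for i in range(3):
        yield {
            "record_id": f"rec-{i}",  # assumed field name
            "document": f"<record><title>Title {i}</title></record>",  # assumed field name
        }


def write_json_lines(records: Iterable[Dict[str, str]], out: TextIO) -> int:
    """Write each record as a single line of JSON and return the record count."""
    count = 0
    for record in records:
        out.write(json.dumps(record) + "\n")
        count += 1
    return count


# Example run against an in-memory buffer instead of a real file or socket.
buffer = io.StringIO()
written = write_json_lines(harvest(), buffer)
```

The point of the single-line-per-record layout is that the harvesting logic stays entirely outside Combine, while the handoff format stays simple enough to split and read in parallel.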
This is currently being addressed with a rewrite of static harvesting using Spark-XML. Closing.
Running a static harvest over 200k+ records reveals that this approach for reading files does not scale well:
Some cursory research shows that Spark does not handle a large number of small files well, at least not without some tuning.
Investigate alternatives.
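One alternative that might be worth investigating, sketched here under assumptions: consolidate the small files into a handful of larger ones before Spark reads them. The sketch assumes each small file holds exactly one XML record with no XML declaration; the `consolidate` function, the `<records>` wrapper element, and the `batch-NNNNN.xml` naming are hypothetical, not part of Combine.

```python
import tempfile
from pathlib import Path
from typing import List


def _flush(batch: List[str], target_dir: Path, index: int) -> Path:
    """Write one batch of records into a single wrapper document."""
    out_path = target_dir / f"batch-{index:05d}.xml"
    body = "\n".join(batch)
    out_path.write_text(f"<records>\n{body}\n</records>\n", encoding="utf-8")
    return out_path


def consolidate(source_dir: Path, target_dir: Path, records_per_file: int = 10000) -> List[Path]:
    """Merge many one-record XML files into fewer, larger files.

    Assumes every *.xml file in source_dir contains a single record
    and no XML declaration (an assumption for this sketch).
    """
    target_dir.mkdir(parents=True, exist_ok=True)
    outputs: List[Path] = []
    batch: List[str] = []
    for path in sorted(source_dir.glob("*.xml")):
        batch.append(path.read_text(encoding="utf-8").strip())
        if len(batch) == records_per_file:
            outputs.append(_flush(batch, target_dir, len(outputs)))
            batch = []
    if batch:  # write the final, partially filled batch
        outputs.append(_flush(batch, target_dir, len(outputs)))
    return outputs
```

With 200k records and the default batch size, this would leave 20 files to read rather than 200k, and a reader that splits on a row tag (as Spark-XML does) could then pick the individual records back out.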