databrickslabs / tika-ocr


Scalability issues when storing binary file in pyspark column #44

Open aleksandrskrivickis opened 3 months ago

aleksandrskrivickis commented 3 months ago

Dear @aamend @alexott @nfx, I appreciate your work on making the Tika file format possible.

After reviewing the serialiser code, I noticed that you store the binary file as one of the DataFrame columns.

Such a construct does not allow a stable pipeline at a scale of more than 1,000 large documents.

It could be prudent to store the binary files outside of the result DataFrame.
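
For illustration, a minimal sketch of what I have in mind, assuming the raw bytes are exposed in a `content` column and the reader is registered under the `tika` format (paths and the output format are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read documents with the Tika reader, then drop the raw binary payload
# so the result DataFrame carries only extracted text and metadata.
df = spark.read.format("tika").load("/mnt/raw/documents/")

# The binaries remain in object storage and can be re-read from the
# source path on demand; only the lightweight columns are persisted.
slim = df.drop("content")
slim.write.mode("overwrite").parquet("/mnt/curated/tika_output/")
```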

Let me know your thoughts.

arcaputo3 commented 1 month ago

It should be straightforward to add an option to ignore the `content` column, but Tika still requires having the entire binary in memory to do OCR, so IMO memory is the bottleneck, not storage.

We've had success using small partitions and high-memory clusters for larger workflows. You can set `spark.conf.set("spark.sql.files.maxPartitionBytes", 4194304)`, which attempts to reduce the partition size to 4 MB from the Spark default of 128 MB.
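
A minimal sketch of that setting in context, assuming the `tika` format from this repo and an illustrative input path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Cap input partitions at 4 MB (4 * 1024 * 1024 bytes) instead of the
# 128 MB default, so each task holds fewer large binaries in memory.
spark.conf.set("spark.sql.files.maxPartitionBytes", 4194304)

# With the smaller cap, the same scan yields more, smaller partitions.
df = spark.read.format("tika").load("/mnt/raw/documents/")
print(df.rdd.getNumPartitions())
```

Note that `maxPartitionBytes` is an upper bound and a single file is not split across partitions, so the largest individual documents still set the per-task memory floor.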