I have to convert ~400M HTML files to text for a thing, and I wanted to figure out the most efficient method. I ran a bunch of benchmarks using html2text (this library), the C++ html2text library (https://github.com/grobian/html2text), tika (https://pypi.org/project/tika/), and finally the Rust library of the same name (https://docs.rs/html2text/latest/html2text/).
Here are the results I got:
Single-threaded Statistics:
Method: tika
Documents Processed: 2172
Total Output Size: 5463636 bytes
Errors Encountered: 21
Time Taken: 10.783208847045898 seconds

Method: html2text
Documents Processed: 2172
Total Output Size: 6333919 bytes
Errors Encountered: 0
Time Taken: 8.442482948303223 seconds

Multi-threaded Statistics:

Method: tika
Documents Processed: 2172
Total Output Size: 5463636 bytes
Errors Encountered: 21
Time Taken: 3.171293020248413 seconds

Method: html2text
Documents Processed: 2172
Total Output Size: 6333919 bytes
Errors Encountered: 0
Time Taken: 8.368132829666138 seconds
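For anyone wanting to produce numbers in the same shape, a harness along these lines would report the same fields. This is a minimal sketch, not the script actually used for the results above: the `benchmark` and `convert_file` names are made up for illustration, "Documents Processed" is assumed to count every file attempted (including ones that error), and the converter is passed in as a plain callable so any library can be plugged in.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Iterable, Optional


def convert_file(path: Path, convert: Callable[[str], str]) -> int:
    """Convert one HTML file and return the size of the text output in bytes."""
    html = path.read_text(encoding="utf-8", errors="replace")
    return len(convert(html).encode("utf-8"))


def benchmark(paths: Iterable[Path], convert: Callable[[str], str], workers: int = 1) -> dict:
    """Run `convert` over every file and collect count, size, errors and time."""
    paths = list(paths)

    def safe(path: Path) -> Optional[int]:
        # A failed conversion is recorded as None instead of aborting the run.
        try:
            return convert_file(path, convert)
        except Exception:
            return None

    start = time.perf_counter()
    if workers > 1:
        # Multi-threaded run: files are distributed over a thread pool.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            results = list(pool.map(safe, paths))
    else:
        # Single-threaded run: plain loop over the files.
        results = [safe(p) for p in paths]
    elapsed = time.perf_counter() - start

    return {
        "documents": len(paths),  # assumption: every attempted file counts
        "output_bytes": sum(r for r in results if r is not None),
        "errors": sum(1 for r in results if r is None),
        "seconds": elapsed,
    }
```

With the Python html2text package installed, a call could look like `benchmark(Path("html_dir").glob("*.html"), html2text.html2text, workers=8)`, where `html_dir` is a placeholder directory. The nearly identical single- and multi-threaded html2text timings would be consistent with a CPU-bound pure-Python converter being limited by the GIL, though that is a guess and not something measured here.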
Finally:
Rust Conversion Statistics:
Documents Processed: 2171
Total Output Size: 19940427 bytes
Time Taken: 0.68 seconds
(had to use a different processing method).
The size difference is because the Rust implementation kept URLs in its output, and I couldn't figure out how to remove them.
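One way to make the output sizes more comparable would be to strip URLs from the converted text as a post-processing step. The sketch below is only an illustration of that idea: `strip_urls` and `URL_RE` are hypothetical names, the regex only catches bare `http://` and `https://` URLs, and it says nothing about how the Rust library actually formats its links.

```python
import re

# Hypothetical pattern: a bare http(s) URL, ending at whitespace or a
# closing bracket, parenthesis or angle bracket.
URL_RE = re.compile(r"https?://[^\s)\]>]+")


def strip_urls(text: str) -> str:
    """Remove bare URLs and collapse the extra spaces they leave behind."""
    without_urls = URL_RE.sub("", text)
    # Collapse runs of spaces/tabs only; newlines are left untouched.
    return re.sub(r"[ \t]{2,}", " ", without_urls)
```

Comparing `len(strip_urls(output).encode("utf-8"))` across tools would then give a rough size comparison with the URLs taken out.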
Note that the Rust version is multithreaded using rayon. It takes a list of filenames as input.
Anyway: almost identical functionality, but ~12x faster than the current implementation (0.68 seconds vs. about 8.4 seconds).
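The ~12x figure can be checked directly from the reported times. This small sketch just divides each Python timing by the Rust timing; the numbers are copied from the results above, and the `speedup` helper is only there for illustration.

```python
# Times in seconds, copied from the benchmark results above.
RUST_SECONDS = 0.68
python_runs = {
    "tika, single-threaded": 10.783208847045898,
    "html2text, single-threaded": 8.442482948303223,
    "tika, multi-threaded": 3.171293020248413,
    "html2text, multi-threaded": 8.368132829666138,
}


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


speedups = {name: speedup(t, RUST_SECONDS) for name, t in python_runs.items()}
for name, ratio in speedups.items():
    print(f"{name}: {ratio:.1f}x")
```

This gives roughly 12.4x against single-threaded html2text and 12.3x against the multi-threaded run. Keep in mind the comparison isn't perfectly like-for-like: the Rust run is multithreaded and processed 2171 documents, not 2172.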
It would be really cool if this could be implemented as part of this library, so everyone doing html2text in Python has the option to 12x their conversion speed...?