Alir3z4 / html2text

Convert HTML to Markdown-formatted text.
alir3z4.github.io/html2text/
GNU General Public License v3.0
1.74k stars, 266 forks

Considered using the Rust library html2text under the hood? #401

Closed mpr1255 closed 10 months ago

mpr1255 commented 10 months ago

I need to convert ~400M HTML files to plain text, and I wanted to find the most efficient method. I ran a set of benchmarks using html2text (this library), the C++ library html2text (https://github.com/grobian/html2text), tika (https://pypi.org/project/tika/), and finally the Rust library of the same name (https://docs.rs/html2text/latest/html2text/).

Here are the results I got:

Single-threaded Statistics:

| Method | Documents processed | Total output size (bytes) | Errors encountered | Time taken (s) |
| --- | --- | --- | --- | --- |
| tika | 2172 | 5463636 | 21 | 10.783208847045898 |
| html2text | 2172 | 6333919 | 0 | 8.442482948303223 |

Multi-threaded Statistics:

| Method | Documents processed | Total output size (bytes) | Errors encountered | Time taken (s) |
| --- | --- | --- | --- | --- |
| tika | 2172 | 5463636 | 21 | 3.171293020248413 |
| html2text | 2172 | 6333919 | 0 | 8.368132829666138 |
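For anyone who wants to reproduce numbers of this shape, here is a minimal sketch of a harness that reports the same four fields. It is not the script used above: `benchmark`, `stdlib_convert`, and the error-counting rule (any exception counts as one error) are assumptions made for illustration. The stand-in converter uses only the standard library; with this library installed, one would presumably pass `html2text.html2text` instead. Since CPython threads generally do not speed up pure-Python CPU-bound work, a thread pool like this one is unlikely to help a pure-Python converter, which may be why the html2text timing barely changes between the two runs.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Stand-in converter: collects text nodes and drops all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def stdlib_convert(html):
    """Hypothetical placeholder for a real converter such as html2text.html2text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


def _safe_convert(convert, html):
    """Return the converted text, or None if the converter raised."""
    try:
        return convert(html)
    except Exception:
        return None


def benchmark(convert, docs, workers=1):
    """Time `convert` over `docs`; `workers` > 1 uses a thread pool."""
    start = time.perf_counter()
    if workers == 1:
        results = [_safe_convert(convert, d) for d in docs]
    else:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            results = list(pool.map(lambda d: _safe_convert(convert, d), docs))
    elapsed = time.perf_counter() - start
    converted = [r for r in results if r is not None]
    return {
        # Assumption: failed documents still count as "processed".
        "documents_processed": len(results),
        "total_output_bytes": sum(len(r.encode("utf-8")) for r in converted),
        "errors": len(results) - len(converted),
        "seconds": elapsed,
    }


if __name__ == "__main__":
    sample_docs = ["<p>Hello <b>world</b></p>"] * 1000
    print("single-threaded:", benchmark(stdlib_convert, sample_docs))
    print("multi-threaded: ", benchmark(stdlib_convert, sample_docs, workers=8))
```

Swapping the thread pool for a process pool would be the usual way to get real parallelism from a pure-Python converter, at the cost of pickling each document.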

Finally:

Rust Conversion Statistics:

| Documents processed | Total output size (bytes) | Time taken (s) |
| --- | --- | --- |
| 2171 | 19940427 | 0.68 |

(had to use a different processing method).

The output is larger because the Rust implementation kept the URLs, and I couldn't figure out how to remove them.
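One way to make the output sizes comparable is to strip link targets in a post-processing pass. The sketch below is a workaround under stated assumptions, not something the Rust crate provides: it assumes the output uses Markdown-style inline links (`[text](url)`) or reference links (`[text][1]` with `[1]: url` lines), which has not been checked against the crate's actual output. The name `strip_link_targets` is hypothetical. On the Python side, this library exposes an `ignore_links` option on `HTML2Text` for the same purpose.

```python
import re

# Assumed link shapes in the converted text (not verified for the Rust crate):
#   inline:     [text](https://example.com)
#   reference:  [text][1]   plus a definition line   [1]: https://example.com
INLINE_LINK = re.compile(r"\[([^\]]*)\]\([^)\s]*\)")
REFERENCE_LINK = re.compile(r"\[([^\]]*)\]\[[^\]]*\]")
REFERENCE_DEF = re.compile(r"^\[[^\]]+\]:[ \t]+\S+[ \t]*\n?", re.MULTILINE)


def strip_link_targets(text):
    """Keep the visible link text and drop the URLs."""
    text = REFERENCE_DEF.sub("", text)       # remove "[1]: url" definition lines
    text = INLINE_LINK.sub(r"\1", text)      # "[text](url)" -> "text"
    text = REFERENCE_LINK.sub(r"\1", text)   # "[text][1]"   -> "text"
    return text


if __name__ == "__main__":
    sample = (
        "See [the docs][1] and [home](https://example.com).\n"
        "\n"
        "[1]: https://example.com/docs\n"
    )
    print(strip_link_targets(sample))
```

Bare URLs that appear as plain text are left untouched by this pass.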

Note that the Rust version is multithreaded using rayon and takes a list of filenames as input.

Anyway: almost identical functionality, but ~12x faster than the current implementation (0.68 seconds versus roughly 8.4 seconds on the same set of documents).

It would be really cool if this could be offered as part of this library, so that everyone doing html2text in Python has the option to speed up their conversion by ~12x.

mpr1255 commented 10 months ago

https://github.com/mpr1255/html2text_rs_py