aaronsw / html2text

Convert HTML to Markdown-formatted text.
http://www.aaronsw.com/2002/html2text/
GNU General Public License v3.0

Add an option to allow pure text to be returned (ignore page breaks, etc.) #68

Open jacebrowning opened 11 years ago

jacebrowning commented 11 years ago

I store LaTeX source in a Google Doc and use html2text to retrieve it later for processing. For this to work, the returned text must not contain any added special characters; it needs to match the source text exactly. It appears that html2text inserts "* * *" for page breaks.

Example:

html2text --google-doc --ignore-emphasis https://docs.google.com/document/d/.../pub?embedded=true

Actual Output:

% start of document

\begin{document}

* * *

% abstract

\newpage

\begin{abstract}

Abstract goes here...

\end{abstract}

* * *

% end of document

\end{document}

Desired Output:

% start of document

\begin{document}

% abstract

\newpage

\begin{abstract}

Abstract goes here...

\end{abstract}

% end of document

\end{document}
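
Until such an option exists, one possible workaround is to post-process the returned text and drop the "* * *" lines. The following is only a minimal sketch of that idea: `strip_rules` and `RULE_LINE` are hypothetical names of my own, not part of html2text, and the sketch assumes that no legitimate line of the LaTeX source consists solely of "* * *".

```python
import re

# Hypothetical helper, not part of html2text.
# Matches a line that contains nothing but the "* * *" rule.
RULE_LINE = re.compile(r"^\s*\* \* \*\s*$")


def strip_rules(text: str) -> str:
    """Drop '* * *' lines and collapse the blank lines they leave behind."""
    kept = [line for line in text.splitlines() if not RULE_LINE.match(line)]
    joined = "\n".join(kept)
    # Removing a rule leaves two blank lines in a row; reduce them to one.
    return re.sub(r"\n{3,}", "\n\n", joined).strip() + "\n"


if __name__ == "__main__":
    sample = (
        "% start of document\n\n"
        "\\begin{document}\n\n"
        "* * *\n\n"
        "% abstract\n"
    )
    print(strip_rules(sample), end="")
```

This would be applied to the output of the command above before the LaTeX is processed. It is a stopgap rather than a fix: a line in the document that really is "* * *" would be removed as well, which is why an option in html2text itself would be preferable.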