danburzo / percollate

A command-line tool to turn web pages into readable PDF, EPUB, HTML, or Markdown docs.
https://danburzo.ro/projects/percollate/
MIT License

Text only output #181

Closed winstonma closed 1 month ago

winstonma commented 1 month ago

First, thanks for the package. I really like it.

Feature description

I would like to feed the article to edge-tts as text only (with image URLs and headings removed), since a listener only hears the text. The following is what I would like to be able to run:

# Just a demo url
ARTICLE_URL="https://stratechery.com/2024/integration-and-android/"
# Output the result on screen
percollate text -o - "$ARTICLE_URL" | cat

Thanks again for this package.

Existing workarounds

I am using pandoc to convert the Markdown output to plain text:

# Just a demo url
ARTICLE_URL="https://stratechery.com/2024/integration-and-android/"
# Output the result on screen
percollate md -o - "$ARTICLE_URL" | pandoc -t plain | cat

Because of the pandoc conversion, each paragraph comes out split across multiple lines. When that text is fed into edge-tts, the audio is spoken with odd gaps.

Is there any way to obtain the desired effect with the current functionality?

danburzo commented 1 month ago

Hi @winstonma! My idea around supporting Markdown was that percollate would be used alongside other tools for further processing, and piping into pandoc is exactly the kind of thing that I would expect for converting to plain text. I understand, however, that pandoc's plain-text formatter produces hard-wrapped paragraphs which, as you said, are unsuitable for TTS.

I wrote a separate CLI tool for handling HTML/MD/Plain text called trimd. Its output may be closer to what you’re looking for. You may try either percollate html | trimd demarkup for HTML to text or percollate md | trimd demarkdown for MD to text.
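
For anyone who prefers to stay with the pandoc workaround, the hard-wrapping problem can also be handled after the fact. The sketch below is not part of percollate, pandoc, or trimd: `unwrap_paragraphs` is a hypothetical helper name, and it assumes paragraphs are separated by blank lines, as in pandoc's plain output. It joins the wrapped lines of each paragraph into a single line.

```shell
# Hypothetical helper (not part of any tool mentioned in this thread):
# join hard-wrapped lines so that each paragraph becomes one long line.
# Assumes paragraphs are separated by one or more blank lines.
unwrap_paragraphs() {
  # RS="" switches awk to paragraph mode: each blank-line-separated
  # block is one record. Newlines inside a record become spaces.
  awk 'BEGIN { RS = ""; ORS = "\n\n" } { gsub(/\n/, " "); print }'
}

# Demo on a small hard-wrapped sample.
printf 'First line of a\nwrapped paragraph.\n\nSecond\nparagraph.\n' | unwrap_paragraphs
```

In the workaround above, the helper would sit at the end of the pipeline: percollate md -o - "$ARTICLE_URL" | pandoc -t plain | unwrap_paragraphs. pandoc also documents a --wrap=none option, which may avoid the wrapping at the source and make the helper unnecessary.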

winstonma commented 1 month ago

Thanks a lot. I think the pure text output is better than pandoc's. That works in my scenario. Thanks again for your tool.