eafer / rdrview

Firefox Reader View as a command line tool
Apache License 2.0

Convert to text #12

Open fetchinson opened 3 years ago

fetchinson commented 3 years ago

Hi, rdrview is absolutely fantastic! The fastest and most relevant output I've come across from all the Firefox Readability-based tools I've tried.

One new feature would be a great addition, I think: converting the readable HTML output to text. Right now I use rdrview to get the readable HTML with "-H", then use the links or lynx browser to dump the formatted text with the -dump option.
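The workaround described above can be sketched as a small shell function. This is only an illustration, assuming rdrview and lynx are both on PATH; the function name `readable_text` is made up for this example.

```shell
# Sketch: render rdrview's cleaned-up HTML as plain text with lynx.
# Assumes rdrview and lynx are installed; readable_text is a hypothetical name.
readable_text() {
    # -H makes rdrview print the readable HTML to stdout;
    # lynx reads it from stdin and dumps formatted text.
    rdrview -H "$1" | lynx -dump -stdin -force_html
}
```

Usage would look like `readable_text "https://example.com/article" > article.txt`, where the URL is a placeholder.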

Would be nice if rdrview would have an option for outputting text.

In any case, thanks a lot for rdrview! (By the way I also had to throw away the sandbox stuff from the code because libseccomp would not compile on my system.)

eafer commented 3 years ago

> Hi, rdrview is absolutely fantastic! The fastest and most relevant output I've come across from all the Firefox Readability-based tools I've tried.

Thanks! I'm glad to hear it.

> Would be nice if rdrview would have an option for outputting text.

You can use mailcap for this purpose. Create a file under ~/.mailcap with a line such as the following:

text/html; /usr/bin/lynx -dump -force_html %s; copiousoutput; description=HTML Text; nametemplate=%s.html

This is the default on Debian (if lynx is installed). I wish sane mailcap defaults were more common; they are very useful.
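For anyone on a distro without that default, one possible way to add the entry is sketched below. The function name `add_mailcap_entry` is hypothetical, and the lynx path is copied from the line above, so adjust it if lynx lives elsewhere on your system.

```shell
# Sketch: append the text/html mailcap entry unless it is already present.
# add_mailcap_entry is a hypothetical helper; the optional argument exists
# only so the function can be tried on a scratch file.
add_mailcap_entry() {
    file="${1:-$HOME/.mailcap}"
    entry='text/html; /usr/bin/lynx -dump -force_html %s; copiousoutput; description=HTML Text; nametemplate=%s.html'
    # -x matches whole lines, -F treats the entry as a fixed string.
    grep -qxF -- "$entry" "$file" 2>/dev/null || printf '%s\n' "$entry" >> "$file"
}
```

Running `add_mailcap_entry` with no argument targets `~/.mailcap`; calling it twice should leave a single copy of the line.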

> (By the way I also had to throw away the sandbox stuff from the code because libseccomp would not compile on my system.)

Can you share any more details here? What's your system? If libseccomp is not always available I should do something to simplify the build in those cases. Or maybe just give up and use autoconf.

fetchinson commented 3 years ago

> (By the way I also had to throw away the sandbox stuff from the code because libseccomp would not compile on my system.)

> Can you share any more details here? What's your system? If libseccomp is not always available I should do something to simplify the build in those cases. Or maybe just give up and use autoconf.

I have a very old Fedora 17 installation, about 8 years old, and Red Hat no longer provides updates for it. I compile almost everything from source, and the only time I run into trouble is when my glibc is too old and the code I'm trying to compile relies on newer glibc features, which does happen sometimes. I couldn't compile libseccomp, but it wasn't a glibc-related problem; it threw

system.c:461:16: error: ‘__NR_seccomp’ undeclared (first use in this function)

and after looking at the code for a while and googling around I couldn't figure out where __NR_seccomp should come from. So I gave up on libseccomp, but could easily compile your code by simply deleting everything which was sandbox related.

By the way, what's the downside of running it without a sandbox?

eafer commented 3 years ago

> I couldn't figure out where __NR_seccomp should come from.

That's the syscall number for seccomp(); it comes from the kernel headers. It was introduced in Linux 3.17, so your kernel probably doesn't have it at all. Some of the seccomp stuff can also work with prctl(), but I guess libseccomp doesn't support that. If you want to build it, you need to upgrade to a more recent kernel.
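A rough way to check this before attempting a build is to look for the macro in the installed headers. This is a heuristic sketch only: the function name `has_seccomp_nr` is hypothetical, and `/usr/include` is assumed as the default header location.

```shell
# Sketch: report whether the installed headers mention __NR_seccomp.
# has_seccomp_nr is a hypothetical helper; the directory argument
# defaults to /usr/include, which is an assumption about the system.
has_seccomp_nr() {
    include_dir="${1:-/usr/include}"
    # -r searches recursively, -q only sets the exit status, -s hides read errors.
    grep -rqs '__NR_seccomp' "$include_dir"
}
```

For example, `has_seccomp_nr && echo "headers define __NR_seccomp"` prints the message only when the macro is found.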

> By the way, what's the downside of running it without a sandbox?

The sandbox is a security measure, in case there are exploitable bugs in my code. Since you are using such an old distro, I'm guessing security is not a concern for you, so you can ignore it.

If you want, you can try to secure rdrview by running it as a separate unprivileged user. It's not the same as the sandbox, but it's probably good enough.
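One way this might look is sketched below, assuming sudo is available and a dedicated account already exists. The account name `rdrview-user` and the function name `rdrview_unprivileged` are both hypothetical.

```shell
# Sketch: run rdrview under a separate, unprivileged account via sudo.
# The account name is hypothetical and must be created beforehand.
RDRVIEW_USER="${RDRVIEW_USER:-rdrview-user}"

rdrview_unprivileged() {
    # -H prints the readable HTML to stdout, so no browser is started
    # under the unprivileged account.
    sudo -u "$RDRVIEW_USER" -- rdrview -H "$@"
}
```

The HTML can then be rendered as your own user, for example by piping the output into `lynx -dump -stdin -force_html`.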

fetchinson commented 3 years ago

Okay, thanks, you're right, security is not really an issue in my setup. In the Makefile you could introduce a setting to have the sandbox not compile at all if it's a problem for other people too. But like I said, it's easy to just delete those parts of the code which refer to the sandbox, so it's not a big issue.

eafer commented 3 years ago

> You can use mailcap for this purpose. Create a file under ~/.mailcap with a line such as the following:
>
> text/html; /usr/bin/lynx -dump -force_html %s; copiousoutput; description=HTML Text; nametemplate=%s.html

Did you try this? Did it work for you? I ask so that I can close the issue.

sdsddsd1 commented 3 years ago

I had the same issue and was looking for an option to output text directly (an easy way to scroll). Adding the following line to $HOME/.mailcap is a good solution for me:

text/html; /usr/bin/lynx -dump -force_html %s; copiousoutput; description=HTML Text; nametemplate=%s.html

Maybe also add this to the documentation?

csehszlovakze commented 1 year ago

I'd also like an easy option to have plain text output. Right now I have to pipe the outputted HTML into html2text, strip out the formatting marks then use that to print+TTS.

rjolina commented 4 months ago

But rdrview can easily convert HTML into text.

$ rdrview "https://lite.cnn.com/alzheimers-risk-test-sanjay-gupta/index.html" > text.txt

$ cat text.txt

   Updated: 2:00 AM EDT, Sun May 19, 2024

   Source: CNN

   I’ve been reporting on Alzheimer’s disease for more than two decades, and
   any progress in the field has seemed incremental at best, leaving most
   patients and their loved ones with few options. But in the process of
   filming a new documentary, “The Last Alzheimer’s Patient,” I met with
   people all across the country who had been diagnosed with or who are at
   high risk of the disease. With lifestyle changes alone, I saw levels of
   amyloid plaque decrease in their brains, their cognition improve and even
   signs of reversal of the disease.

   It was extraordinary and it also made me start to think about my own
   brain, because I have a family history of Alzheimer’s disease.

   So with some trepidation, I decided to learn more about my risk for
   dementia. It was one of the most personal and revealing experiences I have
   ever gone through.

eafer commented 3 months ago

@rjolina Yes, it works fine out of the box if you use Debian or some other distro that sets up mailcap. Rdrview only cleans up the HTML; a browser (lynx, for example) is then needed to render that into text. The mailcap files tell rdrview which browser to use; otherwise it can be picked with the -B option. Of course it would be easier for users if rdrview did everything by itself, since there would be less configuration involved. But this is a small project, and rendering HTML is probably at least somewhat tricky.