babluboy / bookworm

A simple ebook reader for Elementary OS
GNU General Public License v3.0

Search doesn't work on Arch Linux #260

Closed by nortexoid 5 years ago

nortexoid commented 5 years ago

When I search for a simple string such as "Hume", I receive no search results even though Hume occurs probably 100 times in the epub file. The same is true for any other expression I've tried. The package was installed from Arch's community repos (not the AUR).

babluboy commented 5 years ago

Thanks for reporting the issue. Can you please provide debug details by running Bookworm in debug mode and then using the search function?

To start Bookworm in debug mode, open a terminal and run the command: com.github.babluboy.bookworm --debug

nortexoid commented 5 years ago

When I start bookworm with debug, I get a warning that says "Problem in extracting contents of book. Ensure there is a valid ebook file here", even before I open a book. When I attempt a search for 'Hume', the terminal outputs a bunch of instances of:

[WARNING 10:08:14.793538] utils.vala:106: Execution of sync command [/usr/share/bookworm/scripts/tasks/com.github.babluboy.bookworm.search.sh "/home/michael/.config/bookworm/books/(Muirhead Library of Philosophy 88) Moore George Edward - Some Main Problems of Philosophy-Routledge (2004).epub/OEBPS/001_9781315830537_halftitle.html" "Hume"]: exited with non zero error code[256]. Error message:/usr/share/bookworm/scripts/tasks/com.github.babluboy.bookworm.search.sh: line 8: html2text: command not found

It seems html2text is not a dependency of the Arch package, so I installed python-html2text. Now the search runs, but I still get no results because of the following warning:

[WARNING 10:18:54.386988] utils.vala:106: Execution of sync command [/usr/share/bookworm/scripts/tasks/com.github.babluboy.bookworm.search.sh "/home/michael/.config/bookworm/books/(Muirhead Library of Philosophy 88) Moore George Edward - Some Main Problems of Philosophy-Routledge (2004).epub/OEBPS/032_9781315830537_index.html" "Hume"]: exited with non zero error code[256]. Error message:Usage: html2text [(filename|url) [encoding]]html2text: error: no such option: -u

When I look at the script, I see the html2text option -utf8, but not -u, so I'm not sure why it's giving that warning. The script contents are:

HTML_CONTENT_TO_BE_SEARCHED=$1
USER_SEARCH_TEXT=$2
html2text -utf8 "$HTML_CONTENT_TO_BE_SEARCHED"  | tr '\n' ' ' | grep -E -o -i  ".{0,50}$USER_SEARCH_TEXT.{0,50}"

babluboy commented 5 years ago

Thanks for the debug details and your follow-up. However, Bookworm needs the html2text command-line tool as packaged for Debian, not python-html2text.
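
On the `-u` in the warning: the `Usage:` / `no such option` wording looks like output from Python's `optparse`, which, like POSIX `getopts`, reads a single-dash argument such as `-utf8` as a bundle of one-letter options (`-u -t -f -8`) and stops at the first one it does not know. The sketch below reproduces that behaviour with the shell's own `getopts`. `first_rejected_option` is a hypothetical helper written for illustration; it is not part of Bookworm or html2text.

```shell
#!/bin/sh
# Hypothetical helper: report the first option letter that a parser
# knowing only -h would reject. The leading ":" in the option string
# puts getopts in silent mode, so the unknown letter lands in OPTARG.
first_rejected_option() {
    OPTIND=1
    while getopts ":h" opt "$@"; do
        case $opt in
            \?)
                printf '%s\n' "-$OPTARG"
                return 1
                ;;
        esac
    done
    return 0
}
```

Calling `first_rejected_option -utf8` prints `-u`, which matches the option named in the warning even though the script passes `-utf8`.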

This is the man page for html2text : https://manpages.debian.org/testing/html2text/html2text.1.en.html

This package converts HTML to plain text, i.e. it strips all HTML tags. If you can find an Arch package that does the same, you can update the script with the right command and parameters. Once the HTML content is available as plain text, the search is done by piping the html2text output to the grep command.

babluboy commented 5 years ago

This looks to be an alternative for Arch: https://aur.archlinux.org/packages/txt2html/

However, I'm not sure what its parameters are, but it is worth a try. If the command can output the HTML content as plain text (with all the tags removed), that output can be piped to tr and grep to search for the text.

If nothing works, html2text can be removed from the script. However, the search results may then include matches inside HTML tags, e.g. a search for "color" can match a color attribute in a tag in addition to the word "color" in the content.
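
A minimal sketch of a middle ground, assuming only `tr`, `sed`, and `grep` are available: strip the tags with `sed` instead of dropping the conversion step entirely. `search_html_without_html2text` is a hypothetical function name, and this is not how Bookworm's script works. It does not decode HTML entities and leaves script and style bodies in place, so it is rougher than html2text.

```shell
#!/bin/sh
# Hypothetical fallback: search an HTML file without html2text.
# $1 = path to the HTML file, $2 = text to search for.
search_html_without_html2text() {
    html_file=$1
    search_text=$2
    # Join lines first so tags that span several lines are removed too,
    # then replace everything between "<" and ">" with a space.
    tr '\n' ' ' < "$html_file" \
        | sed -e 's/<[^>]*>/ /g' \
        | grep -E -o -i ".{0,50}$search_text.{0,50}"
}
```

Because the tags are gone before grep runs, a search for "color" no longer matches a `style="color:red"` attribute.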

nortexoid commented 5 years ago

Thanks very much for the replies. Installing the python-html2text package did install html2text as well; at least, when I type "html2text --help" it gives me a list of options. But I see these options differ from what's listed on the man page for html2text. Anyway, when I run it on an HTML file it outputs a text version of it, so it works. When I remove the -utf8 option from the script, search works, so that seems to be the easiest way to fix it. The Arch package maintainer should add python-html2text as a dependency and remove -utf8 from the script command.
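
As an alternative to patching the script per distribution, here is a sketch that probes the installed html2text once and passes `-utf8` only if it is accepted. The function names are hypothetical, and it assumes both variants read HTML from standard input when no file is given. It is not the script Bookworm ships.

```shell
#!/bin/sh
# Hypothetical probe: succeeds only if the installed html2text accepts -utf8.
html2text_accepts_utf8() {
    printf '<p>probe</p>\n' | html2text -utf8 >/dev/null 2>&1
}

# Hypothetical variant of the search script, written as a function.
# $1 = path to the HTML file, $2 = text to search for.
search_book_html() {
    html_file=$1
    search_text=$2
    if html2text_accepts_utf8; then
        html2text -utf8 "$html_file"
    else
        html2text "$html_file"
    fi | tr '\n' ' ' | grep -E -o -i ".{0,50}$search_text.{0,50}"
}
```

The probe costs one extra process per search, so a packaged version might prefer to decide once at install time.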

Tanuja878 commented 5 years ago

@nortexoid Many thanks for the investigation and the solution. I have opened a bug for bookworm on the Arch Bug List, hopefully the maintainer will update the script to remove the -utf8 option.

https://bugs.archlinux.org/task/62670

nortexoid commented 5 years ago

Great, thanks!

anatol commented 5 years ago

Hi folks, honestly I am a bit confused about which version of html2text should be used. Arch Linux has only one version: the one that comes from python-html2text.

I was not aware of different, incompatible versions of the html2text tool. Could anyone shed some light on why there are multiple html2text versions, and how the bookworm project chose one over another?

babluboy commented 5 years ago

@anatol I chose the Debian html2text because Bookworm targeted elementary OS, which is based on Ubuntu. I'm also surprised that the html2text on Arch does not have the -utf8 option. If the Arch build removes the -utf8 option, search will work on Arch.

On Debian, html2text uses ISO 8859-1 by default, so I pass -utf8 to ensure search works on characters from languages other than English.
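
To illustrate why the encoding matters, the sketch below uses `iconv` (assumed to be installed) to show roughly what happens when UTF-8 bytes are read as ISO 8859-1. `misread_as_latin1` is a hypothetical helper and does not reproduce html2text's actual internals.

```shell
#!/bin/sh
# Hypothetical helper: re-encode stdin as if its bytes were ISO 8859-1.
misread_as_latin1() {
    iconv -f ISO-8859-1 -t UTF-8
}

# "café" as UTF-8 bytes: the é is the two bytes \303\251.
printf 'caf\303\251\n' | misread_as_latin1
```

Each byte of the two-byte "é" is treated as its own Latin-1 character, so the output reads "cafÃ©" and a search for "café" would no longer match.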

babluboy commented 9 months ago

Could you please check if html2text is available on your Arch OS? The search function requires the utility. This is the instruction to install html2text on Ubuntu: https://installati.one/install-html2text-ubuntu-20-04/