feature request: import html pages

jolars commented 4 years ago

Please consider implementing support for importing HTML pages, in particular pages from Wikipedia. The now-abandoned incremental reading add-on for Anki could fetch HTML pages from the web and then let you extract text, highlight, and create clozes from it. Perhaps it would be possible to borrow code from that add-on?

fonol commented 4 years ago

Any concrete suggestions for how this workflow should go? I don't remember exactly how the IR add-on operated: did it show the imported webpage as text only, or did it display it as HTML? I am open to including more import functionality, so feel free to send me your ideas.

You can already import webpages as pdfs btw, click on Notes -> Url to PDF (you have to set a folder to save to in the settings first).

jolars commented 4 years ago

Perhaps allowing URL input in the Source field in Notes > Create? The incremental reading add-on had a separate shortcut that opened up a dialogue where the user was prompted to enter a URL. It, however, had a different setup for the "notes" that were added, treating them as regular Anki cards but with a separate queue system similar to the one you have developed for this add-on.

> did it show the imported web page as text only, or did it display it as HTML?

It displayed (and imported) it as HTML. It did not properly import images, however, and failed to handle cross-references within, for instance, the Wikipedia page.

It would be nice if it were possible to import math as MathML/MathJax as well, but I'm not sure how that would work in practice.

> You can already import webpages as pdfs btw, click on Notes -> Url to PDF (you have to set a folder to save to in the settings first).

Thanks, yes I noticed this, very nice. I believe, however, that it might be easier to work with HTML rather than PDF.

fonol commented 4 years ago

> Perhaps allowing URL input in the Source field in Notes > Create

That could be an idea. But maybe with a prefix, like url:, as I don't want to remove the possibility of specifying URLs as regular sources. I'll look out for a Python library on top of BeautifulSoup that can maybe archive whole pages.
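
To make the prefix idea concrete, here is a minimal sketch of how the Source field could be interpreted. It is not the add-on's code: the function name `parse_source_field` and the constant `URL_PREFIX` are made up for illustration, and the only assumption taken from the thread is that a leading url: should trigger an import while everything else stays a regular source.

```python
from typing import Tuple

# Hypothetical marker, following the "url:" prefix suggested above.
URL_PREFIX = "url:"


def parse_source_field(value: str) -> Tuple[str, str]:
    """Decide whether the Source field asks for a page import.

    Returns ("import", url) when the text starts with the prefix and
    something follows it; otherwise ("source", text), so a plain URL
    is still stored as a regular source.
    """
    text = value.strip()
    if text.lower().startswith(URL_PREFIX):
        target = text[len(URL_PREFIX):].strip()
        if target:
            return ("import", target)
    return ("source", text)
```

With this, `url:https://example.com` would be treated as an import request, while `https://example.com` on its own would remain an ordinary source reference.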

jolars commented 4 years ago

Yes, or perhaps a "URL to HTML" option in the drop-down list for Notes would be more suitable, since it probably belongs together with "URL to PDF".

fonol commented 4 years ago

With the last update, there is now a really basic import function in the "Create" dialog. It fetches just the body of the given page and strips everything non-textual. A lot of noise comes along with the content, but I don't really know what to do about that.
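
For anyone who wants to experiment with the noise problem, here is a rough sketch of the "keep the body, drop everything non-textual" approach. It is not the add-on's actual implementation: it uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra dependencies, the names `BodyTextExtractor`, `extract_body_text`, and `SKIPPED_TAGS` are invented here, and the list of skipped tags is only a guess at what typically counts as noise. Fetching the page is left out.

```python
from html.parser import HTMLParser

# Assumed noise containers; adjust to taste.
SKIPPED_TAGS = {"script", "style", "noscript", "svg",
                "nav", "header", "footer", "aside", "form"}


class BodyTextExtractor(HTMLParser):
    """Collect text inside <body>, ignoring the contents of noisy tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_body = False
        self.skip_depth = 0   # > 0 while inside any skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.in_body and self.skip_depth == 0 and text:
            self.chunks.append(text)


def extract_body_text(html: str) -> str:
    """Return the visible body text, one text chunk per line."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Skipping navigation, header, and footer elements removes part of the noise, but site-specific clutter (Wikipedia edit links, for instance) would still need its own rules.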

fonol / anki-search-inside-add-card

feature request: import html pages #69