Is it possible to parse archived blogs with a new host?

bohdanbobrowski / blog2epub

Convert blog (blogspot.com, wordpress.com...) or any website to epub using GUI, CLI or Python.

https://github.com/bohdanbobrowski/blog2epub

MIT License

40 stars, 6 forks

Is it possible to parse archived blogs with a new host? #18

Closed: ctnoir closed this issue 1 week ago

ctnoir commented 4 months ago

Thanks for this software! I have used it successfully to archive a lot of older blogs for offline reading on my phone.

I've been trying to get it to work on this archived blog, which was originally hosted on blogspot: https://thearchdruidreport-archive.200605.xyz/2017/05/index.html

It should have the same folder and link structure, but I get an error 403 when trying to transform into an epub.

Is there anything I can do about this issue?

bohdanbobrowski commented 4 months ago

First of all, I would like to point out that this is my private project, which I have been developing for several years - with varying degrees of success - but such comments certainly give me a lot of motivation to continue working. Thank you!

To answer the main question: such scraping should be possible, but I haven't tested it, so I'm not surprised you got an error. I will leave this issue open and try to implement it soon. I'm currently working on version 1.3.0 (there is a branch), which will bring a lot of changes to the UX as well as a code refactor that will make further changes easier and faster to implement.
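As a side note for anyone hitting the same 403: one common cause (not confirmed for this site) is a server rejecting the default User-Agent that HTTP libraries send. The sketch below is a generic diagnostic using only the Python standard library; it is not blog2epub code, and `BROWSER_UA`, `build_request` and `status_for` are hypothetical names chosen for illustration.

```python
import urllib.error
import urllib.request

# Assumption: an illustrative browser-like User-Agent string, not a required value.
BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def build_request(url, user_agent=None):
    """Build a GET request, optionally overriding the User-Agent header."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    return urllib.request.Request(url, headers=headers)


def status_for(url, user_agent=None, timeout=10):
    """Return the HTTP status code for url, including error codes such as 403."""
    try:
        request = build_request(url, user_agent)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # HTTPError carries the status code the server returned.
        return err.code


if __name__ == "__main__":
    url = "https://thearchdruidreport-archive.200605.xyz/2017/05/index.html"
    print("default User-Agent:", status_for(url))
    print("browser User-Agent:", status_for(url, BROWSER_UA))
```

If the two status codes differ, the block is probably based on the User-Agent; if both are 403, the cause lies elsewhere (for example IP-based or bot-protection rules) and this check will not help.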

ctnoir commented 4 months ago

I live in a rural area with intermittent internet, so it is really nice for me to be able to turn websites into epubs. I did this manually in the past with wget, then used the raw HTML files in Calibre to create something readable on my phone, but arranging things by hand was always time-consuming. Plus, sometimes the comments would not be captured, and that's where a lot of really good insight is.

So this software is an absolute godsend for me and I was so happy to discover it one day. I have created around 20 epubs so far with it and they all work perfectly with no additional work needed on my side. I can now happily read about the history of Western civilisation or about small town life without needing any additional internet connection in the woods :D Thank you so much for your work on this project.

bohdanbobrowski commented 1 week ago

I've recently pushed a bunch of changes to the dev branch that will be released soon in version 1.5.0. One of them is the ability to download archived blogs. It's still very imperfect, but I might refine this mechanism in future versions.