dipu-bd / lightnovel-crawler

Generate and download e-books from online sources.
https://pypi.org/project/lightnovel-crawler/
GNU General Public License v3.0
1.41k stars 275 forks

Create an analyzer to build crawlers automatically #501

Closed dipu-bd closed 1 year ago

dipu-bd commented 4 years ago

It will take a TOC URL as input and provide a Scrapy-like shell environment for selecting elements on the page. You tell it how to find the title, cover URL, volume list, chapter list, etc., and it generates a Python file that can be used as a scraper.
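
To illustrate the last step (turning collected selectors into a Python file), here is a minimal sketch. The function name `render_scraper`, the layout of the generated class, and the example selector are all assumptions made for illustration; this is not the analyzer's actual output format.

```python
def render_scraper(class_name, base_url, selectors):
    """Render a tiny Python module that stores the chosen CSS selectors.

    Hypothetical helper: the real analyzer may emit a very different file.
    """
    lines = [
        f"class {class_name}:",
        f"    base_url = {base_url!r}",
        "    selectors = {",
    ]
    # One dictionary entry per field the user told the analyzer how to find.
    for field, css in sorted(selectors.items()):
        lines.append(f"        {field!r}: {css!r},")
    lines.append("    }")
    return "\n".join(lines) + "\n"


source = render_scraper(
    "WuxiaworldWorld",
    "https://wuxiaworld.world/",
    {"title": "div.post-title > h1"},  # made-up selector
)
print(source)
```

Using `repr()` for every value keeps quotes and backslashes in selectors escaped, so the generated file stays valid Python.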

dipu-bd commented 4 years ago

Added a very basic analyzer.

To start, run python . analyze

A scraper file will be auto-generated, and you will get a command-line interface for further modification and testing.

? Please enter an URL: https://wuxiaworld.world/ascenders-rift
GET https://wuxiaworld.world/ascenders-rift
Status: 200 - OK

host = wuxiaworld.world
scraper_url = https://wuxiaworld.world/
scraper_name = WuxiaworldWorld
scraper_path = C:\Users\Dipu\Projects\lightnovel-crawler\lncrawl\sources\generated\wuxiaworld-world.py
Generated: C:\Users\Dipu\Projects\lightnovel-crawler\lncrawl\sources\generated\wuxiaworld-world.py

>>>
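
The values printed above (host, scraper_url, scraper_name and the file name) can all be derived from the entered URL. The following is only a sketch that reproduces the values in this transcript; `derive_scraper_names` is a hypothetical name, and the analyzer's real naming rules may differ (for example in how it treats a leading `www.`).

```python
import re
from urllib.parse import urlparse


def derive_scraper_names(url):
    """Derive naming values from a novel URL (assumed rules, see lead-in)."""
    parsed = urlparse(url)
    host = parsed.netloc
    # Split the host on anything that is not a letter or digit.
    parts = [p for p in re.split(r"[^A-Za-z0-9]+", host) if p]
    return {
        "host": host,
        "scraper_url": f"{parsed.scheme}://{host}/",
        # CamelCase class name, e.g. WuxiaworldWorld
        "scraper_name": "".join(p.capitalize() for p in parts),
        # Hyphenated file name, e.g. wuxiaworld-world.py
        "file_name": "-".join(p.lower() for p in parts) + ".py",
    }


names = derive_scraper_names("https://wuxiaworld.world/ascenders-rift")
for key, value in names.items():
    print(f"{key} = {value}")
```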

Here are some available commands:

Command     What it does
exit        Exit this interface
clear       Clear the console
help        Display help
view / ls   See the list of selectors you can use
locate      Locate an item by text or attribute value
modify      Modify an already saved selector
set_url     Change the current URL
generate    Generate a source file with the current selectors

You can auto-generate CSS selectors for a piece of text using the locate command.
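
As a rough idea of what locating by text involves, the sketch below finds every element whose text contains a search string and builds a CSS selector path to it. It uses only the standard library; the `TextLocator` class, the `locate_by_text` function, and the sample HTML are assumptions for illustration and do not show how the analyzer's own locate command is implemented.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TextLocator(HTMLParser):
    """Collect CSS selector paths of elements whose text contains `needle`."""

    def __init__(self, needle):
        super().__init__()
        self.needle = needle.strip().lower()
        self.stack = []    # (tag, selector step) for each open element
        self.matches = []  # selector paths found so far

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        step = tag
        if attrs.get("id"):
            step += "#" + attrs["id"]
        elif attrs.get("class"):
            step += "." + ".".join(attrs["class"].split())
        self.stack.append((tag, step))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag (tolerates unclosed tags).
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.needle and self.stack and self.needle in data.lower():
            self.matches.append(" > ".join(step for _, step in self.stack))


def locate_by_text(html, text):
    parser = TextLocator(text)
    parser.feed(html)
    parser.close()
    return parser.matches


# Made-up page fragment for demonstration.
sample = ('<html><body><div class="post-title"><h1>Ascenders Rift</h1></div>'
          '<p>other text</p></body></html>')
selectors = locate_by_text(sample, "ascenders rift")
print(selectors)
```

A full path like this is brittle on real pages; in practice one would usually shorten it to the nearest ancestor with an id or a distinctive class.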


dipu-bd commented 3 years ago

Should implement #1001 before using this

damare01 commented 3 years ago

@dipu-bd is this feature already available for use?

damare01 commented 3 years ago

When I run the command python . analyze, I get this error: C:\Program Files\Python39\python.exe: can't find '__main__' module in 'C:\\Users\\user\\OneDrive\\Desktop\\lightnovel-crawler'

dipu-bd commented 3 years ago

This feature is experimental. I made it a year ago, and it is not ready to use yet. I plan to work on it again sometime later.

damare01 commented 3 years ago

okay, thank you!

dipu-bd commented 1 year ago

After the recent changes to dev, it is now possible to create crawlers automatically using the existing templates.

Here is how to do it:

(venv) $ python -m lncrawl --bot lookup
================================================================================
╭╮╱╱╱╱╱╱╭╮╱╭╮╱╱╱╱╱╱╱╱╱╱╱╱╭╮╱╭━━━╮╱╱╱╱╱╱╱╱╱╭╮
┃┃╱╱╱╱╱╱┃┃╭╯╰╮╱╱╱╱╱╱╱╱╱╱╱┃┃╱┃╭━╮┃╱╱╱╱╱╱╱╱╱┃┃
┃┃╱╱╭┳━━┫╰┻╮╭╋━╮╭━━┳╮╭┳━━┫┃╱┃┃╱╰╋━┳━━┳╮╭╮╭┫┃╭━━┳━╮
┃┃╱╭╋┫╭╮┃╭╮┃┃┃╭╮┫╭╮┃╰╯┃┃━┫┃╱┃┃╱╭┫╭┫╭╮┃╰╯╰╯┃┃┃┃━┫╭╯
┃╰━╯┃┃╰╯┃┃┃┃╰┫┃┃┃╰╯┣╮╭┫┃━┫╰╮┃╰━╯┃┃┃╭╮┣╮╭╮╭┫╰┫┃━┫┃
╰━━━┻┻━╮┣╯╰┻━┻╯╰┻━━╯╰╯╰━━┻━╯╰━━━┻╯╰╯╰╯╰╯╰╯╰━┻━━┻╯
╱╱╱╱╱╭━╯┃ v3.0.0
╱╱╱╱╱╰━━╯ 🔗 https://github.com/dipu-bd/lightnovel-crawler
--------------------------------------------------------------------------------

➡ Press  Ctrl + C  to exit

? Enter novel page url: https://www.novelmtl.com/novel/ancient-true-dragon-art.html
🍀 Checking MadaraTemplate [1 of 2] 
  ➡ Create instance : success
  ➡ initialize() : success
  ➡ read_novel_info() : failed
    No title found

🍀 Checking NovelMTLTemplate [2 of 2] 
  ➡ Create instance : success
  ➡ initialize() : success
  ➡ read_novel_info() : success
  ➡ download_chapter_body() : success

? Enter language: en
? Does it contain Manga/Manhua/Manhwa? No
? Does it contain Machine Translations? No

📦 Generated source file 📦
➡ /home/dipu/projects/lightnovel-crawler/sources/en/n/novelmtl.py

If you are unsure which language code to enter, check the language_codes.
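
The lookup transcript above suggests a simple loop: try each template in turn, run the same sequence of steps, and keep the first template for which every step succeeds. The sketch below imitates that flow. The step names come from the transcript, but the two `Fake...` classes, the `find_matching_template` function, and the simplified argument-free method signatures are assumptions and not the project's real code.

```python
class FakeMadaraTemplate:
    """Stand-in template that fails while reading the novel info."""

    def initialize(self):
        pass

    def read_novel_info(self):
        raise ValueError("No title found")

    def download_chapter_body(self):
        return "<p>chapter</p>"


class FakeNovelMTLTemplate:
    """Stand-in template for which every step works."""

    def initialize(self):
        pass

    def read_novel_info(self):
        pass

    def download_chapter_body(self):
        return "<p>chapter</p>"


def find_matching_template(candidates):
    """Return the first template class that passes all steps, plus a report."""
    steps = ("initialize", "read_novel_info", "download_chapter_body")
    report = []
    for cls in candidates:
        try:
            crawler = cls()  # "Create instance"
            for step in steps:
                getattr(crawler, step)()
        except Exception as exc:
            # Record the failure and move on to the next template.
            report.append((cls.__name__, f"failed: {exc}"))
            continue
        report.append((cls.__name__, "success"))
        return cls, report
    return None, report


chosen, report = find_matching_template(
    [FakeMadaraTemplate, FakeNovelMTLTemplate]
)
for name, status in report:
    print(f"{name}: {status}")
```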