AlexNeagu123 / FileHippo-Web-Crawler

Web Crawler that extracts data about the products available to download from FileHippo
MIT License

How to run the crawler #1

Open · beshoo opened this issue 3 months ago

beshoo commented 3 months ago

I want to get all the data from the website!

AlexNeagu123 commented 3 months ago

Hello beshoo. I implemented this as a learning task two years ago; the main goal was to learn how to build a web crawler, how to use databases, and how to make APIs. That is why there are three different database options (sqlite3, or PostgreSQL through either the psycopg library or the peewee ORM); I just wanted to do the same thing with three different approaches. The content inserted into the database is the same for any choice. I didn't really expect this crawler to be useful to anybody, but if you only want to run it, that can be done: you run the main.py file inside the crawler folder. The steps are:

1. Install peewee and psycopg before running it (I didn't make a requirements.txt back then).
2. Create a local database; the simplest option is SQLite. Create a new SQLite database "products.db" with a "products" table inside it. The products table should have the following fields: Name, Version, Languages, DateAdded, Final Download Link, Size, Filename (all of type TEXT). A sketch of this setup is included below.
3. On line 23 of main.py, set the db_path argument to the absolute path of the folder that contains "products.db", and make sure that db_name="products".
4. Create a file named "cache.json" somewhere on your system and put its absolute path in the cache_file variable in config.py.
5. One final remark: the program terminates after a specific threshold of requests has been made. You can change that threshold on line 35 of main.py (it is 10 right now). Don't make it too big, because the site imposes a cooldown after roughly 100 requests. You can, however, run the program multiple times and the database will be updated with new information each run; this is the role of the cache file.

After all of this, you can just run crawler/main.py and type sqlite3 at the prompt. Once the program terminates you can see the freshly added data in the "products.db" sqlite3 database. You can also find additional information about what is happening in the "log_file.json" file.
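For step 2, a rough sketch of that SQLite setup is shown below. I am writing it from memory, so the exact column names (especially the one for the final download link) are an assumption; double-check them against the insert statements in crawler/main.py before relying on them.

```python
import sqlite3

# Rough sketch of the manual setup from step 2: create "products.db" with the
# "products" table. Column names follow the description above; verify them
# against the insert statements in crawler/main.py, since the identifiers used
# there may differ (e.g. the download link column).
conn = sqlite3.connect("products.db")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS products (
        Name TEXT,
        Version TEXT,
        Languages TEXT,
        DateAdded TEXT,
        "Final Download Link" TEXT,
        Size TEXT,
        Filename TEXT
    )
    """
)
conn.commit()
conn.close()
```

Once that file exists, point the db_path argument from step 3 at the folder containing it.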

Again, I did all of this purely for learning purposes and didn't prepare it for further use. I hope you find this useful. Alex.