datawizard1337 / ARGUS

ARGUS is an easy-to-use web scraping tool. The program is based on the Scrapy Python framework and is able to crawl a broad range of different websites. On the websites, ARGUS is able to perform tasks like scraping texts or collecting hyperlinks between websites. See: https://link.springer.com/article/10.1007/s11192-020-03726-9
GNU General Public License v3.0

Getting started with ARGUS #53

Closed DioLimpens closed 2 years ago

DioLimpens commented 2 years ago

Dear all,

As I am a novice in web crawling and not experienced in Python, I would like to ask for your help. I might want to use ARGUS for my Master's thesis, but I cannot get it working. I have followed all steps in the readme and even reinstalled everything. Yet when I start the crawler by launching either the no_GUI or the ARGUS.py file from the command prompt (as the GUI won't start for me) and do a test run on the URL list in the misc folder, the spider seems to start but never finishes (I am not certain about this), and it never shows up in the scrapyd web interface: no jobs appear under pending, running, or finished.

So far, I've checked:

I hope you are able to help me further, as I am very interested in learning more about web mining and programming in general. If you require more information or the output I receive, please let me know.

Kind regards,

Dio

datawizard1337 commented 2 years ago

Hey Dio, would you be so nice and post a screenshot of your command line interface after launching the scraping?

DioLimpens commented 2 years ago

Dear Jan, thanks for the quick reply. Please find the screenshots attached. I started scrapyd before launching ARGUS_noGUI with the example URL list in misc. Hopefully this gives some insight into where I am making mistakes. It seems to me that there is an error on the command -list-, yet this should be recognized by Python by default and works fine in the shell. Thank you in advance for your help.

[Screenshots: ARGUS_noGUI run; Scrapyd server output; Scrapyd website output]
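For readers following along: the pending, running, and finished lists shown in the Scrapyd web interface can also be read from Scrapyd's `listjobs.json` endpoint, which makes it easier to paste the result into an issue. The following is a minimal sketch, not part of ARGUS. It assumes Scrapyd is listening on its default port 6800 on the local machine, and the project name `"ARGUS"` is a placeholder assumption; the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def listjobs_url(base_url, project):
    """Build the URL of Scrapyd's listjobs.json endpoint for one project."""
    return f"{base_url.rstrip('/')}/listjobs.json?{urlencode({'project': project})}"


def summarize_jobs(payload):
    """Count the jobs per state in a decoded listjobs.json response."""
    return {
        state: len(payload.get(state, []))
        for state in ("pending", "running", "finished")
    }


def fetch_job_summary(base_url="http://localhost:6800", project="ARGUS"):
    """Query a running Scrapyd server (project name is an assumption)."""
    with urlopen(listjobs_url(base_url, project), timeout=10) as response:
        return summarize_jobs(json.load(response))
```

If `fetch_job_summary()` reports zero jobs in all three states right after a crawl was launched, the job most likely never reached this Scrapyd instance, which matches the symptom described above.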

datawizard1337 commented 2 years ago

Have you tried starting the scrapyd server from inside the ARGUS directory? Navigate to the directory and then open a command prompt there, for example by clicking the path in the navigation bar, typing "cmd", and pressing Enter.

davidlenz commented 2 years ago

Some further reading on this:

DioLimpens commented 2 years ago

Thanks both! Indeed, running scrapyd from the ARGUS directory resolved the first issue, after I added the Anaconda paths to the system PATH. I was running everything from the Anaconda cmd prompt. For my understanding and learning: does this have to do with the Scrapy configuration and settings files in this directory?

Furthermore, the post-processing resulted in an error (Exception in Tkinter callback) similar to the issue "ebergam" had in February. His addition of subprocess.run(r"TSKILL scrapyd", shell=True) solved this problem for me as well.
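For anyone applying the same workaround, here is a slightly more defensive sketch of that call. It is not the code used in ARGUS. It assumes the goal is only to stop a process named `scrapyd` on Windows, where `TSKILL` is available; the function name `stop_scrapyd` is hypothetical, and on other systems the sketch deliberately does nothing.

```python
import platform
import subprocess


def stop_scrapyd(system=None):
    """Try to stop scrapyd with TSKILL.

    Returns True if TSKILL reported success, False if it failed or is
    missing, and None on non-Windows systems where nothing is attempted.
    """
    system = system or platform.system()
    if system != "Windows":
        # TSKILL is a Windows command; this sketch does not guess at
        # an equivalent for other operating systems.
        return None
    try:
        # Same command as in the thread, passed as an argument list
        # so that no shell is needed.
        result = subprocess.run(
            ["TSKILL", "scrapyd"],
            capture_output=True,
            text=True,
            check=False,
        )
    except FileNotFoundError:
        # TSKILL is not on PATH on this machine.
        return False
    return result.returncode == 0
```

Returning a status instead of raising keeps a failed kill from interrupting the post-processing step that triggered it.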

datawizard1337 commented 2 years ago

Yes, if you are running scrapyd outside of the ARGUS directory, scrapy won't find the project files. Happy that we could help you.
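A minimal sketch of that idea in Python, for readers who prefer to start the server from a script rather than from a command prompt opened in the right folder. It assumes the ARGUS directory is the one that contains `scrapy.cfg` and that the `scrapyd` executable is on PATH; the helper names `find_scrapy_project` and `launch_scrapyd` are hypothetical and not part of ARGUS.

```python
import subprocess
from pathlib import Path


def find_scrapy_project(start):
    """Walk upward from `start` until a directory containing scrapy.cfg is found.

    Returns the directory as a Path, or None if no scrapy.cfg is found.
    """
    current = Path(start).resolve()
    for candidate in (current, *current.parents):
        if (candidate / "scrapy.cfg").is_file():
            return candidate
    return None


def launch_scrapyd(project_dir):
    """Start scrapyd with the project directory as its working directory."""
    # Assumption: `scrapyd` is installed and reachable via PATH.
    return subprocess.Popen(["scrapyd"], cwd=str(project_dir))
```

A typical use would be `project = find_scrapy_project(".")`, followed by `launch_scrapyd(project)` only if `project` is not `None`, so that the server is never started from a directory without the project files.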