Problems trying to use the project

AntonioFuziy commented 2 years ago

Hi Andreas, I was searching for reinforcement learning projects that use a web crawler and found yours on GitHub, which seems to be the most complete one. However, I ran into some bugs while following the README instructions. I am trying to use it for a personal business-research project. I saw that the project has specific topics for the web crawler, such as Business, Science and others, but when I tried to use the Business topic for my research I got some bugs and problems. If possible, could you help me solve these issues and explain more about how the project works?

ddaedalus commented 2 years ago

Hello,

What exactly is the problem? Can you explain in more detail?

AntonioFuziy commented 2 years ago

Thank you for the reply.

I am trying to use the Business topic for the crawler. To do that, I changed the taxonomy.py file to search for Business topics, as well as the config.py file, and I used some seeds and data from the files folder that you recommend downloading. In the end, though, when I execute keyword_extract.py I don't know whether the crawler worked, because after I trained the model and executed run_crawling.py, an error message appeared saying that the model folder wasn't found in the KwBiLSTM folder.

[Screenshot from 2022-07-19 15-16-58]

If it wouldn't be a problem, could we schedule a video call on Discord, Google Meet, Zoom or any other platform?

ddaedalus commented 2 years ago

I think the bug is now fixed. After you run "run_classification.py", the "KwBiLSTM" folder will be created, containing the saved model. Then you should run "run_crawling.py". The problem was in the path variable of the saved model, which was set to "model" instead of "KwBiLSTM" (left over from a previous version of the code).

I also urge you to provide your own data for the training (classification.py) of KwBiLSTM, because the "Business.pickle" you downloaded contains random Business URLs from DMOZ, which has not been maintained for a long time, so a lot of its web pages do not exist anymore. I use these URLs (from the files on the drive) to represent the irrelevant (negative) class of the classification (if the given "domain" is different), so this is not a problem in that case. If you use these data for your relevant (positive) class, it is likely that the KwBiLSTM is trained well. Otherwise, if you have not provided your own data, I suggest you open the Business.pickle file with Python's pickle module (it is a Python dict of the form key = URL text, value = page text) and check whether these web page texts really give a relevant word collection (to construct your data).
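
A minimal sketch of how such a check could look, assuming the file really is a dict of URL to page text as described above. The helper names (load_pages, preview) and the file location are made up for illustration and are not part of the project. Note that pickle.load can execute arbitrary code, so only open pickle files from a source you trust.

```python
import pickle
from pathlib import Path


def load_pages(path):
    """Load a pickle that is assumed to hold a dict of {url: page_text}."""
    with open(path, "rb") as f:
        pages = pickle.load(f)
    if not isinstance(pages, dict):
        raise TypeError(f"expected a dict, got {type(pages).__name__}")
    return pages


def preview(pages, n=5, width=200):
    """Return the first n (url, snippet) pairs with whitespace collapsed."""
    result = []
    for url, text in list(pages.items())[:n]:
        snippet = " ".join(str(text).split())[:width]
        result.append((url, snippet))
    return result


if __name__ == "__main__":
    pickle_path = Path("Business.pickle")  # assumed location of the download
    if pickle_path.exists():
        pages = load_pages(pickle_path)
        print(len(pages), "pages loaded")
        for url, snippet in preview(pages):
            # Empty or error-page snippets hint that the URL is dead.
            print(url, "->", snippet or "<empty>")
```

Reading a handful of snippets this way should show quickly whether the page texts are usable or mostly empty and "not found" pages.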

If there is another problem, please inform me.

Thanks for the good feedback.

AntonioFuziy commented 2 years ago

Thanks for the reply, it helped me a lot. I am still struggling with how to change Business.pickle to include the URLs that I want. Also, if I change the taxonomy.py file, would I be able to find the words that I provided as input in this file? I got a little confused, because the README says that I need to change the seeds.txt and data.txt files, but aren't these URLs already included in the Business.pickle file?

ddaedalus commented 2 years ago

There are 3 files you should modify.

You should provide your own keywords in taxonomy.py.
Also, you should provide some URLs (ideally 800-1000) in data.txt (one URL per line, without commas). I understand that this may be a large number, but you can find a lot of Business-related URLs on DMOZ or Curlie (e.g. there are DMOZ datasets on Kaggle, other DMOZ data can be found with a Google search, and Wikipedia provides good texts). Another option, as I said above, is to find relevant URLs in the pickle files (which include URLs from DMOZ). To open a pickle, use the pickle module and its load method, then copy your URLs to data.txt. Bear in mind that many URLs in DMOZ (and therefore in the pickle files) contain no relevant text (because of "not found" errors), so you should always check what you copy to data.txt.
Finally, you must provide a few seed URLs in seeds.txt (of course, you can provide as many as you like).
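
The data.txt step could be sketched roughly as follows, assuming you already have a dict of URL to page text (for example, loaded from one of the pickle files). The function names, the keyword list and the thresholds (min_words, min_hits) are hypothetical choices for illustration, not something the project defines, and a simple substring match like this is only a first filter; checking the selected pages by hand is still advisable.

```python
def select_relevant_urls(pages, keywords, min_words=50, min_hits=1):
    """Keep URLs whose page text is long enough and mentions enough keywords.

    pages is assumed to be a dict of {url: page_text}; the thresholds are
    arbitrary starting points, not values taken from the project.
    """
    lowered_keywords = [k.lower() for k in keywords]
    selected = []
    for url, text in pages.items():
        lowered = str(text).lower()
        if len(lowered.split()) < min_words:
            continue  # very short texts are often "not found" pages
        hits = sum(1 for k in lowered_keywords if k in lowered)
        if hits >= min_hits:
            selected.append(url)
    return selected


def write_url_list(urls, path):
    """Write one URL per line, with no commas or other separators."""
    with open(path, "w", encoding="utf-8") as f:
        for url in urls:
            f.write(url.strip() + "\n")
```

For example, write_url_list(select_relevant_urls(pages, ["finance", "market"]), "data.txt") would write the surviving URLs one per line, where the two keywords stand in for whatever you put in taxonomy.py.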

ddaedalus / tres, issue #1