28.05.2023
[x] Facebook is unpredictable! The first page sometimes gives this:
[x] I found that Facebook disallows crawling :( https://www.facebook.com/robots.txt
[x] One extra use case is testing for invalid links inside the website!
[ ] Check copyright issues!
27.05.2023
[x] Started on the second milestone
[x] Start crawling Facebook posts.
[x] They obfuscate their class names, which makes them hard to read.
They generate different ids and classes on each load, and they are unreadable (e.g. .x193iq5w .xeuugli).
The crawler needs a wait event and also scrolling here; also, we do not want to collect URLs (see the wait-and-scroll sketch at the end of this entry).
Add before and after actions
Add a find-text feature to detect hate speech!
I really like the selector feature that ParseHub uses; it can detect the perfect selector for your element.
Facebook is complaining that I am using an unsupported browser :(
[x] Updated Chrome/22.0.1216.0 to Chrome/110.0.5481.77
[ ] I should allow users to upgrade their browser versions to avoid this issue in the future.
[x] I can crawl one post, yes!
[x] I need to add a controlled workflow of actions.
[x] Added the actions in the UI and the DB, but they have no effect yet!
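A minimal sketch of the wait-and-scroll idea above, assuming a Selenium driver already on the page; the div[role='article'] selector and the timings are placeholders, not verified Facebook internals:

```python
import time

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


def wait_and_scroll(driver, timeout=10, scroll_steps=5, pause=1.5):
    """Wait for a post container to appear, then scroll to trigger lazy loading."""
    # Placeholder selector for a post container.
    WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "div[role='article']"))
    )
    for _ in range(scroll_steps):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # crude pause so newly loaded posts can render
```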
First milestone 29.05.2023
[x] Can submit a crawling job from the GUI (2 days)
[x] Simple job monitoring and controlling (2 days)
[x] Basic index (2 days)
[x] Index from the GUI (2 days)
[x] Simple result page (2 days)
22.05.2023
[x] Add a timeline to the status of the crawlers.
21.05.2023
Try crawling and indexing a new website and list what should be changed and how to solve it.
douglas.de is the next step.
What differentiates this thesis?
Crawling + indexing (No other software is doing it as far as my search goes).
Great performance without IP ROTATION.
Distributed (Most software runs crawling on the cloud or locally but is not distributed).
Secure (No cloud is needed).
The crawler finds all links under a scope.
[ ] What if the product has more than one price?
Note that because I am removing fragments like #, I might lose product variants like https://www.douglas.de/de/p/3001005867?variant=995439 (see the fragment-stripping sketch at the end of this list).
[ ] Add timestamps to show how recent the result is; also, a button to update it?
[x] Upgrade PrimeFlex; I have to remove all the p- prefixes.
[x] Add the property in the UI for image src.
[x] Include the Robots.txt in the crawling process.
[x] Robots.txt points to a sitemap that contains all products: https://www.douglas.de/api/v2/de_DE_dgl/sitemap/sitemap.xml
[x] Added a dropdown menu to choose the indexer.
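A minimal sketch of the fragment stripping mentioned above, using plain urllib; note that it keeps the query string, so ?variant=... URLs survive:

```python
from urllib.parse import urldefrag


def strip_fragment(url: str) -> str:
    """Remove the #fragment part but keep the query string (product variants)."""
    clean, _fragment = urldefrag(url)
    return clean


print(strip_fragment("https://www.douglas.de/de/p/3001005867?variant=995439#reviews"))
# -> https://www.douglas.de/de/p/3001005867?variant=995439
```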
Similar tools:
Indexing:
Most indexing is done in the cloud and needs some programming.
Algolia
Great for indexing with a UI
You have to send your data to their server
Needs some programming
No crawling
Import.io
No indexing
Crawling:
ParseHub
Old UI, and the text on the website looks off.
Issue: when using the iframe, I want to click on accept cookies, but it thinks that I want to scrape this button!
You can also use CSS selectors if you like.
They don't use Chrome!
Similar to my approach
It opens the website using an iframe.
Diffbot:
I like this one
It uses patterns
20.05.2023
[x] Create the result page
[x] Use API and check if it returns correct results.
[x] ~One-word search but not more; not sure why!~ I was appending to the old list instead of reassigning a new one.
[ ] I want to speed up the crawling process.
[ ] I WANT TO BE THE BEST OPEN-SOURCE WEB SCRAPER TOOL THAT EVER EXISTED!
[ ] Searching for similar tools:
[ ] ParseHub
[ ] They use their own browser.
[ ] FREE: Get 200 pages of data in only 40 minutes
[ ] Standard ($189): Get 200 pages of data in only 10 minutes (IP ROTATION)
[ ] Professional ($599): Get 200 pages of data in under 2 minutes (IP ROTATION)
[ ] No rotation / for 25 products from Flaconi: start 2023-05-20T22:16:52, finished 2023-05-20T22:21:44 (about 5 min)
[ ] It took 5 minutes to collect 25 products, and it needs 40 minutes to find 200 products.
[ ] My implementation took double the time, but it found 1300 products: 70% faster, and I can still do better.
18.05.2023
[x] Add the selector indexer drop menu in the UI
[x] Use the Django cache to save indexers
15.05.2023
[ ] Design the search result page
[x] Where to add the indexing step?
[x] Where to add the indexing configs?
[ ] How to persist the indexed map?
[ ] Add property to the inspector UI.
[x] Include the Indexing step in the dashboard
14.05.2023
[ ] Support images and prices inspectors
[x] Support multiple inspectors, like price
[x] Add inspector type (image, text, number)
[x] Try gathering all data from Flaconi (price, title, and images) as a test
[ ] I should add an indicator that shows how well the crawling process is going.
[ ] Number of found documents in comparison to the found links.
[ ] Number of failed HTTP requests.
[ ] Start looking up the indexing process
[ ] Use the inverted list first and enable it in the configuration.
[ ] Inside the search bar, the user can have the option to get the union or intersection of the query terms.
[ ] For the intersection, should I use a k-way intersect or a pairwise merge?
[x] Tokenization
[ ] For indexing in the GUI, I might want to add a checkbox to choose which inspectors should be indexed and which should not; for example, we do not want the image or the price to be indexed.
[ ] Splitting between words should be configurable.
[ ] Handling capital letters should also be configurable: is "Dior" the same as "dior" or not? (A small index sketch follows below.)
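A minimal sketch of the inverted-index idea above, with whitespace tokenization and a lowercase flag standing in for the real configuration; names are illustrative, not the project's API:

```python
from collections import defaultdict


def tokenize(text: str, lowercase: bool = True) -> list[str]:
    """Whitespace tokenization; lowercasing is configurable ('Dior' vs 'dior')."""
    tokens = text.split()
    return [t.lower() for t in tokens] if lowercase else tokens


def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each token to the set of document ids that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index: dict[str, set[int]], query: str, mode: str = "intersection") -> set[int]:
    """Return the union or intersection of the posting sets for the query tokens."""
    postings = [index.get(token, set()) for token in tokenize(query)]
    if not postings:
        return set()
    result = postings[0]
    for posting in postings[1:]:
        result = result & posting if mode == "intersection" else result | posting
    return result


docs = {1: "Dior perfume set", 2: "HUGO BOSS Duftset", 3: "dior lipstick"}
index = build_index(docs)
print(search(index, "dior", mode="union"))             # {1, 3}
print(search(index, "dior set", mode="intersection"))  # {1}
```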
[ ] Crawlers should have an update mode that goes directly to the already-collected URLs!
[x] I tried crawling the whole Flaconi website, BUT some URLs were not discoverable because they were in drop-down menus that needed to be hovered over.
[ ] ~Some links were skipped because they were relative and not absolute.~ My bad, the issue was that I was using 100 links as the maximum.
[x] I ran a crawler for an hour to crawl the Flaconi website, but I got this error after 741 products:
Internal Server Error: /api/runners/start/
Traceback (most recent call last):
File "/home/oc/Documents/git/webscraper/venv/lib/python3.10/site-packages/django/core/handlers/exception.py", line 55, in inner
response = get_response(request)
File "/home/oc/Documents/git/webscraper/venv/lib/python3.10/site-packages/django/core/handlers/base.py", line 197, in _get_response
response = wrapped_callback(request, *callback_args, **callback_kwargs)
File "/home/oc/Documents/git/webscraper/venv/lib/python3.10/site-packages/django/views/decorators/csrf.py", line 54, in wrapped_view
return view_func(*args, **kwargs)
File "/home/oc/Documents/git/webscraper/venv/lib/python3.10/site-packages/rest_framework/viewsets.py", line 125, in view
return self.dispatch(request, *args, **kwargs)
File "/home/oc/Documents/git/webscraper/venv/lib/python3.10/site-packages/rest_framework/views.py", line 509, in dispatch
response = self.handle_exception(exc)
File "/home/oc/Documents/git/webscraper/venv/lib/python3.10/site-packages/rest_framework/views.py", line 469, in handle_exception
self.raise_uncaught_exception(exc)
File "/home/oc/Documents/git/webscraper/venv/lib/python3.10/site-packages/rest_framework/views.py", line 480, in raise_uncaught_exception
raise exc
File "/home/oc/Documents/git/webscraper/venv/lib/python3.10/site-packages/rest_framework/views.py", line 506, in dispatch
response = handler(request, *args, **kwargs)
File "/home/oc/Documents/git/webscraper/backend/webscraper/base/views.py", line 246, in start
find_links()
File "/home/oc/Documents/git/webscraper/backend/webscraper/base/views.py", line 237, in find_links
return find_links()
File "/home/oc/Documents/git/webscraper/backend/webscraper/base/views.py", line 237, in find_links
return find_links()
File "/home/oc/Documents/git/webscraper/backend/webscraper/base/views.py", line 237, in find_links
return find_links()
[Previous line repeated 929 more times]
File "/home/oc/Documents/git/webscraper/backend/webscraper/base/views.py", line 217, in find_links
for a in scoped_element.find_elements(By.CSS_SELECTOR, "a"):
File "/home/oc/Documents/git/webscraper/venv/lib/python3.10/site-packages/selenium/webdriver/remote/webelement.py", line 456, in find_elements
return self._execute(Command.FIND_CHILD_ELEMENTS, {"using": by, "value": value})["value"]
File "/home/oc/Documents/git/webscraper/venv/lib/python3.10/site-packages/selenium/webdriver/remote/webelement.py", line 410, in _execute
return self._parent.execute(command, params)
File "/home/oc/Documents/git/webscraper/venv/lib/python3.10/site-packages/selenium/webdriver/remote/webdriver.py", line 442, in execute
response = self.command_executor.execute(driver_command, params)
File "/home/oc/Documents/git/webscraper/venv/lib/python3.10/site-packages/selenium/webdriver/remote/remote_connection.py", line 294, in execute
return self._request(command_info[0], url, body=data)
File "/home/oc/Documents/git/webscraper/venv/lib/python3.10/site-packages/selenium/webdriver/remote/remote_connection.py", line 316, in _request
response = self._conn.request(method, url, body=body, headers=headers)
File "/home/oc/Documents/git/webscraper/venv/lib/python3.10/site-packages/urllib3/request.py", line 78, in request
return self.request_encode_body(
File "/home/oc/Documents/git/webscraper/venv/lib/python3.10/site-packages/urllib3/request.py", line 170, in request_encode_body
return self.urlopen(method, url, **extra_kw)
File "/home/oc/Documents/git/webscraper/venv/lib/python3.10/site-packages/urllib3/poolmanager.py", line 376, in urlopen
response = conn.urlopen(method, u.request_uri, **kw)
File "/home/oc/Documents/git/webscraper/venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
httplib_response = self._make_request(
File "/home/oc/Documents/git/webscraper/venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 449, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/home/oc/Documents/git/webscraper/venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 444, in _make_request
httplib_response = conn.getresponse()
File "/usr/lib/python3.10/http/client.py", line 1374, in getresponse
response.begin()
File "/usr/lib/python3.10/http/client.py", line 337, in begin
self.headers = self.msg = parse_headers(self.fp)
File "/usr/lib/python3.10/http/client.py", line 236, in parse_headers
return email.parser.Parser(_class=_class).parsestr(hstring)
File "/usr/lib/python3.10/email/parser.py", line 67, in parsestr
return self.parse(StringIO(text), headersonly=headersonly)
File "/usr/lib/python3.10/email/parser.py", line 56, in parse
feedparser.feed(data)
File "/usr/lib/python3.10/email/feedparser.py", line 176, in feed
self._call_parse()
File "/usr/lib/python3.10/email/feedparser.py", line 180, in _call_parse
self._parse()
File "/usr/lib/python3.10/email/feedparser.py", line 295, in _parsegen
if self._cur.get_content_maintype() == 'message':
File "/usr/lib/python3.10/email/message.py", line 594, in get_content_maintype
ctype = self.get_content_type()
File "/usr/lib/python3.10/email/message.py", line 578, in get_content_type
value = self.get('content-type', missing)
File "/usr/lib/python3.10/email/message.py", line 471, in get
return self.policy.header_fetch_parse(k, v)
File "/usr/lib/python3.10/email/_policybase.py", line 316, in header_fetch_parse
return self._sanitize_header(name, value)
File "/usr/lib/python3.10/email/_policybase.py", line 287, in _sanitize_header
if _has_surrogates(value):
File "/usr/lib/python3.10/email/utils.py", line 57, in _has_surrogates
s.encode()
RecursionError: maximum recursion depth exceeded while calling a Python object
[x] I fixed the error above, reran the crawler with limited pages = 3000, and got 1692 documents in 1 hour and 39 minutes (without multi-threading).
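For reference, one way to avoid this class of RecursionError is to replace the recursive find_links calls with an explicit queue; a minimal sketch (illustrative names, not the actual fix in views.py):

```python
from urllib.parse import urljoin

from selenium.webdriver.common.by import By


def find_links_iterative(driver, seed_url: str, max_pages: int = 3000) -> set[str]:
    """Collect links with an explicit queue instead of recursion."""
    queue = [seed_url]
    visited: set[str] = set()
    while queue and len(visited) < max_pages:
        url = queue.pop(0)
        if url in visited:
            continue
        visited.add(url)
        driver.get(url)
        for a in driver.find_elements(By.CSS_SELECTOR, "a"):
            href = a.get_attribute("href")
            if href:
                queue.append(urljoin(url, href))
    return visited
```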
13.05.2023
[x] Show the number of crawled documents. (Count the saved documents)
[x] Show the time spent so far.
[x] Add a polling mechanism.
[x] Improved monitoring
[x] ISSUE: It seems like multiple /start requests to crawl are not running in parallel!
[ ] I was wrong; they share the same logger, so it only seems that way.
[ ] Applying the Crawler configuration into the backend.
[x] Implementing excluded URLs took a lot of time.
[ ] As a follow-up, I want the crawlers to share the visited URLs so they do not repeat the same work.
[x] Added a label and removed the description column to have more room.
09.05.2023
[ ] Monitoring crawlers:
[ ] Show the number of crawled documents. (Count the saved documents)
[ ] Show a log that shows the current URL and the previous ones.
[ ] Create a logger for each crawler.
[ ] Add icons to indicate status (Running, stopped and completed.)
[ ] Show the time spent so far.
[ ] Read more papers:
[ ] Chapter 8: Web Crawling
[ ] Types of crawlers: I am using Preferential crawlers -> Focused crawlers
[ ] Graph traversal: I am using Depth-First Search, but there is also Breadth-First Search.
[ ] You have to handle different errors, like file-not-found or missing pages.
[ ] The crawler must translate relative URLs into absolute URLs: the base URL comes from the HTTP header, an HTML meta tag, or else the current page path by default (see the canonicalization sketch after this list).
[ ] To avoid duplication, the crawler must transform all URLs into canonical form. Look at the slides for examples of what can be fixed.
[ ] Avoiding Spider traps:
[ ] Check URL length; assume a spider trap above some threshold, for example 128 characters.
[ ] Watch for sites with a very large number of URLs
[ ] Eliminate URLs with non-textual data types
[ ] Page repository:
[ ] Naïve: store each page as a separate file (Bad)
[ ] Better: combine many pages into a single large file, using some XML markup to separate and identify them
[ ] Use any RDBMS (I chose this because of the operations and optimizations a DB offers.)
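A small sketch of the relative-to-absolute and canonicalization notes above, with the 128-character spider-trap threshold from the list; the normalization steps are assumed defaults, not the project's implementation:

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

MAX_URL_LENGTH = 128  # spider-trap threshold from the note above


def canonicalize(current_page_url: str, href: str) -> str | None:
    """Resolve a relative href against the current page and normalize it."""
    absolute = urljoin(current_page_url, href)  # relative -> absolute
    scheme, netloc, path, query, _fragment = urlsplit(absolute)
    canonical = urlunsplit((scheme.lower(), netloc.lower(), path or "/", query, ""))
    if len(canonical) > MAX_URL_LENGTH:
        return None  # assume a spider trap above the threshold
    return canonical


print(canonicalize("https://www.douglas.de/de/c/parfum", "/de/p/3001005867?variant=995439"))
# -> https://www.douglas.de/de/p/3001005867?variant=995439
```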
08.05.2023
[x] PBS takes a lot of time to start running; find out why.
[x] I had to use a start_script.sh file to submit the job. Took me the whole day.
[ ] Not sure why, but you have to submit a job twice to make it work!
[x] Add a control button to the runners.
[x] Save the runner to the DB once it starts to run.
[x] Create a status attribute to know if it is running, stopped, or completed.
[x] Added Stop endpoint to stop crawlers.
07.05.2023
[x] Use PBS to run the command in the background
[x] Run . /etc/profile.d/pbs.sh to source the PBS environment (as root).
[x] Simple hello world test:
#!/bin/bash
#PBS -N HelloWorld
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:01:00
#PBS -o HelloWorld.out
#PBS -e HelloWorld.err
# This script echoes "Hello, World!" to the standard output
echo "Hello, World!"
[x] Use an external database instead of the Django default DB.
[x] Configure the app so it can add ssh keys and communicate with the containers.
[x] Do not forget to use the robots.txt file.
[x] Do not forget to make the crawling links a list, and the excluded URLs a list as well.
[x] Fix the chromedriver path issue.
06.05.2023
[x] Can submit a new runner
[x] Prevent string injection by using shlex (a small shlex.quote sketch follows below).
[x] How to run the crawling process in the background?
[x] Consider using PBS!
[x] Create dockerfiles to do so.
[x] I need crawlerNode to be created and installed into each PBS node.
[x] Add basic commands to the django side to control the PBS nodes.
[x] Push local containers to the github registry: echo $CR_PAT | docker login ghcr.io -u USERNAME --password-stdin
[ ] I might just pull from git and run with ./manage
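A tiny sketch of the shlex idea above; the crawl command itself is made up for illustration:

```python
import shlex


def build_crawl_command(url: str) -> str:
    """Quote the user-supplied URL so it cannot inject extra shell arguments."""
    return f"python manage.py crawl {shlex.quote(url)}"  # hypothetical command


print(build_crawl_command("https://example.com; rm -rf /"))
# -> python manage.py crawl 'https://example.com; rm -rf /'
```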
[x] Installed chromedriver, but when I run the server and make a POST request, the RAM shoots up and the app crashes. Increasing the shared memory fixed it: shm_size: '2gb' in the docker-compose file.
[x] Had to add the no-sandbox option to fix an issue with the driver working in the Docker image.
[x] How to handle CSS classes that contain random suffixes? For example, the Flaconi product name class is BrandName--1vge3k; to fix this, we can use contains instead of equals: [class*='BrandName'] (see the extraction sketch at the end of this entry).
[ ] Websites do not use links to go to the next page. Instead, they use a load-more button ("Mehr laden").
[ ] I can add actions before the crawling, like clicking a button and waiting!
[ ] When you only want to crawl one product, it can happen that more is collected, because all links are collected first. Having a scope div and only crawling inside scope_div is a good idea. The issue here is that sometimes random ads are added to the scope_div; not sure how to fix this.
[x] Use a limited page number (This should be close to the number of products you want).
[x] Restrict the depth of crawling to one! (This means we only search for links on the seed page.)
[ ] Exclude links that can cause issues.
[ ] What if the base page does not exist? We should throw an error.
[ ] How to handle multi-line text names? For example, the Flaconi product name is made up of HUGO BOSS \nBOSS Orange \nDuftset (see the extraction sketch below).
[x] How can I improve the performance while still avoiding DoS?
[x] Using one tab rather than opening a new Chrome tab each time? Compare it to see if it is indeed faster!
[x] Multi-threading: how many threads? Make it configurable!
[x] Maybe try different browsers instead of Chromium?
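A hedged sketch of the two Flaconi extraction notes above (the contains-selector for hashed class names and joining the multi-line product name); the selector and driver setup are assumptions:

```python
from selenium.webdriver.common.by import By


def extract_product_name(driver) -> str:
    """Find the name element via a 'contains' class selector and flatten newlines."""
    # [class*='BrandName'] matches BrandName--1vge3k regardless of the hashed suffix.
    element = driver.find_element(By.CSS_SELECTOR, "[class*='BrandName']")
    # "HUGO BOSS\nBOSS Orange\nDuftset" -> "HUGO BOSS BOSS Orange Duftset"
    return " ".join(element.text.splitlines())
```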