28.05.2023
[x] Facebook is unpredictable! The first page sometimes gives this:
[x] I found that Facebook disallows crawling :( https://www.facebook.com/robots.txt
[x] One extra use case is testing for invalid links inside the website!
[ ] Check copyright issues!
27.05.2023
[x] Started on the second milestone
[x] Start crawling Facebook posts.
[x] They obfuscate their class names, which makes them hard to read.
They generate different ids and classes on each load, and they are unreadable (e.g. .x193iq5w .xeuugli).
The crawler needs a wait event and also scrolling here; also, we do not want to collect URLs (see the wait-and-scroll sketch at the end of this entry).
Add before and after actions
Add a find-text feature to detect hate speech!
I really like the selector feature that ParseHub uses; it can detect the perfect selector for your element.
Facebook is complaining that I am using an unsupported browser :(
[x] Updated Chrome/22.0.1216.0 to Chrome/110.0.5481.77
[ ] I should allow users to upgrade their browser versions to avoid this issue in the future.
[x] I can crawl one post, yes!
[x] I need to add a controlled workflow of actions.
[x] Added the actions in the UI and the DB, but they have no effect yet!
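A minimal sketch of the wait-and-scroll idea above, assuming a Selenium driver already on the page; the div[role='article'] selector and the timings are placeholders, not verified Facebook internals:

```python
import time

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


def wait_and_scroll(driver, timeout=10, scroll_steps=5, pause=1.5):
    """Wait for a post container to appear, then scroll to trigger lazy loading."""
    # Placeholder selector for a post container.
    WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "div[role='article']"))
    )
    for _ in range(scroll_steps):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # crude pause so newly loaded posts can render
```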
First milestone 29.05.2023
[x] Can submit a crawling job from the GUI (2 days)
[x] Simple job monitoring and controlling (2 days)
[x] Basic index (2 days)
[x] Index from the GUI (2 days)
[x] Simple result page (2 days)
22.05.2023
[x] Add a timeline to the status of the crawlers.
21.05.2023
Try crawling and indexing a new website and list what should be changed and how to solve it.
douglas.de is the next step.
What differentiates this thesis?
Crawling + indexing (No other software is doing it as far as my search goes).
Great performance without IP ROTATION.
Distributed (Most software runs crawling on the cloud or locally but is not distributed).
Secure (No cloud is needed).
The crawler finds all links under a scope.
[ ] What if the product has more than one price?
Note that because I am removing fragments like #, I might lose product variants like https://www.douglas.de/de/p/3001005867?variant=995439 (see the fragment-stripping sketch at the end of this list).
[ ] Add timestamps to show how recent the result is; also, a button to update it?
[x] Upgrade PrimeFlex; I have to remove all the p- prefixes.
[x] Add the property in the UI for image src.
[x] Include the Robots.txt in the crawling process.
[x] Robots.txt points to a sitemap that contains all products: https://www.douglas.de/api/v2/de_DE_dgl/sitemap/sitemap.xml
[x] Added a dropdown menu to choose the indexer.
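A minimal sketch of the fragment stripping mentioned above, using plain urllib; note that it keeps the query string, so ?variant=... URLs survive:

```python
from urllib.parse import urldefrag


def strip_fragment(url: str) -> str:
    """Remove the #fragment part but keep the query string (product variants)."""
    clean, _fragment = urldefrag(url)
    return clean


print(strip_fragment("https://www.douglas.de/de/p/3001005867?variant=995439#reviews"))
# -> https://www.douglas.de/de/p/3001005867?variant=995439
```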
Similar tools:
Indexing:
Most indexing is done in the cloud and needs some programming.
Algolia
Great for indexing with a UI
You have to send your data to their server
Needs some programming
No crawling
Import.io
No indexing
Crawling:
ParseHub
Old UI, and the text on the website looks off.
Issue: when using the iframe, I want to click on accept cookies, but it thinks that I want to scrape this button!
You can also use CSS selectors if you like.
They don't use Chrome!
Similar to my approach
It opens the website using an iframe.
Diffbot:
I like this one
It uses patterns
20.05.2023
[x] Create the result page
[x] Use API and check if it returns correct results.
[x] ~One-word search but not more; not sure why!~ I was appending to the old list instead of reassigning a new one.
[ ] I want to speed up the crawling process.
[ ] I WANT TO BE THE BEST OPEN-SOURCE WEB SCRAPER TOOL THAT EVER EXISTED!
[ ] Searching for similar tools:
[ ] ParseHub
[ ] They use their own browser.
[ ] FREE: Get 200 pages of data in only 40 minutes
[ ] Standard ($189): Get 200 pages of data in only 10 minutes (IP ROTATION)
[ ] Professional ($599): Get 200 pages of data in under 2 minutes (IP ROTATION)
[ ] No rotation / for 25 products from Flaconi: start 2023-05-20T22:16:52, finished 2023-05-20T22:21:44 (about 5 min)
[ ] It took 5 minutes to collect 25 products, and it needs 40 minutes to find 200 products.
[ ] My implementation took double the time, but it found 1300 products: 70% faster, and I can still do better.
18.05.2023
[x] Add the selector indexer drop menu in the UI
[x] Use the Django cache to save indexers
15.05.2023
[ ] Design the search result page
[x] Where to add the indexing step?
[x] Where to add the indexing configs?
[ ] How to persist the indexed map?
[ ] Add property to the inspector UI.
[x] Include the Indexing step in the dashboard
14.05.2023
[ ] Support images and prices inspectors
[x] Support multiple inspectors, like price
[x] Add inspector type (image, text, number)
[x] Try gathering all data from Flaconi (price, title, and images) as a test
[ ] I should add an indicator that shows how well the crawling process is going.
[ ] Number of found documents in comparison to the found links.
[ ] Number of failed HTTP requests.
[ ] Start looking up the indexing process
[ ] Use the inverted list first and enable it in the configuration.
[ ] Inside the search bar, the user can have the option to get the union or intersection of the query terms.
[ ] For the intersection, should I use a k-way intersect or a pairwise merge?
[x] Tokenization
[ ] For indexing in the GUI, I might want to add a checkbox to choose which inspectors should be indexed and which should not; for example, we do not want the image or the price to be indexed.
[ ] Splitting between words should be configurable.
[ ] Handling capital letters should also be configurable: is "Dior" the same as "dior" or not? (A small index sketch follows below.)
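A minimal sketch of the inverted-index idea above, with whitespace tokenization and a lowercase flag standing in for the real configuration; names are illustrative, not the project's API:

```python
from collections import defaultdict


def tokenize(text: str, lowercase: bool = True) -> list[str]:
    """Whitespace tokenization; lowercasing is configurable ('Dior' vs 'dior')."""
    tokens = text.split()
    return [t.lower() for t in tokens] if lowercase else tokens


def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each token to the set of document ids that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index: dict[str, set[int]], query: str, mode: str = "intersection") -> set[int]:
    """Return the union or intersection of the posting sets for the query tokens."""
    postings = [index.get(token, set()) for token in tokenize(query)]
    if not postings:
        return set()
    result = postings[0]
    for posting in postings[1:]:
        result = result & posting if mode == "intersection" else result | posting
    return result


docs = {1: "Dior perfume set", 2: "HUGO BOSS Duftset", 3: "dior lipstick"}
index = build_index(docs)
print(search(index, "dior", mode="union"))             # {1, 3}
print(search(index, "dior set", mode="intersection"))  # {1}
```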
[ ] Crawlers should have an update mode that goes directly to the already-collected URLs!
[x] I tried crawling the whole Flaconi website, BUT some URLs were not discoverable because they were in drop-down menus that needed to be hovered over.
[ ] ~Some links were skipped because they were relative and not absolute.~ My bad, the issue was that I was using 100 links as the maximum.
[x] I ran a crawler for an hour to crawl the Flaconi website, but I got this error after 741 products:
Internal Server Error: /api/runners/start/
Traceback (most recent call last):
File "/home/oc/Documents/git/webscraper/venv/lib/python3.10/site-packages/django/core/handlers/exception.py", line 55, in inner
response = get_response(request)
File "/home/oc/Documents/git/webscraper/venv/lib/python3.10/site-packages/django/core/handlers/base.py", line 197, in _get_response
response = wrapped_callback(request, *callback_args, **callback_kwargs)
File "/home/oc/Documents/git/webscraper/venv/lib/python3.10/site-packages/django/views/decorators/csrf.py", line 54, in wrapped_view
return view_func(*args, **kwargs)
File "/home/oc/Documents/git/webscraper/venv/lib/python3.10/site-packages/rest_framework/viewsets.py", line 125, in view
return self.dispatch(request, *args, **kwargs)
File "/home/oc/Documents/git/webscraper/venv/lib/python3.10/site-packages/rest_framework/views.py", line 509, in dispatch
response = self.handle_exception(exc)
File "/home/oc/Documents/git/webscraper/venv/lib/python3.10/site-packages/rest_framework/views.py", line 469, in handle_exception
self.raise_uncaught_exception(exc)
File "/home/oc/Documents/git/webscraper/venv/lib/python3.10/site-packages/rest_framework/views.py", line 480, in raise_uncaught_exception
raise exc
File "/home/oc/Documents/git/webscraper/venv/lib/python3.10/site-packages/rest_framework/views.py", line 506, in dispatch
response = handler(request, *args, **kwargs)
File "/home/oc/Documents/git/webscraper/backend/webscraper/base/views.py", line 246, in start
find_links()
File "/home/oc/Documents/git/webscraper/backend/webscraper/base/views.py", line 237, in find_links
return find_links()
File "/home/oc/Documents/git/webscraper/backend/webscraper/base/views.py", line 237, in find_links
return find_links()
File "/home/oc/Documents/git/webscraper/backend/webscraper/base/views.py", line 237, in find_links
return find_links()
[Previous line repeated 929 more times]
File "/home/oc/Documents/git/webscraper/backend/webscraper/base/views.py", line 217, in find_links
for a in scoped_element.find_elements(By.CSS_SELECTOR, "a"):
File "/home/oc/Documents/git/webscraper/venv/lib/python3.10/site-packages/selenium/webdriver/remote/webelement.py", line 456, in find_elements
return self._execute(Command.FIND_CHILD_ELEMENTS, {"using": by, "value": value})["value"]
File "/home/oc/Documents/git/webscraper/venv/lib/python3.10/site-packages/selenium/webdriver/remote/webelement.py", line 410, in _execute
return self._parent.execute(command, params)
File "/home/oc/Documents/git/webscraper/venv/lib/python3.10/site-packages/selenium/webdriver/remote/webdriver.py", line 442, in execute
response = self.command_executor.execute(driver_command, params)
File "/home/oc/Documents/git/webscraper/venv/lib/python3.10/site-packages/selenium/webdriver/remote/remote_connection.py", line 294, in execute
return self._request(command_info[0], url, body=data)
File "/home/oc/Documents/git/webscraper/venv/lib/python3.10/site-packages/selenium/webdriver/remote/remote_connection.py", line 316, in _request
response = self._conn.request(method, url, body=body, headers=headers)
File "/home/oc/Documents/git/webscraper/venv/lib/python3.10/site-packages/urllib3/request.py", line 78, in request
return self.request_encode_body(
File "/home/oc/Documents/git/webscraper/venv/lib/python3.10/site-packages/urllib3/request.py", line 170, in request_encode_body
return self.urlopen(method, url, **extra_kw)
File "/home/oc/Documents/git/webscraper/venv/lib/python3.10/site-packages/urllib3/poolmanager.py", line 376, in urlopen
response = conn.urlopen(method, u.request_uri, **kw)
File "/home/oc/Documents/git/webscraper/venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
httplib_response = self._make_request(
File "/home/oc/Documents/git/webscraper/venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 449, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/home/oc/Documents/git/webscraper/venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 444, in _make_request
httplib_response = conn.getresponse()
File "/usr/lib/python3.10/http/client.py", line 1374, in getresponse
response.begin()
File "/usr/lib/python3.10/http/client.py", line 337, in begin
self.headers = self.msg = parse_headers(self.fp)
File "/usr/lib/python3.10/http/client.py", line 236, in parse_headers
return email.parser.Parser(_class=_class).parsestr(hstring)
File "/usr/lib/python3.10/email/parser.py", line 67, in parsestr
return self.parse(StringIO(text), headersonly=headersonly)
File "/usr/lib/python3.10/email/parser.py", line 56, in parse
feedparser.feed(data)
File "/usr/lib/python3.10/email/feedparser.py", line 176, in feed
self._call_parse()
File "/usr/lib/python3.10/email/feedparser.py", line 180, in _call_parse
self._parse()
File "/usr/lib/python3.10/email/feedparser.py", line 295, in _parsegen
if self._cur.get_content_maintype() == 'message':
File "/usr/lib/python3.10/email/message.py", line 594, in get_content_maintype
ctype = self.get_content_type()
File "/usr/lib/python3.10/email/message.py", line 578, in get_content_type
value = self.get('content-type', missing)
File "/usr/lib/python3.10/email/message.py", line 471, in get
return self.policy.header_fetch_parse(k, v)
File "/usr/lib/python3.10/email/_policybase.py", line 316, in header_fetch_parse
return self._sanitize_header(name, value)
File "/usr/lib/python3.10/email/_policybase.py", line 287, in _sanitize_header
if _has_surrogates(value):
File "/usr/lib/python3.10/email/utils.py", line 57, in _has_surrogates
s.encode()
RecursionError: maximum recursion depth exceeded while calling a Python object
[x] I fixed the error above, reran the crawler with limited pages = 3000, and got 1692 documents in 1 hour and 39 minutes (without multi-threading).
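For reference, one way to avoid this class of RecursionError is to replace the recursive find_links calls with an explicit queue; a minimal sketch (illustrative names, not the actual fix in views.py):

```python
from urllib.parse import urljoin

from selenium.webdriver.common.by import By


def find_links_iterative(driver, seed_url: str, max_pages: int = 3000) -> set[str]:
    """Collect links with an explicit queue instead of recursion."""
    queue = [seed_url]
    visited: set[str] = set()
    while queue and len(visited) < max_pages:
        url = queue.pop(0)
        if url in visited:
            continue
        visited.add(url)
        driver.get(url)
        for a in driver.find_elements(By.CSS_SELECTOR, "a"):
            href = a.get_attribute("href")
            if href:
                queue.append(urljoin(url, href))
    return visited
```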
13.05.2023
[x] Show the number of crawled documents. (Count the saved documents)
[x] Show the time spent so far.
[x] Add a polling mechanism.
[x] Improved monitoring
[x] ISSUE: It seems like multiple /start requests to crawl are not running in parallel!
[ ] I was wrong; they share the same logger, so it only seems that way.
[ ] Applying the Crawler configuration into the backend.
[x] Implementing excluded URLs took a lot of time.
[ ] As a follow-up, I want the crawlers to share the visited URLs so they do not repeat the same work.
[x] Added a label and removed the description column to have more room.
09.05.2023
[ ] Monitoring crawlers:
[ ] Show the number of crawled documents. (Count the saved documents)
[ ] Show a log that shows the current URL and the previous ones.
[ ] Create a logger for each crawler.
[ ] Add icons to indicate status (Running, stopped and completed.)
[ ] Show the time spent so far.
[ ] Read more papers:
[ ] Chapter 8: Web Crawling
[ ] Types of crawlers: I am using Preferential crawlers -> Focused crawlers
[ ] Graph traversal: I am using Depth-First Search, but there is also Breadth-First Search.
[ ] You have to handle different errors, like file-not-found or missing pages.
[ ] The crawler must translate relative URLs into absolute URLs: the base URL comes from the HTTP header, an HTML meta tag, or else the current page path by default (see the canonicalization sketch after this list).
[ ] To avoid duplication, the crawler must transform all URLs into canonical form. Look at the slides for examples of what can be fixed.
[ ] Avoiding Spider traps:
[ ] Check URL length; assume a spider trap above some threshold, for example 128 characters.
[ ] Watch for sites with a very large number of URLs
[ ] Eliminate URLs with non-textual data types
[ ] Page repository:
[ ] Naïve: store each page as a separate file (Bad)
[ ] Better: combine many pages into a single large file, using some XML markup to separate and identify them
[ ] Use any RDBMS (I chose this because of the operations and optimizations a DB offers.)
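A small sketch of the relative-to-absolute and canonicalization notes above, with the 128-character spider-trap threshold from the list; the normalization steps are assumed defaults, not the project's implementation:

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

MAX_URL_LENGTH = 128  # spider-trap threshold from the note above


def canonicalize(current_page_url: str, href: str) -> str | None:
    """Resolve a relative href against the current page and normalize it."""
    absolute = urljoin(current_page_url, href)  # relative -> absolute
    scheme, netloc, path, query, _fragment = urlsplit(absolute)
    canonical = urlunsplit((scheme.lower(), netloc.lower(), path or "/", query, ""))
    if len(canonical) > MAX_URL_LENGTH:
        return None  # assume a spider trap above the threshold
    return canonical


print(canonicalize("https://www.douglas.de/de/c/parfum", "/de/p/3001005867?variant=995439"))
# -> https://www.douglas.de/de/p/3001005867?variant=995439
```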
08.05.2023
[x] PBS takes a lot of time to start running; find out why.
[x] I had to use a start_script.sh file to submit the job. Took me the whole day.
[ ] Not sure why, but you have to submit a job twice to make it work!
[x] Add a control button to the runners.
[x] Save the runner to the DB once it starts to run.
[x] Create a status attribute to know if it is running, stopped, or completed.
[x] Added Stop endpoint to stop crawlers.
07.05.2023
[x] Use PBS to run the command in the background
[x] Run . /etc/profile.d/pbs.sh to source the PBS environment (as root).
[x] Simple hello world test:
#!/bin/bash
#PBS -N HelloWorld
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:01:00
#PBS -o HelloWorld.out
#PBS -e HelloWorld.err
# This script echoes "Hello, World!" to the standard output
echo "Hello, World!"
[x] Use an external database instead of the Django default DB.
[x] Configure the app so it can add ssh keys and communicate with the containers.
[x] Do not forget to use the robots.txt file.
[x] Do not forget to make the crawling links a list, and the excluded URLs a list as well.
[x] Fix the chromedriver path issue.
06.05.2023
[x] Can submit a new runner
[x] Prevent string injection by using shlex (a small shlex.quote sketch follows below).
[x] How to run the crawling process in the background?
[x] Consider using PBS!
[x] Create dockerfiles to do so.
[x] I need crawlerNode to be created and installed into each PBS node.
[x] Add basic commands to the django side to control the PBS nodes.
[x] Push local containers to the github registry: echo $CR_PAT | docker login ghcr.io -u USERNAME --password-stdin
[ ] I might just pull from git and run with ./manage
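A tiny sketch of the shlex idea above; the crawl command itself is made up for illustration:

```python
import shlex


def build_crawl_command(url: str) -> str:
    """Quote the user-supplied URL so it cannot inject extra shell arguments."""
    return f"python manage.py crawl {shlex.quote(url)}"  # hypothetical command


print(build_crawl_command("https://example.com; rm -rf /"))
# -> python manage.py crawl 'https://example.com; rm -rf /'
```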
[x] Installed chromedriver, but when I run the server and make a POST request, the RAM shoots up and the app crashes. Increasing the shared memory fixed it: shm_size: '2gb' in the docker-compose file.
[x] Had to add the no-sandbox option to fix an issue with the driver working in the Docker image.
[x] How to handle CSS classes that contain random suffixes? For example, the Flaconi product name class is BrandName--1vge3k; to fix this, we can use contains instead of equals: [class*='BrandName'] (see the extraction sketch at the end of this entry).
[ ] Websites do not use links to go to the next page. Instead, they use a load-more button ("Mehr laden").
[ ] I can add actions before the crawling, like clicking a button and waiting!
[ ] When you only want to crawl one product, it can happen that more is collected, because all links are collected first. Having a scope div and only crawling inside scope_div is a good idea. The issue here is that sometimes random ads are added to the scope_div; not sure how to fix this.
[x] Use a limited page number (This should be close to the number of products you want).
[x] Restrict the depth of crawling to one! (This means we only search for links on the seed page.)
[ ] Exclude links that can cause issues.
[ ] What if the base page does not exist? We should throw an error.
[ ] How to handle multi-line text names? For example, the Flaconi product name is made up of HUGO BOSS \nBOSS Orange \nDuftset (see the extraction sketch below).
[x] How can I improve the performance while still avoiding DoS?
[x] Using one tab rather than opening a new Chrome tab each time? Compare it to see if it is indeed faster!
[x] Multi-threading: how many threads? Make it configurable!
[x] Maybe try different browsers instead of Chromium?
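A hedged sketch of the two Flaconi extraction notes above (the contains-selector for hashed class names and joining the multi-line product name); the selector and driver setup are assumptions:

```python
from selenium.webdriver.common.by import By


def extract_product_name(driver) -> str:
    """Find the name element via a 'contains' class selector and flatten newlines."""
    # [class*='BrandName'] matches BrandName--1vge3k regardless of the hashed suffix.
    element = driver.find_element(By.CSS_SELECTOR, "[class*='BrandName']")
    # "HUGO BOSS\nBOSS Orange\nDuftset" -> "HUGO BOSS BOSS Orange Duftset"
    return " ".join(element.text.splitlines())
```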