[x]
testing_urls = [
crawler.seed_url,
crawler.seed_url,
]
Thread: 2
Level is used with FIFO DSC
Docs: 101
Visited Links: 0
total_non_useful_links: 0
Took: 233.37684226036072 s -> 3.8 min
threads_metrics: {48368: 344, 48367: 464}
Thread: 2
Level is used with FIFO DSC
Docs: 101
Visited Links: 0
total_non_useful_links: 0
Took: 223.89409279823303 s
threads_metrics: {50977: 304, 50975: 504}
[x]
testing_urls = [
crawler.seed_url,
crawler.seed_url,
]
Thread: 2
Level is used with FIFO ASC
Docs: 101
Visited Links: 0
total_non_useful_links: 0
Took: 551.1641173362732 s -> 9.1 min
threads_metrics: {25883: 376, 25881: 432}
Level is used with FIFO DSC
Docs: 100
Visited Links: 0
total_non_useful_links: 0
Took: 456.75313687324524 s
threads_metrics: {35956: 464, 35955: 336}
08.06.2023
[x] Allow user to use pop() or pop(0)
[x] Crawl based on levels
[x] Allow user to choose if the crawl should return a list or one document only
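The two pop strategies and the level-based crawl from the tasks above can be sketched in one frontier class. This is a minimal sketch; `Frontier`, `push`, and the `(level, url)` tuples are illustrative names, not the project's actual API:

```python
class Frontier:
    """Minimal sketch of a link frontier that supports both strategies:
    pop_last=True  -> list.pop()  (LIFO, depth-first-like order)
    pop_last=False -> list.pop(0) (FIFO, level-by-level order)
    """

    def __init__(self, pop_last=False):
        self.pop_last = pop_last
        self.links = []  # (level, url) pairs in insertion order

    def push(self, url, level=0):
        self.links.append((level, url))

    def pop(self):
        # pop() returns the newest link, pop(0) the oldest one
        return self.links.pop() if self.pop_last else self.links.pop(0)


fifo = Frontier(pop_last=False)
fifo.push("https://example.com/a", level=0)
fifo.push("https://example.com/b", level=1)
print(fifo.pop())  # -> (0, 'https://example.com/a'): level 0 comes out first
```

With pop(0), links are visited in discovery order, which is what makes crawling level by level possible; pop() instead dives into the newest branch first.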
05.06.2023
[x] Save the whole document at once instead of saving each inspector value at a time. This ensures data integrity and solves the issue when I want to crawl a full movies list on one page.
[x] Think about making fingerprint for updating.
[x] Test the idea of crawling based on level
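The whole-document save and the fingerprint idea combine naturally into one sketch. The names here (`save_document`, the in-memory `collection`, the field names) are illustrations, not the project's real storage layer:

```python
import hashlib
import json

def save_document(collection, inspector_values):
    """Assemble every inspector value first, then write the document with a
    single call instead of one write per value, so a half-crawled page can
    never end up as a half-saved document."""
    doc = dict(inspector_values)
    # Content fingerprint: on a re-crawl, an unchanged fingerprint means the
    # stored document can be skipped instead of rewritten.
    doc["fingerprint"] = hashlib.sha256(
        json.dumps(inspector_values, sort_keys=True).encode()
    ).hexdigest()
    collection.append(doc)  # stand-in for one real DB insert/upsert
    return doc

docs = []
save_document(docs, {"title": "Duftset", "price": "29.99"})
```

Because the fingerprint is deterministic over the sorted values, two crawls of an unchanged page produce the same hash, which is exactly what an update check needs.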
More testings:
------------------------ After adding the bulk of the document saving ---------
[x]
testing_urls = [
crawler.seed_url,
]
Last in, first served (LIFO)
Thread: 2
Visited Links: 164
total_non_useful_links: 1623
Took: 358.93656516075134 s
threads_metrics: {126713: 216, 126712: 592}
[x]
testing_urls = [
crawler.seed_url,
]
First in, first served (FIFO) pipe
Thread: 2
Visited Links: 207
total_non_useful_links: 3468
Took: 677.7878086566925 s
threads_metrics: {121744: 488, 121743: 312}
Issue: I found that only one thread continues and the others die! Alternatively, when one is done, I should reuse it.
Douglas crawl: started 10:40 -> 12:53, 2902 docs, only one thread survived!
Handle the stale exception: "Message: stale element reference: stale element not found (Session info: headless chrome=113.0.5672.126)". It stops the crawler threads.
Found that the crawler visits 1775 links only to find 100 docs!
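A sketch of a fix for both problems, the dying threads and the stale-element crashes: catch per-page exceptions inside the worker loop so one bad page cannot kill a thread. The `fetch` callback and the queue-based frontier are assumptions for illustration; in the real crawler the caught exception would be selenium's StaleElementReferenceException:

```python
import queue
import threading

def worker(frontier, results, fetch):
    """Crawl worker that survives per-page failures: without the inner
    try/except, one stale-element error ends the thread and the crawl
    silently loses a worker."""
    while True:
        try:
            url = frontier.get_nowait()
        except queue.Empty:
            return  # frontier drained: exit cleanly instead of idling
        try:
            results.append(fetch(url))  # list.append is atomic under the GIL
        except Exception:
            pass  # e.g. a stale element: skip this page, keep the thread alive

frontier = queue.Queue()
for u in ["a", "stale", "b"]:
    frontier.put(u)
results = []

def fetch(url):
    if url == "stale":
        raise RuntimeError("stale element reference")
    return url.upper()

threads = [threading.Thread(target=worker, args=(frontier, results, fetch))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))  # -> ['A', 'B']: both threads finished, bad page skipped
```

Reusing finished threads is the other option from the note above; a `concurrent.futures.ThreadPoolExecutor` would give that reuse for free.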
12.06.2023
11.06.2023
I found that the way I am testing is not the best. I should do the following for testing:
Make the number of threads configurable
Make allow-multi configurable
Test against https://crawler-test.com/
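Those knobs could be made explicit in one config object. The field names `threads`, `allow_multi`, and `fifo` are my guesses at the switches mentioned above, not the project's real option names:

```python
from dataclasses import dataclass

@dataclass
class CrawlerConfig:
    """Every value that varied between the runs below becomes a named
    parameter, so each test run documents its own configuration."""
    seed_url: str
    threads: int = 2
    allow_multi: bool = True   # the "allow multi" switch from the notes
    fifo: bool = True          # pop(0) vs pop()
    max_docs: int = 100

# One reproducible test setup against the crawler test site:
cfg = CrawlerConfig(seed_url="https://crawler-test.com/", threads=4)
```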
Threads: 4, Visited Links: 1093, total_non_useful_links: 352, Took: 757.6430785655975 s -> 12.6 min, threads_metrics: {73746: 566, 73745: 358, 73743: 558, 73747: 160}
Threads: 4, Visited Links: 1094, total_non_useful_links: 349, Took: 914.214373588562 s -> 15 min, threads_metrics: {70103: 582, 70102: 548, 70104: 318, 70105: 284}
Visited Links: 299, total_non_useful_links: 678, Took: 345.136757850647 s, threads_metrics: {50178: 378}
Visited Links: 958, total_non_useful_links: 326, Took: 1067.6898975372314 s, threads_metrics: {52349: 1356}
[x] testing_urls = [ crawler.seed_url, ] Thread: 4, Docs: 4251, Took: 62 min
[x] testing_urls = [ crawler.seed_url, ] Thread: 1, Docs: 2000.0, Visited Links: 0, total_non_useful_links: 0, Took: 1189.5401849746704 s, threads_metrics: {18362: 9002}
10.06.2023
09.06.2023
Thread: 2, Docs: 816, Visited Links: 0, total_non_useful_links: 0, Took: 2622.433438539505 s, threads_metrics: {113043: 4660, 113042: 3852}
Thread: 1, Docs: 817, Visited Links: 0, total_non_useful_links: 0, Took: 3380.6824486255646 s, threads_metrics: {93606: 6550}
Thread: 4, Docs: 800, Visited Links: 0, total_non_useful_links: 0, Took: 2338.493413209915 s -> 39 min, threads_metrics: {63675: 2024, 63677: 2302, 63673: 2514, 63676: 1248}
Thread: 4, Docs: 103, Visited Links: 0, total_non_useful_links: 0, Took: 198.45312595367432 s, threads_metrics: {60160: 248, 60157: 256, 60161: 224, 60159: 152}
05.06.2023
[x] testing_urls = [ crawler.seed_url, ] Thread: 2, Visited Links: 151, total_non_useful_links: 474, Took: 343.1837058067322 s, threads_metrics: {96549: 1262, 96550: 1390}
[x] testing_urls = [ crawler.seed_url, ]
First try: Thread: 1, Visited Links: 137, total_non_useful_links: 628, Took: 420.3422930240631 s, threads_metrics: {94184: 1986}
Note: bulk saving made the algorithm slower.
------------------------ Before adding the bulk of the document saving ---------
[x] testing_urls = [ crawler.seed_url, crawler.seed_url, crawler.seed_url, crawler.seed_url, crawler.seed_url, crawler.seed_url, ] 6 threads only, Visited Links: 238, total_non_useful_links: 2771, Took: 558.6237244606018 s, threads_metrics: {82180: 320, 82179: 240, 82181: 64, 82184: 160, 82183: 8, 82178: 18}
[x] testing_urls = [ crawler.seed_url, "https://www.douglas.de/de/c/parfum/damenduefte/duschpflege/010107", ]
Changed the URLs. 2 threads only. First run: Visited Links: 126, total_non_useful_links: 470, Took: 215.15883469581604 s -> 3.5 min, threads_metrics: {69885: 344, 69884: 464}
Second run: Visited Links: 126, total_non_useful_links: 473, Took: 225.6277961730957 s -> 3.75 min, threads_metrics: {72023: 344, 72022: 456}
2 threads only. First run: Visited Links: 135, total_non_useful_links: 1453, Took: 306.88921093940735 s -> 5.1 min, threads_metrics: {65029: 584, 65030: 216}
Second run: Visited Links: 141, total_non_useful_links: 1523, Took: 295.5270366668701 s, threads_metrics: {67435: 216, 67434: 584}
2 threads only. Visited Links: 136, total_non_useful_links: 471, Took: 283.0769159793854 s, threads_metrics: {61601: 456, 61602: 352}
[x] testing_urls = [ crawler.seed_url, "https://www.douglas.de/de/c/parfum/damenduefte/koerperpflege/010108", "https://www.douglas.de/de/c/parfum/damenduefte/duschpflege/010107", "https://www.douglas.de/de/c/parfum/damenduefte/parfum/010106", ] Visited Links: 149, Took: 242.76 s -> 4 min, threads_metrics: {44798: 256, 44796: 344, 44797: 200}
[x] testing_urls = [ crawler.seed_url, "https://www.douglas.de/de/c/parfum/damenduefte/koerperpflege/010108", "https://www.douglas.de/de/c/parfum/damenduefte/duschpflege/010107", "https://www.douglas.de/de/c/parfum/damenduefte/parfum/010106", ] Visited Links: 147, total_non_useful_links: 1594, Took: 250.88929748535156 s -> 4.1 min, threads_metrics: {48936: 256, 48934: 344, 48935: 200}
[x] testing_urls = [ crawler.seed_url, crawler.seed_url, crawler.seed_url, crawler.seed_url, ] Visited Links: 130, total_non_useful_links: 676, Took: 226.44421529769897 s -> 3.7 min, threads_metrics: {52264: 240, 52262: 72, 52263: 216, 52265: 288}
[x] testing_urls = [ crawler.seed_url, crawler.seed_url, crawler.seed_url, crawler.seed_url, ] Visited Links: 163, total_non_useful_links: 1477, Took: 294.63897037506104 s, threads_metrics: {55131: 88, 55129: 224, 55130: 400, 55128: 112}
[x] testing_urls = [ crawler.seed_url, crawler.seed_url, crawler.seed_url, crawler.seed_url, ] Visited Links: 140, total_non_useful_links: 819, Took: 205.24403977394104 s -> 3.4 min, threads_metrics: {58904: 272, 58905: 160, 58906: 288, 58907: 88}
25.05.2023
[x] Start Ranking:
[x] configuration only uses term frequency tf
[x] Add stop words (Skip them)
[x] BM25
[x] Add k and b as parameters
[ ] Maximum results on the search page
[x] Each inspector can have a weight.
[x] Add (remove short words with a parameter of maximum chars)
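The ranking items above (tf, stop words, BM25 with k and b, short-word removal) fit into one scoring function. This is a generic BM25 sketch, not the project's actual implementation; the stop-word set and the thresholds are placeholder values:

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k=1.5, b=0.75,
               stop_words=frozenset({"the", "a", "an", "and"}), min_chars=3):
    """BM25 over a corpus given as lists of terms. Stop words and words
    shorter than min_chars are skipped; k and b are tunable parameters."""
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n  # average document length
    score = 0.0
    for term in query_terms:
        if term in stop_words or len(term) < min_chars:
            continue  # the stop-word and short-word filters
        tf = doc_terms.count(term)           # plain term frequency
        df = sum(1 for d in corpus if term in d)
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)  # smoothed IDF
        score += idf * tf * (k + 1) / (
            tf + k * (1 - b + b * len(doc_terms) / avgdl))
    return score

corpus = [["parfum", "damen", "duftset"], ["pflege", "dusche"]]
print(bm25_score(["parfum"], corpus[0], corpus) > 0)  # -> True
```

The per-inspector weights would then be a weighted sum: score each field (title, description, ...) separately with bm25_score and multiply by that field's weight.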
[ ] Evaluate Crawler
[ ] Configuration: no multi-threading: Took: 938.077956 s -> 15.5 min -> 100 Docs
[ ] Seed:
https://www.douglas.de/de/c/parfum/damenduefte/duftsets/010111
[ ] Configuration: 3 Threads: Took: 269.389 s -> 4.5 min -> 100 Docs, threads_metrics: {94014: 50, 94015: 49, 94017: 1}
[ ] Seeds:
https://www.douglas.de/de/c/parfum/damenduefte/duftsets/010111
https://www.douglas.de/de/c/parfum/damenduefte/koerperpflege/010108
https://www.douglas.de/de/c/parfum/damenduefte/duschpflege/010107
[ ] Configuration: 4 Threads: Took: 232.36 s -> 3.8 min -> 100 Docs, threads_metrics: {94014: 50, 94015: 49, 94017: 1}
[ ] Seeds:
https://www.douglas.de/de/c/parfum/damenduefte/duftsets/010111
https://www.douglas.de/de/c/parfum/damenduefte/koerperpflege/010108
https://www.douglas.de/de/c/parfum/damenduefte/duschpflege/010107
https://www.douglas.de/de/c/parfum/damenduefte/parfum/010106
[ ] Configuration: 4 Threads: Took: 159.169141 s -> 2.6 min -> 100 Docs, threads_metrics: {109179: 16, 109182: 9, 109181: 35, 109183: 40}
[ ] Seeds:
https://www.douglas.de/de/c/parfum/damenduefte/duftsets/010111
https://www.douglas.de/de/c/parfum/damenduefte/koerperpflege/010108
https://www.douglas.de/de/c/parfum/damenduefte/duschpflege/010107
https://www.douglas.de/de/c/parfum/damenduefte/parfum/010106
[ ] Configuration: 4 Threads: Pages: 2012, Took: 159.169141 s -> 6 min -> 100 Docs, threads_metrics: {109179: 134, 109182: 232, 109181: 192, 109183: 68}
[ ] Seeds:
https://www.douglas.de/de/c/parfum/damenduefte/duftsets/010111
https://www.douglas.de/de/c/parfum/damenduefte/koerperpflege/010108
https://www.douglas.de/de/c/parfum/damenduefte/duschpflege/010107
https://www.douglas.de/de/c/parfum/damenduefte/parfum/010106
[ ] Configuration: 4 Threads: Took: 6483.38577 s -> 1 h 48 min -> 1874 Docs, threads_metrics: {112515: 4418, 112508: 3442, 112506: 66, 112514: 3254, 112504: 4}
[ ] Seeds:
https://www.douglas.de/de/c/parfum/damenduefte/duftsets/010111
https://www.douglas.de/de/c/parfum/damenduefte/koerperpflege/010108
https://www.douglas.de/de/c/parfum/damenduefte/duschpflege/010107
https://www.douglas.de/de/c/parfum/damenduefte/parfum/010106
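The seconds-to-minutes conversions in these entries are easy to fumble by hand. A small helper, assuming threads_metrics maps thread id to links handled, produces the derived numbers consistently:

```python
def summarize_run(took_s, docs, threads_metrics):
    """Format one evaluation entry: duration in minutes, throughput in
    docs per minute, and each thread's share of the handled links."""
    minutes = took_s / 60
    total = sum(threads_metrics.values())
    return {
        "minutes": round(minutes, 1),
        "docs_per_min": round(docs / minutes, 1),
        "thread_share": {t: round(n / total, 2)
                         for t, n in threads_metrics.items()},
    }

# The 3-thread evaluation run from above:
print(summarize_run(269.389, 100, {94014: 50, 94015: 49, 94017: 1}))
```

Applied to the 3-thread run, it makes the imbalance obvious: thread 94017 handled only 1% of the links, which matches the "only one thread survives" issue noted earlier.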