co0p / x-scrap

scraping the kreuzwerker blog list for fun

Bug: wrong counts with multiple urls #8

Open glenacota opened 1 year ago

glenacota commented 1 year ago

single url#1

> go run cmd/xscrap/main.go -urls https://kreuzwerker.de/post/aws-summit-2022-berlin-have-some-pie -tags Elasticsearch

url: https://kreuzwerker.de/post/aws-summit-2022-berlin-have-some-pie
-----------------------
Elasticsearch: 2

single url#2

> go run cmd/xscrap/main.go -urls https://kreuzwerker.de/post/opensource-is-fun -tags Elasticsearch

url: https://kreuzwerker.de/post/opensource-is-fun
-----------------------
Elasticsearch: 2

single url#3

> go run cmd/xscrap/main.go -urls https://kreuzwerker.de/post/exploring-logging-strategies-with-the-elastic-stack -tags Elasticsearch

url: https://kreuzwerker.de/post/exploring-logging-strategies-with-the-elastic-stack
-----------------------
Elasticsearch: 11

but url#1+url#2+url#3

> go run cmd/xscrap/main.go -urls https://kreuzwerker.de/post/aws-summit-2022-berlin-have-some-pie,https://kreuzwerker.de/post/opensource-is-fun,https://kreuzwerker.de/post/exploring-logging-strategies-with-the-elastic-stack -tags Elasticsearch

url: https://kreuzwerker.de/post/aws-summit-2022-berlin-have-some-pie
-----------------------
Elasticsearch: 2

url: https://kreuzwerker.de/post/opensource-is-fun
-----------------------
Elasticsearch: 6

url: https://kreuzwerker.de/post/exploring-logging-strategies-with-the-elastic-stack
-----------------------
Elasticsearch: 39

glenacota commented 1 year ago

It seems the issue is connected to the reuse of the colly instance (https://github.com/co0p/x-scrap/blob/master/cmd/xscrap/main.go#L15).

In fact, resetting the c.found field in https://github.com/co0p/x-scrap/blob/master/infra/scraper/colly.go#L24 stops the count from carrying over between subsequent urls. Another problem remains, though: the html content of the 2nd url is fetched twice, doubling the number of found tags; the html content of the 3rd url is fetched three times; and so on. Together, the two effects account for the counts reported above: 6 = 2 + 2×2 and 39 = 6 + 3×11.

When the Colly.collector field is completely re-initialised for every url, the html content of every url is fetched only once.