cnumr / ecoindex_python_fullstack

Refactoring of ecoindex in one monorepo using polylith pattern

[Bug]: Mismatch with nodes_count #81

Open arichard-info opened 4 weeks ago

arichard-info commented 4 weeks ago

What happened?

I think there is a mismatch with the nodes count metric and I can't find why.

When analyzing https://www.kiabi.com on ecoindex.fr, I get ~3500 elements. That sounds like a lot, and I can't reach that count when I measure it on my own.

For example, if I count all the elements from the browser console, I only get 790 nodes (~1700 if I scroll down):

document.querySelectorAll('*').length

I understand that the official script uses Playwright: https://github.com/cnumr/ecoindex_python_fullstack/blob/main/components/ecoindex/scraper/scrap.py#L133

So I created a very basic Playwright script that counts the elements of a page:

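import { test, expect } from '@playwright/test';

test('test', async ({ page }) => {
  await page.goto('https://www.kiabi.com');
  // Collect every element on the page, then fail on purpose so the report prints the count.
  const elements = await page.locator('*').all();
  expect(elements.length).toBeLessThan(100);
});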

I'm using exactly the same locator syntax as the ecoindex script.

Result:

    Expected: < 100
    Received:   792

I don't understand why I'm so far away from the ecoindex.fr results. And in both cases I don't subtract the SVG elements.
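A minimal sketch of the SVG adjustment mentioned above, assuming the intent is to exclude elements nested inside an `<svg>` from the total. The `ElementNode` type, the `sample` tree, and both function names are hypothetical stand-ins for the DOM and are not taken from the ecoindex code.

```typescript
// Hypothetical minimal tree type standing in for DOM elements.
interface ElementNode {
  tag: string;
  children: ElementNode[];
}

// Count the node itself plus all of its descendants.
function countAll(node: ElementNode): number {
  return 1 + node.children.reduce((sum, child) => sum + countAll(child), 0);
}

// Count only the elements that have an <svg> ancestor.
function countSvgDescendants(node: ElementNode, insideSvg = false): number {
  const self = insideSvg ? 1 : 0;
  const nowInside = insideSvg || node.tag.toLowerCase() === "svg";
  return (
    self +
    node.children.reduce(
      (sum, child) => sum + countSvgDescendants(child, nowInside),
      0
    )
  );
}

// Small made-up page: html > body > [div, svg > g > [path, path]]
const leaf = (tag: string): ElementNode => ({ tag, children: [] });
const sample: ElementNode = {
  tag: "html",
  children: [
    {
      tag: "body",
      children: [
        leaf("div"),
        {
          tag: "svg",
          children: [{ tag: "g", children: [leaf("path"), leaf("path")] }],
        },
      ],
    },
  ],
};

const total = countAll(sample);               // 7 elements in total
const svgInner = countSvgDescendants(sample); // 3 elements inside the <svg>
const adjusted = total - svgInner;            // 4 elements after the subtraction
```

In a browser console, the rough equivalent would be `document.querySelectorAll('*').length - document.querySelectorAll('svg *').length`. Since the subtraction can only lower the count, it would not by itself explain a higher figure on ecoindex.fr.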

Am I missing something or is there a problem with the playwright used by ecoindex.fr?

Project

Ecoindex Scraper

What OS do you use?

Mac

vvatelot commented 3 weeks ago

Hello @arichard-info, to be more precise, here is the scenario played by ecoindex: https://www.ecoindex.fr/en/how-it-works/#analysis-methodology

Have you tried running the complete scenario?

arichard-info commented 3 weeks ago

Hello @vvatelot, thank you for the answer. Yes, that's the same scenario I played.

By the way, the official ecoindex.fr scenario can't run in full on the site I gave as an example, nor on many other sites because of the cookie banners that often block scrolling until they've been accepted or declined.

In my scenario, I add a Playwright step that accepts third-party cookies in the banner. This way I can execute the rest of the scenario: scroll down and wait three seconds. So I'd expect more nodes than via the ecoindex.fr site. But as I've explained, I'm far from it.
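The steps described above can be sketched as follows. This is an illustration only: `ScenarioPage`, `runScenario`, and `makeFakePage` are hypothetical names, and each method is assumed to wrap a browser-automation call (for instance a Playwright navigation, click, or timeout). The in-memory fake only checks the order of the steps, and the count it returns is a placeholder rather than a measurement.

```typescript
// Hypothetical abstraction over a browser page.
interface ScenarioPage {
  open(url: string): Promise<void>;
  acceptCookies(): Promise<void>;   // site-specific: click the banner's accept button
  scrollToBottom(): Promise<void>;
  wait(ms: number): Promise<void>;
  countElements(): Promise<number>; // e.g. the number of matches for the "*" selector
}

// Run the steps in order and return the element count.
async function runScenario(page: ScenarioPage, url: string): Promise<number> {
  await page.open(url);
  await page.acceptCookies();  // unblock scrolling before continuing
  await page.scrollToBottom();
  await page.wait(3000);       // three-second pause after scrolling
  return page.countElements();
}

// In-memory fake that records each call instead of driving a browser.
function makeFakePage(log: string[]): ScenarioPage {
  return {
    open: async (url) => { log.push(`open ${url}`); },
    acceptCookies: async () => { log.push("acceptCookies"); },
    scrollToBottom: async () => { log.push("scrollToBottom"); },
    wait: async (ms) => { log.push(`wait ${ms}`); },
    countElements: async () => { log.push("countElements"); return 42; }, // placeholder value
  };
}
```

Keeping the steps behind an interface makes it possible to swap the cookie step in or out and compare the resulting counts with the same code path.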

I can't explain why I'm getting so many nodes on the homepage of the site I gave as an example (https://www.kiabi.com).

Whether I use a Playwright or Puppeteer scenario, or count manually in the browser, I always get a much lower number of nodes than the one returned by ecoindex.fr.

vvatelot commented 3 weeks ago

I ran tests with headless mode enabled and disabled.

Mode      Node count
Headless  ~3500
Headful   ~700

By default, ecoindex runs in headless mode. I don't know if I can make it work in headful mode in a container...

But, in the end, I don't know why the kiabi website shows such a difference. I exported the two HAR files if you want to investigate further: https://gist.github.com/vvatelot/12d8470de4ff83d586408f0225e6424b