aw-studio / laravel-indexer


Scraping problem with indexer #8

Open aw-gerrit opened 3 years ago

aw-gerrit commented 3 years ago

Use case

https://sag-sh.de

Issue: no hit although term exists on page

The search term „Albert-Schweitzer-Schule“ should produce a hit on the following page, but doesn't: https://sag-sh.de/referenzschulnetzwerk/archiv

On the page, the term appears as a span inside a link:

<a data-v-0c0bb767="" href="http://www.ass-wedel.de/" target="_blank" rel="noopener noreferrer" class="gtl-link"> <span data-v-0c0bb767="" class="font-semibold">Albert-Schweitzer-Schule</span></a>

jannescb commented 3 years ago

This is because of how the page is fetched here:

https://github.com/aw-studio/laravel-indexer/blob/f84ff3bfc4edbfd02b4292208353875a4b8a1cbb/src/Commands/CreateIndexCommand.php#L73

file_get_contents() returns the raw HTML and does not execute JavaScript. The table in your example is a Vue app, so its content is not in the fetched markup.

@cbl

How much effort do you think it would take to add an optional Chromium feature that renders each URL?

Alternative

We might consider integrating Browsershot, which can also return the body of an HTML page after JavaScript has been executed:

Browsershot::url('https://example.com')->bodyHtml()
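A rough sketch of how an optional rendering step could look, assuming spatie/browsershot is installed along with a headless Chromium. The helper names fetchHtml() and containsTerm() and the $renderJavaScript flag are hypothetical and not part of laravel-indexer; the sample markup is made up for illustration.

```php
<?php

// Hypothetical helper: fetch a page either as raw HTML or, optionally,
// as the DOM produced after JavaScript has run (via Browsershot).
function fetchHtml(string $url, bool $renderJavaScript = false): string
{
    // Assumption: spatie/browsershot is available through the autoloader.
    if ($renderJavaScript && class_exists(\Spatie\Browsershot\Browsershot::class)) {
        return \Spatie\Browsershot\Browsershot::url($url)->bodyHtml();
    }

    // Fallback: raw HTML only, no JavaScript execution.
    $html = @file_get_contents($url);

    return $html === false ? '' : $html;
}

// Hypothetical helper: check whether a term occurs in the visible text of some HTML.
function containsTerm(string $html, string $term): bool
{
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');

    return stripos($text, $term) !== false;
}

// Illustrative markup only: an unrendered Vue mount point versus rendered output.
$static   = '<div id="app"></div><script src="/js/app.js"></script>';
$rendered = '<a href="http://www.ass-wedel.de/" class="gtl-link"> '
          . '<span class="font-semibold">Albert-Schweitzer-Schule</span></a>';

var_dump(containsTerm($static, 'Albert-Schweitzer-Schule'));
var_dump(containsTerm($rendered, 'Albert-Schweitzer-Schule'));
```

Keeping the rendering behind a flag would let sites without a Vue frontend keep the cheaper file_get_contents() path.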

aw-gerrit commented 3 years ago

Could the @bot Blade directive also be a solution for this? It would ensure proper Google indexing as well.

jannescb commented 3 years ago

@aw-gerrit

Yes. If you build a server-rendered bot version and the @bot directive triggers on the user agent sent by PHP, it would work.