fisharebest / webtrees

Online genealogy
https://webtrees.net
GNU General Public License v3.0

OpenAI on Webtrees #4867

Closed FrankWarius closed 11 months ago

FrankWarius commented 11 months ago

OpenAI crawled my website completely in 3 days from about 140 IP addresses.

Notably, the crawler never requested a sitemap. That is strange, because this is its first visit that I have noticed since I began evaluating my logs in December 2022.

In addition to the desired pages (INDI, FAM, SOUR, OBJE and, as an experiment, also the INDI and FAM lists), many pages that have a NOINDEX tag were also crawled.

It might be useful to add a NOFOLLOW tag to all Webtrees menu entries that refer to a NOINDEX page.

| URI1 | URI2 | URI3 | 08. Aug | 09. Aug | 15. Aug | 16. Aug | Sum |
| -- | -- | -- | -- | -- | -- | -- | -- |
| tree | Warius | ancestors-tree-4 | 3 |  | 2 | 7 | 12 |
|  |  | anniversary-ics | 34 | 1 | 10 | 57 | 102 |
|  |  | branches | 141 |  | 45 | 248 | 434 |
|  |  | calendar | 293 |  | 113 | 483 | 889 |
|  |  | compact | 17 |  | 3 | 30 | 50 |
|  |  | descendants-tree-3 | 15 |  | 9 | 20 | 44 |
|  |  | family | 948 | 6 | 381 | 1559 | 2894 |
|  |  | family-book-2-5-0 | 3 |  |  | 5 | 8 |
|  |  | family-list | 121 |  | 49 | 188 | 358 |
|  |  | fan-chart-3-4-100 | 3 |  |  | 4 | 7 |
|  |  | hourglass-3-0 | 5 |  | 1 | 9 | 15 |
|  |  | individual | 2302 | 3 | 888 | 3841 | 7034 |
|  |  | individual-list | 160 |  | 45 | 277 | 482 |
|  |  | lifespans | 3 |  |  | 5 | 8 |
|  |  | media | 82 | 1 | 40 | 135 | 258 |
|  |  | note | 1 |  |  | 2 | 3 |
|  |  | note-list | 1 |  |  | 1 | 2 |
|  |  | pedigree-map-4 | 6 | 1 | 4 | 9 | 20 |
|  |  | pedigree-right-4 | 3 |  | 2 | 4 | 9 |
|  |  | place-list | 63 |  | 22 | 105 | 190 |
|  |  | relationships-0-99 | 10 |  | 5 | 18 | 33 |
|  |  | report | 5 |  | 4 | 10 | 19 |
|  |  | repository | 7 |  | 3 | 11 | 21 |
|  |  | search-advanced | 1 |  |  | 1 | 2 |
|  |  | search-phonetic |  |  |  | 1 | 1 |
|  |  | source | 248 |  | 94 | 410 | 752 |
|  |  | source-list | 1 |  |  | 1 | 2 |
|  |  | timeline-10 | 1 |  |  | 1 | 2 |
|  |  | (blank) | 1 |  |  | 1 | 2 |
| robots.txt |  |  | 26 | 2 | 10 | 42 | 80 |
| module |  |  | 11 |  | 3 | 17 | 31 |
| Sum |  |  | 4515 | 14 | 1733 | 7502 | 13764 |
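A tally like the one above can be produced by grouping log entries on the first three path segments and the day. The following is only a minimal sketch of that idea, not the evaluation that was actually used here: the function names, the `(day, path)` input shape and the sample paths are all assumptions made for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def first_segments(url_path, depth=3):
    """Return the first `depth` path segments, padded with empty strings."""
    parts = [p for p in urlsplit(url_path).path.split("/") if p]
    parts = parts[:depth]
    return tuple(parts + [""] * (depth - len(parts)))


def count_hits(records):
    """Count requests per (URI1, URI2, URI3, day).

    `records` is an iterable of (day, path) pairs that has already been
    filtered down to the crawler of interest (assumed input shape).
    """
    counts = Counter()
    for day, path in records:
        counts[first_segments(path) + (day,)] += 1
    return counts


# Hypothetical sample entries, not taken from the real log.
sample = [
    ("08. Aug", "/tree/Warius/individual/I1/John-Doe"),
    ("08. Aug", "/tree/Warius/individual/I2/Jane-Doe"),
    ("09. Aug", "/tree/Warius/family/F1/Doe"),
    ("09. Aug", "/robots.txt"),
]
hits = count_hits(sample)
```

Padding short paths with empty strings keeps single-segment requests such as `/robots.txt` in the URI1 column, which matches the layout of the table.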

fisharebest commented 11 months ago

This robot has already crawled my site.

> that have a NOINDEX tag.

I have already added it to the blocklist :-)

https://github.com/fisharebest/webtrees/commit/698f77b0bad606388c0f9be1ab47e3cdaf6d5ff2
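The linked commit is not reproduced here. As a rough illustration of what blocking a crawler by its User-Agent header can look like, here is a minimal sketch. It assumes the crawler identifies itself with the `GPTBot` token that OpenAI documents for its crawler. The list contents and the function names are hypothetical and are not webtrees' actual code.

```python
# Hypothetical blocklist: example tokens only, not webtrees' real list.
BLOCKED_AGENT_TOKENS = ("GPTBot", "ExampleBadBot")


def is_blocked(user_agent: str) -> bool:
    """Return True if the User-Agent contains a blocked token (case-insensitive)."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in BLOCKED_AGENT_TOKENS)


def status_for(user_agent: str) -> int:
    """Pick an HTTP status: 403 for blocked crawlers, 200 otherwise."""
    return 403 if is_blocked(user_agent) else 200
```

A check like this only works for crawlers that send an honest User-Agent header. Unlike a NOINDEX tag, it refuses the request instead of relying on the crawler's cooperation.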

fisharebest commented 11 months ago

> It might be useful to add a NOFOLLOW tag to all Webtrees menu entries that refer to a NOINDEX page.

We already do this, e.g.:

https://github.com/fisharebest/webtrees/blob/f94b830f9a4ef11b0a336999817b29b5b6e0acf0/app/Module/ModuleChartTrait.php#L120

https://github.com/fisharebest/webtrees/blob/f94b830f9a4ef11b0a336999817b29b5b6e0acf0/app/Module/ModuleReportTrait.php#L73
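To check on a rendered page whether the menu links really carry `rel="nofollow"`, one option is to parse the HTML and list every link that lacks it. This is a minimal sketch using Python's standard `html.parser`. The class name, the function name and the sample markup are hypothetical and are not taken from webtrees.

```python
from html.parser import HTMLParser


class NofollowAudit(HTMLParser):
    """Collect the href of every <a> tag that lacks rel="nofollow"."""

    def __init__(self):
        super().__init__()
        self.followed = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        # rel may hold several space-separated tokens, or be absent.
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if href and "nofollow" not in rel_tokens:
            self.followed.append(href)


def links_without_nofollow(html):
    audit = NofollowAudit()
    audit.feed(html)
    audit.close()
    return audit.followed


# Hypothetical menu markup for illustration.
menu = """
<a href="/tree/demo/individual/I1">Individual</a>
<a href="/tree/demo/pedigree-right-4/I1" rel="nofollow">Pedigree</a>
"""
print(links_without_nofollow(menu))
```

Note that `rel="nofollow"` is only a hint: a crawler that ignores it will still request the linked page.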

fisharebest commented 11 months ago

I think that webtrees already does everything that you are asking.

If you can think of anything else, please create another issue for it.

FrankWarius commented 11 months ago

Thank you, that is correct. (Exception: Contact, see #4868.)

Google has thousands of such links (charts, reports) stored, and it still queries them regularly. They can't all be more than 5 years old.