ArchiveTeam / ArchiveBot

ArchiveBot, an IRC bot for archiving websites
http://www.archiveteam.org/index.php?title=ArchiveBot
MIT License

primary_netloc and primary_url use the parent instead of root URL #538

Open JustAnotherArchivist opened 1 year ago

JustAnotherArchivist commented 1 year ago

The effect is that an ignore using these placeholders won't behave as expected if it is added later in the job. For example, a job for https://example.org/ comes across a link to https://example.net/, which in turn contains a frame https://example.net/foo. If the ignore ^https?://(?!{primary_netloc}/) is added at the beginning, only the first URL is retrieved. If it is added after https://example.net/ has been retrieved, all three are retrieved even though the frame should be ignored: because of this bug, primary_netloc is already example.net at that point, so the ignore doesn't match.
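The mismatch can be reproduced outside ArchiveBot with a small standalone sketch. This is not ArchiveBot's code: the helper `is_ignored` and the way `{primary_netloc}` is substituted (escaped and inserted with `str.replace`) are assumptions made for illustration only. It just shows how the same pattern gives a different verdict for the frame depending on whether the placeholder is filled from the root URL or from the parent URL.

```python
import re
from urllib.parse import urlsplit

# The ignore pattern from the example above.
IGNORE = r'^https?://(?!{primary_netloc}/)'


def is_ignored(url, primary_url, pattern=IGNORE):
    """Hypothetical helper: expand {primary_netloc} from primary_url and
    report whether url matches the resulting ignore pattern."""
    netloc = urlsplit(primary_url).netloc
    # Assumption: the netloc is regex-escaped before substitution.
    expanded = pattern.replace('{primary_netloc}', re.escape(netloc))
    return re.search(expanded, url) is not None


root = 'https://example.org/'      # URL the job was started for
parent = 'https://example.net/'    # page that contains the frame
frame = 'https://example.net/foo'  # frame that should be ignored

# Placeholder filled from the root URL (expected behaviour).
expected = is_ignored(frame, primary_url=root)
# Placeholder filled from the parent URL (behaviour described in this issue).
buggy = is_ignored(frame, primary_url=parent)

print('root-based:  ', expected)
print('parent-based:', buggy)
```

With the root-based netloc the frame matches the ignore; with the parent-based netloc the negative lookahead sees example.net/ and the pattern no longer matches, so the frame gets retrieved.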

This was introduced by 967d5aa6 while porting to wpull 2. The ignoracle tests are currently broken and disabled (4d3e4fc7) and need to be fixed first.