ScottMansfield / widow

Distributed, asynchronous web crawler
GNU Lesser General Public License v2.1

a tags with img tags containing the same image should not be sent back to the fetch stage #7

Closed: ScottMansfield closed this issue 9 years ago

ScottMansfield commented 9 years ago

The href attribute of the a tag should be recorded as an outgoing link, but it should NOT be sent back to the fetch stage.

The src attribute of the img tag should be recorded as an img link.

I can probably filter the outLinks collection by the imgLinks collection to exclude those items from what goes back to the fetch stage.
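A minimal sketch of that filtering idea, written in Python purely for illustration (not necessarily the project's language). The names LinkCollector and links_to_fetch are hypothetical. Every a href is still recorded as an outgoing link; only the list handed back to the fetch stage is filtered. Note that it excludes a link only when its resolved URL exactly matches an img src.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects a[href] and img[src] as absolute URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.out_links = []  # every href found on an <a> tag
        self.img_links = []  # every src found on an <img> tag

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.out_links.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "img" and attrs.get("src"):
            self.img_links.append(urljoin(self.base_url, attrs["src"]))


def links_to_fetch(html, base_url):
    """Return (out_links, img_links, to_fetch).

    to_fetch is out_links minus any URL that also appears in img_links,
    so all outgoing links stay recorded but matching images are not refetched.
    """
    collector = LinkCollector(base_url)
    collector.feed(html)
    images = set(collector.img_links)
    to_fetch = [url for url in collector.out_links if url not in images]
    return collector.out_links, collector.img_links, to_fetch
```

For example, a page containing an a tag that wraps an img with the same URL, plus an ordinary link, would keep both hrefs in out_links but return only the ordinary link in to_fetch.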

ScottMansfield commented 9 years ago

This was actually incorrect: the linked images and the contained images were different sizes, so filtering one collection by the other did not exclude them. I solved this by doing a HEAD request on every extracted link and excluding any whose Content-Type is image/*.
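The HEAD-request approach might look roughly like the sketch below. It is in Python for illustration only and is not the project's actual code. The function names and the injectable content_type_of parameter are assumptions, added so the filter can be exercised without network access.

```python
import urllib.request


def head_content_type(url, timeout=10):
    """Issue a HEAD request and return the raw Content-Type header ('' if absent).

    Network and HTTP errors are not handled here; they propagate to the caller.
    """
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("Content-Type", "")


def is_image(content_type):
    """True when the media type (ignoring parameters such as charset) is image/*."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type.startswith("image/")


def exclude_images(urls, content_type_of=head_content_type):
    """Keep only the URLs whose Content-Type is not image/*."""
    return [url for url in urls if not is_image(content_type_of(url))]
```

Because the check depends on what the server reports rather than on the URL, it catches a linked image even when it differs from the thumbnail shown on the page. The trade-off is one extra request per extracted link.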