c4software / python-sitemap

Mini website crawler to make sitemap from a website.
GNU General Public License v3.0

Limit search to path instead of domain? #50

Open 1kastner opened 5 years ago

1kastner commented 5 years ago

Would it be possible to restrict the search to a certain path? A bad example would be to restrict a search to http://google.com/maps/ and ignore results that are in other "subdirectories" of http://google.com/. Using "domain" for this purpose does not work.

c4software commented 5 years ago

Hi,

Sorry for the delay. You can do it via

--exclude "maps/"

But it has to be exhaustive.

Do you want something generic for all subfolders?

1kastner commented 5 years ago

Well, what I actually need is include logic, which is not yet implemented in https://github.com/c4software/python-sitemap/blob/master/main.py
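To make the request concrete, here is a minimal sketch of what such a path restriction could look like. It is not code from python-sitemap: `is_within_path` is a hypothetical helper, and it assumes the crawler can test each discovered URL before queuing it.

```python
from urllib.parse import urlparse


def is_within_path(url, base_url):
    """Return True if url is on the same host as base_url and under its path.

    Hypothetical helper for illustration only; not part of python-sitemap.
    """
    base = urlparse(base_url)
    target = urlparse(url)

    # A different host is always outside the restricted area.
    if target.netloc != base.netloc:
        return False

    # Ensure a trailing slash so "/maps" does not also match "/mapsomething".
    base_path = base.path if base.path.endswith("/") else base.path + "/"

    # Accept the base path itself (without the slash) and anything below it.
    return target.path == base_path.rstrip("/") or target.path.startswith(base_path)
```

With `http://google.com/maps/` as the base, a URL such as `http://google.com/maps/place` would be kept, while `http://google.com/mail/` would be skipped.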

davidcx89 commented 5 years ago

I agree that it would be cool to have an "include" function in the crawler. 1kastner, I think c4software may have read your phrase "A bad example" the opposite way from what you meant.

1kastner commented 5 years ago

@davidcx89 Yep, sorry for the bad phrasing. I probably should have put more effort into describing the issue.

If I find the time, there might be a pull request sometime soon.

c4software commented 5 years ago

Hi,

An include pattern is indeed a great idea. Something with regex would be really great.

I will try to do this quickly. Maybe this weekend.
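As a rough illustration of the regex-based include idea, the sketch below combines it with an exclude filter. It is not the project's implementation: `filter_links` and its parameters are hypothetical names, and treating exclusions as plain substring matches is an assumption made for this example.

```python
import re


def filter_links(links, include_patterns=(), exclude_substrings=()):
    """Keep links that match at least one include regex and no exclude substring.

    Hypothetical helper for illustration only. An empty include_patterns
    means "include everything", so existing behaviour would be unchanged.
    """
    includes = [re.compile(pattern) for pattern in include_patterns]
    kept = []
    for link in links:
        # Exclusions win over inclusions (assumed to be substring matches).
        if any(fragment in link for fragment in exclude_substrings):
            continue
        # When include patterns are given, at least one must match.
        if includes and not any(rx.search(link) for rx in includes):
            continue
        kept.append(link)
    return kept


links = [
    "http://google.com/maps/a",
    "http://google.com/mail/b",
    "http://google.com/maps/edit",
]
only_maps = filter_links(links, [r"^http://google\.com/maps/"], ["edit"])
```

Here `only_maps` contains only the first link: the second fails the include pattern and the third is removed by the exclude filter. Unlike the exclude-only approach, the include pattern does not need an exhaustive list of every other subfolder.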