GateNLP / ultimate-sitemap-parser

Ultimate Website Sitemap Parser
https://mediacloud.org/

This site is not working => "set()" as result #9

Closed: chatelao closed this issue 5 years ago

chatelao commented 5 years ago

This site probably has an unusual format, or am I calling something incorrectly?

from usp.tree import sitemap_tree_for_homepage

tree = sitemap_tree_for_homepage('https://hls-dhs-dss.ch')
print(tree.all_pages())

The result:

2019-07-11 18:45:00,533 WARNING usp.tree [2344/MainThread]: Assuming that the homepage of https://hls-dhs-dss.ch is https://hls-dhs-dss.ch/
2019-07-11 18:45:00,534 INFO usp.fetchers [2344/MainThread]: Fetching level 0 sitemap from https://hls-dhs-dss.ch/robots.txt...
2019-07-11 18:45:00,534 INFO usp.helpers [2344/MainThread]: Fetching URL https://hls-dhs-dss.ch/robots.txt...
2019-07-11 18:45:00,821 INFO usp.fetchers [2344/MainThread]: Parsing sitemap from URL https://hls-dhs-dss.ch/robots.txt...
set()

Reading the robots.txt manually, I can see there are two layers of sitemap.xml files.
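
For reference, this is roughly how I checked it by hand (a rough sketch, not the library's code; it assumes the requests library and uses naive regexes instead of a real XML parser):

import re
import requests

# Pull the Sitemap: lines out of robots.txt.
robots_txt = requests.get('https://hls-dhs-dss.ch/robots.txt').text
sitemap_urls = re.findall(r'(?im)^\s*sitemap:\s*(\S+)', robots_txt)

# Each top-level entry points to a second layer of sitemap.xml files,
# which is the nesting mentioned above.
for sitemap_url in sitemap_urls:
    index_xml = requests.get(sitemap_url).text
    children = re.findall(r'<loc>(.*?)</loc>', index_xml)
    print(sitemap_url, '->', len(children), 'entries')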

pypt commented 5 years ago

Thanks, fixed in develop. Will probably release a new version soon.

chatelao commented 5 years ago

So cool, thanks a lot!

pypt commented 5 years ago

0.2 released.
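
To verify against the original example after upgrading (pip install --upgrade ultimate-sitemap-parser), something like this should now print the page URLs. A quick sketch, assuming the returned page objects expose a url attribute:

from usp.tree import sitemap_tree_for_homepage

# Same homepage as in the report above; iterate over the parsed pages
# and print their URLs instead of printing the container itself.
tree = sitemap_tree_for_homepage('https://hls-dhs-dss.ch')
for page in tree.all_pages():
    print(page.url)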

chatelao commented 5 years ago

Thanks a lot (I've got 0.3, right?)

chatelao commented 5 years ago

Did you see Google's release of the robots.txt parser?

https://opensource.googleblog.com/2019/07/googles-robotstxt-parser-is-now-open.html

pypt commented 5 years ago

Thanks, I'll take a look.

I think implementing a robots.txt parser is easy enough to do on one's own. The main takeaway from Google's implementation is that they tolerate both Sitemap: and Site-map: annotations.
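
Concretely, a tolerant matcher along those lines could look roughly like this (just an illustrative sketch, not the code actually used in this library):

import re

# Accept both "Sitemap:" and "Site-map:" directives, case-insensitively,
# which is what Google's parser is said to tolerate.
_SITEMAP_DIRECTIVE = re.compile(r'(?im)^\s*site-?map\s*:\s*(\S+)')

def sitemap_urls_from_robots_txt(robots_txt: str) -> list:
    """Return all sitemap URLs declared in a robots.txt body."""
    return _SITEMAP_DIRECTIVE.findall(robots_txt)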

openbankingproject-ch commented 5 years ago

The new parser works great; my wget job is running very well with the extracted data.