c4software / python-sitemap

Mini website crawler to make sitemap from a website.
GNU General Public License v3.0

No URLs found #57

Status: Open. Opened by exportio 5 years ago.

exportio commented 5 years ago

Number of found URL : 1
Number of links crawled : 1

python main.py --domain https://www.domain.com --output sitemap.xml --report

<urlset
      xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
      xmlns:image="http://www.google.com/schemas/sitemap-image/1.1"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
            http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">

</urlset>

c4software commented 5 years ago

Hi,

Interesting... I can't reproduce the problem here. Which Python version are you using?

[Screenshot, 2019-03-21 09:25:18]

c4software commented 5 years ago

@samboustani Is the problem still present?

GovetaXV commented 5 years ago

Same problem here.

GovetaXV commented 5 years ago

Try this URL: https://paperarchive.space/

c4software commented 5 years ago

@GovetaXV Hi,

Thanks for the link. Unfortunately, the current version of python-sitemap doesn't support "full JavaScript" websites, which is why paperarchive.space doesn't work.

Sorry.
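
To illustrate why a crawler that only reads the served HTML finds nothing on a "full JavaScript" site, here is a minimal sketch. It is not python-sitemap's actual code: the `LinkCollector` class, the `extract_links` helper, and both sample pages are hypothetical and made up for this example. The point is only that links injected by a script at runtime never appear in the markup a non-rendering crawler sees.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags in served HTML (no JavaScript is run)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only anchors present in the raw markup are visible here.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return every <a href> found in an HTML string (hypothetical helper)."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


# Hypothetical pages, invented for illustration only.
STATIC_PAGE = (
    '<html><body><a href="/about">About</a> '
    '<a href="/contact">Contact</a></body></html>'
)
JS_PAGE = (
    '<html><body><div id="app"></div>'
    '<script src="/bundle.js"></script></body></html>'
)

print(extract_links(STATIC_PAGE))  # ['/about', '/contact']
print(extract_links(JS_PAGE))      # []
```

Crawling such a site would presumably require rendering the page in a headless browser first, which is outside what this sketch covers.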

ishannaktode commented 5 years ago

+1. Same issue, and no error log.

mgifford commented 4 years ago

This looked pretty promising, but it didn't work for me either, and canada.ca isn't a full headless site by any means.

$ python3 main.py --domain https://canada.ca --output sitemap.xml --report
Number of found URL : 1
Number of links crawled : 1
$ cat sitemap.xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset
      xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
      xmlns:image="http://www.google.com/schemas/sitemap-image/1.1"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
            http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">

</urlset>

But maybe this helps.

$ python3 main.py --domain https://canada.ca --output sitemap.xml --debug
INFO:root:Start the crawling process
INFO:root:Crawling #0: https://canada.ca
DEBUG:root:https://canada.ca ==> <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:749)>
INFO:root:Crawling has reached end of all found links
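
The debug log above shows the first request failing TLS certificate verification, so the crawler never receives any HTML to extract links from. The following is a minimal sketch of how such a failure could be detected and reported explicitly instead of only at DEBUG level. It is an assumption-laden illustration, not python-sitemap's code: `describe_fetch_error`, `fetch`, and the label strings are hypothetical names. Only standard-library calls (`urllib.request.urlopen`, `ssl.create_default_context`) are used.

```python
import ssl
import urllib.error
import urllib.request


def describe_fetch_error(exc):
    """Return a short label for a failed fetch (hypothetical helper)."""
    # URLError wraps the underlying cause in its .reason attribute.
    reason = getattr(exc, "reason", exc)
    if isinstance(reason, ssl.SSLError) and "CERTIFICATE_VERIFY_FAILED" in str(reason):
        return "tls-certificate-not-trusted"
    if isinstance(exc, urllib.error.HTTPError):
        return "http-error-%d" % exc.code
    return "other-error"


def fetch(url, timeout=10):
    """Fetch a URL with default certificate verification.

    Returns (body, None) on success or (None, error_label) on failure.
    """
    context = ssl.create_default_context()
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=context) as response:
            return response.read(), None
    except (urllib.error.URLError, ssl.SSLError) as exc:
        return None, describe_fetch_error(exc)


# Reproduce the classification offline, using the message from the log above.
logged = urllib.error.URLError(
    ssl.SSLError(1, "[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:749)")
)
print(describe_fetch_error(logged))  # tls-certificate-not-trusted
```

If the label is `tls-certificate-not-trusted`, the likely cause is that the local Python installation has no usable CA bundle rather than a problem with the site. One common cause on macOS is a python.org installer build where the bundled `Install Certificates.command` script has not been run, though that is only a guess for this report.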