SAML-Toolkits / python3-saml


Some urls come with 403 response #364

Closed galynazholtkevych closed 1 year ago

galynazholtkevych commented 1 year ago

I see that python3-saml uses urllib in the `get_metadata` classmethod of the `OneLogin_Saml2_IdPMetadataParser` class. Some URLs go through this method fine and we can access the data in our code.

However, we have problems with some public URLs and I do not see an obvious reason for the denial. For some reason the urllib request being made is rejected for those URLs unless a `User-Agent` header is set; with a `User-Agent` it works. It may be ModSecurity on the server checking whether the request comes from a bot. The same URLs are fetched fine with curl and with the requests library, however.

Could you please handle such 403 errors in some way, so we can get this fixed on our side?

My example that worked:

```python
from urllib.request import Request, urlopen

# Adding a User-Agent header makes the otherwise-403 URL respond normally.
request = Request('https://some-url.com', headers={'User-Agent': 'Mozilla/5.0'})
response = urlopen(request)
```
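
For comparison, a minimal sketch (not part of my original snippet) of the same fetch with the requests library, which sends its own default `User-Agent` header and so does not trigger the 403; the URL is only a placeholder:

```python
import requests

# requests sends a default User-Agent (python-requests/x.y.z), so the server's
# bot filter does not reject the call. The URL below is a placeholder.
response = requests.get('https://some-url.com/metadata', timeout=10)
response.raise_for_status()
metadata_xml = response.text
```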

Thanks!

tongshen-yong commented 1 year ago

Hi @galynazholtkevych. May I know how you got it working so I can get a hotfix going? Did you change this snippet?

```python
@classmethod
def get_metadata(cls, url, validate_cert=True, timeout=None):
    """
    Gets the metadata XML from the provided URL

    :param url: Url where the XML of the Identity Provider Metadata is published.
    :type url: string

    :param validate_cert: If the url uses https schema, that flag enables or not
                          the verification of the associated certificate.
    :type validate_cert: bool

    :param timeout: Timeout in seconds to wait for metadata response
    :type timeout: int

    :returns: metadata XML
    :rtype: string
    """
```

galynazholtkevych commented 1 year ago

@tongshen-yong Thanks for reacting quickly!

I was actually going to create a PR to your repository and start working on that, but if you can help with a hotfix it will certainly be faster. So basically yes, you got it right: I wrapped the URL into a `Request` object. Putting some details here.

I propose to modify the `parse_remote` and `get_metadata` classmethods as below:

1. `parse_remote`:

```python
@classmethod
def parse_remote(cls, url, validate_cert=True, entity_id=None, timeout=None, **kwargs):
    """
    Gets the metadata XML from the provided URL and parse it, returning a dict with extracted data

    :param url: Url where the XML of the Identity Provider Metadata is published.
    :type url: string

    :param validate_cert: If the url uses https schema, that flag enables or not
                          the verification of the associated certificate.
    :type validate_cert: bool

    :param entity_id: Specify the entity_id of the EntityDescriptor that you want to parse a XML
                      that contains multiple EntityDescriptor.
    :type entity_id: string

    :param timeout: Timeout in seconds to wait for metadata response
    :type timeout: int

    :returns: settings dict with extracted data
    :rtype: dict
    """
    idp_metadata = cls.get_metadata(url, validate_cert, timeout, headers=kwargs.pop('headers', None))
    return cls.parse(idp_metadata, entity_id=entity_id, **kwargs)
```

^^ This passes `headers` through to the `get_metadata` method, since we use `parse_remote` to fetch the URL data; it is basically a one-line change forwarding `headers` to `get_metadata` (see the usage sketch near the end of this comment).

And

2. `get_metadata`:

```python
@classmethod
def get_metadata(cls, url, validate_cert=True, timeout=None, headers=None):
    """
    Gets the metadata XML from the provided URL

    :param url: Url where the XML of the Identity Provider Metadata is published.
    :type url: string

    :param validate_cert: If the url uses https schema, that flag enables or not
                          the verification of the associated certificate.
    :type validate_cert: bool

    :param timeout: Timeout in seconds to wait for metadata response
    :type timeout: int

    :returns: metadata XML
    :rtype: string
    """
    valid = False
    request = urllib2.Request(url, headers=headers or {})
    if validate_cert:
        response = urllib2.urlopen(request, timeout=timeout)
    else:
        ctx = ssl.create_default_context()
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
        response = urllib2.urlopen(request, context=ctx, timeout=timeout)
```

Changes here: a `headers` parameter was added to the signature and docstring, and, most importantly for the fix, I wrapped the URL and headers into a `Request` object and passed that object to `urlopen`.
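
For illustration, a rough caller-side sketch assuming the modified `parse_remote` above; the URL and header values are placeholders:

```python
from onelogin.saml2.idp_metadata_parser import OneLogin_Saml2_IdPMetadataParser

# Assuming parse_remote forwards the `headers` kwarg to get_metadata as
# proposed above; the URL and header values are placeholders.
idp_data = OneLogin_Saml2_IdPMetadataParser.parse_remote(
    'https://some-url.com/metadata',
    validate_cert=True,
    timeout=10,
    headers={'User-Agent': 'Mozilla/5.0'},
)
# idp_data is the settings dict with the extracted IdP data.
```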

Thanks! Please let me know if any help is needed, including opening a PR. Have a nice day! Regards, Halyna

tongshen-yong commented 1 year ago

Hi @galynazholtkevych. I'm actually not actively working on this repo. Perhaps you can open a PR :)

galynazholtkevych commented 1 year ago

Core reviewers, you are welcome to review my PR to python3-saml: https://github.com/SAML-Toolkits/python3-saml/pull/366

pitbulk commented 1 year ago

PR merged, thanks for contributing.

galynazholtkevych commented 1 year ago

@pitbulk thank you! May I know when the new release with this fix will be published?

akira commented 1 year ago

+1 would be great to have a release as I'm also running into issues with this, thanks 🙏 @pitbulk