IdentityPython / pyFF

SAML metadata aggregator
https://pyff.io/

Poor performance #191

Closed peppelinux closed 4 years ago

peppelinux commented 4 years ago

Problem

The first query to pyFF takes roughly 4 to 8 seconds.

Commands used

pyffd -p pyff.pid -f -a --loglevel=DEBUG --dir=`pwd` -H 0.0.0.0 -P 8001 --frequency=180 --no-caching pipelines/garr.fd

A production example with RedisWhooshStore

PYFF_STORE_CLASS=pyff.store:RedisWhooshStore pyffd -p pyff.pid -f -a --dir=`pwd` -H 0.0.0.0 -P 8001  pipelines/garr.fd

Using gunicorn

gunicorn --reload --reload-extra-file pipelines/garr.fd --preload --bind 0.0.0.0:8001 -t 600 -e PYFF_PIPELINE=pipelines/garr.fd -e PYFF_STORE_CLASS=pyff.store:RedisWhooshStore -e PYFF_UPDATE_FREQUENCY=600 -e PYFF_PORT=8001 --threads 4 --worker-tmp-dir=/dev/shm --worker-class=gthread pyff.wsgi:app

Consequences

I'll continue using pyFF, but only as a metadata downloader and validator. I cannot use this approach with Shibboleth entities because, according to draft-young-md-query, these are queried in their natural (URL-encoded) form, like

/entities/https:%2F%2Fcoco.release-check.edugain.org%2Fshibboleth

PySAML2 continues to use the {sha1} format. Because of this, I wrote this in a few lines of code: https://github.com/UniversitaDellaCalabria/Django-MDQ

It simply works. Given more time I'll probably implement the aggregated entities/ endpoint and the validUntil definition, but at the moment I don't need them for a local MDQ serving HA Shibboleth IdP replicas and other PySAML2 entities.

Idea

Wouldn't it be better to decouple the MDQ service from pyff? Django-MDQ works on top of pyFF, more like a module.

leifj commented 4 years ago
  1. I don't see any such performance problem when I run my own tests with edugain.fd, but if you can provide me with your garr pipeline I can try to reproduce the behaviour you are seeing.
  2. Regardless, it might be better to produce a mirrored MDQ instance for large-scale production environments. This approach is described in the documentation on readthedocs.io.
peppelinux commented 4 years ago
  1. I don't see any such performance problem when I run my own tests with edugain.fd, but if you can provide me with your garr pipeline I can try to reproduce the behaviour you are seeing.

That's the configuration https://github.com/ConsortiumGARR/idem-tutorials/blob/master/idem-community/HOWTO-Shibboleth/Identity-Provider/HOWTO%20Setup%20a%20MDX%20Server%20and%20configure%20it%20in%20Shibboleth%20IdP.md#configure-pyff

The problem is some random errors that expose status 500 instead of metadata to the requesters. They also have their own cache, so they will return a missing result for hours. On top of this, the first query takes 7.5 seconds, which is a problem in a web context.

  2. Regardless, it might be better to produce a mirrored MDQ instance for large-scale production environments. This approach is described in the documentation on readthedocs.io.

Got it. I'm using a static serve with nginx, as also proposed for the documentation here, but that was closed.

This static serve doesn't work with Shibboleth IdP requests that have the clear (URL-encoded) entityID in the URL path, so I developed Django-MDQ.

leifj commented 4 years ago


The problem is some random errors that expose status 500 instead of metadata to the requesters. They also have their own cache, so they will return a missing result for hours. On top of this, the first query takes 7.5 seconds, which is a problem in a web context.

If you are talking about 503 - that isn't random at all. By design pyFF returns 503 (temporarily unavailable) when it is loading, because this is how you inform a proxy/client to come back later. Clients that cache 503s are not obeying the HTTP spec.
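For illustration, a minimal sketch of a client that handles 503 the way the spec intends, honouring Retry-After and retrying instead of caching the failure (hypothetical code, not taken from Shibboleth or pyFF):

import time
import requests

def fetch_metadata(url, max_retries=5):
    for _ in range(max_retries):
        resp = requests.get(url, headers={"Accept": "application/samlmetadata+xml"})
        if resp.status_code == 503:
            # Come back later, as the server asked; assumes a delta-seconds Retry-After.
            time.sleep(int(resp.headers.get("Retry-After", "5")))
            continue
        resp.raise_for_status()
        return resp.content
    raise RuntimeError("metadata still unavailable after retries")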

Got it. I'm using a static serve with nginx, as also proposed for the documentation here, but that was closed.

There are two supported methods for producing a static copy using pyFF.

This static serve doesn't work with Shibboleth IdP requests that have the clear entityID in the URL path.

I don’t understand what this means. Can you provide an example?


leifj commented 4 years ago


I notice that your instructions don't mention running a local redis instance - the RedisWhooshStore assumes a local redis on the default port. Is this running in your env?
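A quick way to verify that assumption (hypothetical snippet, requires the redis Python client, not part of pyFF):

import redis

# Defaults to localhost:6379; raises redis.exceptions.ConnectionError if nothing is listening.
redis.Redis().ping()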

peppelinux commented 4 years ago

First of all, thank you for your time, I really appreciate it.

If you are talking about 503 - that isn't random at all. By design pyFF returns 503 (temporarily unavailable) when it is loading, because this is how you inform a proxy/client to come back later. Clients that cache 503s are not obeying the HTTP spec.

Yes about the 503, but yes-and-no about the Shibboleth IdP behaviour. Once in production I cannot question the design of Shibboleth; I just need something that works in every case. If pyFF already has downloaded metadata it should serve it; if the metadata is available but a little bit old, that doesn't mean pyFF should take a pause. The availability of the information is the primary goal; nobody cares what pyFF is doing internally. If I deploy a service it must be highly available in as many ways as possible. During these 503 errors the SP cannot make authn requests to the IdP, as if the SP had never been federated. This is a problem.

There are two supported methods for producing a static copy using pyFF.

Nothing can be faster than an nginx static serve: 36 ms against more than a second with pyFF. This is a fact. I would have to recompile nginx with a Lua or Python module to reproduce what Django-MDQ is doing. It would be technically interesting. This would be needed because of pyFF's storage, which uses sha1 naming and not clear naming. I prefer sha1 too, but draft-young-md-query-12 tells us something different and Shibboleth uses clear naming. Bad news, but carry on ...

I don’t understand what this means. Can you provide an example?

Asking for https://coco.release-check.edugain.org/shibboleth from a PySAML2 entity, the MDQ query is built this way: http://mdq.url/entities/{sha1}79d76b6b9ff8536c15679b446aba05c1c529f44a

The clear naming, used by default in Shibboleth products, instead uses the URL-encoded entityID, like this: http://mdq.url/entities/https%3A%2F%2Fcoco.release-check.edugain.org%2Fshibboleth
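For reference, a short sketch of how the two request forms can be derived from the same entityID, following the SHA-1 transform in the MDQ draft (illustrative code, not PySAML2 or Shibboleth internals):

import hashlib
from urllib.parse import quote

def mdq_urls(base, entity_id):
    # Transformed identifier: "{sha1}" + hex SHA-1 of the entityID bytes (the PySAML2 style).
    sha1_id = "{sha1}" + hashlib.sha1(entity_id.encode("utf-8")).hexdigest()
    # Clear identifier: the entityID itself, percent-encoded (the Shibboleth style).
    clear_id = quote(entity_id, safe="")
    return base + "/entities/" + sha1_id, base + "/entities/" + clear_id

print(mdq_urls("http://mdq.url", "https://coco.release-check.edugain.org/shibboleth"))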

I notice that your instructions don't mention running a local redis instance - the RedisWhooshStore assumes a local redis on the default port. Is this running in your env?

Yes, Redis is running in the background; otherwise pyffd stops right at startup with a connection error to redis-server.

leifj commented 4 years ago
  1. I suggest you (or me) talk to the shib folks about not caching 503s, because that is absolutely the wrong behavior and I have no doubt they would agree.

  2. I don't think pyFF should be pausing before returning responses, especially when it has access to valid metadata, and I have not seen that behavior myself. I will investigate using your examples!

  3. pyFF doesn't care how you serve a static copy - nginx, lighttpd, apache, all will work! I was talking about producing static filetrees from pyFF. There are two methods: either use the mirror-mdq.sh script in 'scripts' or use the new 'map' pipe which lets you loop over all your entities in batch mode. There is an example in 'examples/batch-mdq-loop.fd' in HEAD for using map to produce an offline signed tree along with json files for discovery.

  4. Thx for your explanation of clear vs sha1 (incidentally I'd use the term urlencoded names, but never mind). If you use the 'map' batch method above you can easily accommodate both representations. The map-batch method gives you a bit more control, as it allows you to select file extensions etc. and can be used to produce links that make both sha1 and urlencoded names work equally.
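To illustrate that last point, a rough sketch of adding the transformed names next to the URL-encoded ones in a pre-generated static tree (hypothetical layout and path, not the actual 'map' pipe or mirror-mdq.sh):

import hashlib
import os
from urllib.parse import unquote

ENTITIES_DIR = "./static/entities"  # assumed output directory of a batch run

for name in os.listdir(ENTITIES_DIR):
    if name.startswith("{sha1}"):
        continue
    entity_id = unquote(name)  # assumes each file is named after the percent-encoded entityID
    sha1_name = "{sha1}" + hashlib.sha1(entity_id.encode("utf-8")).hexdigest()
    target = os.path.join(ENTITIES_DIR, sha1_name)
    if not os.path.exists(target):
        os.link(os.path.join(ENTITIES_DIR, name), target)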

peppelinux commented 4 years ago

OK I will, thank you for the ack. Lastly:

  1. I used the git master branch.
  2. Another random exception happens regarding lxml tree validation. I don't have a terminal at hand, but I'll copy it here when available. When this internal exception happens, pyFF stops giving MDQ results until the next refresh.
  3. Seen https://github.com/IdentityPython/pyFF/blob/master/examples/batch-mdq-loop.fd. Please consider comprehensive documentation of this operating mode, with a description of the pipeline statements in it, as mandatory. The official documentation is the key.

leifj commented 4 years ago

OK I will, thank you for the ack. Lastly:

  1. I used the git master branch

thx!

  2. Another random exception happens regarding lxml tree validation. I don't have a terminal at hand, but I'll copy it here when available. When this internal exception happens, pyFF stops giving MDQ results until the next refresh.

That does sound like a bug. Looking forward to more detail.

  3. Seen https://github.com/IdentityPython/pyFF/blob/master/examples/batch-mdq-loop.fd. Please consider comprehensive documentation of this operating mode, with a description of the pipeline statements in it, as mandatory. The official documentation is the key.

Yep. It is still very new.

peppelinux commented 4 years ago

That's my example configuration https://github.com/peppelinux/Satosa-saml2saml/tree/master/pyFF_example

Well, the first query always takes 9 seconds! ... going to test batch-mdq-loop ...

peppelinux commented 4 years ago

Ok, I tried to use batch-mdq-loop.fd. That's my configuration

I first run pyff pipelines/batch-mdq-loop-edugain.fd and then pyffd -p pyff.pid -f -a --dir=`pwd` --loglevel=DEBUG -H 0.0.0.0 -P 8001 pipelines/batch-mdq-loop-edugain.fd

In the debug log, on an MDQ HTTP request from the Shibboleth IdP, I can see the following 404 error:

[08/Feb/2020:11:55:50]  MDServer pfx=None, path={base64}aHR0cHM6Ly9jb2NvLnJlbGVhc2UtY2hlY2suZWR1Z2Fpbi5vcmcvc2hpYmJvbGV0aA==, content_type=application/xml
DEBUG:cherrypy.error.139934034605280:[08/Feb/2020:11:55:50]  MDServer pfx=None, path={base64}aHR0cHM6Ly9jb2NvLnJlbGVhc2UtY2hlY2suZWR1Z2Fpbi5vcmcvc2hpYmJvbGV0aA==, content_type=application/xml

[...]

10.0.3.101 - - [08/Feb/2020:11:55:50] "GET /entities/https:%2F%2Fcoco.release-check.edugain.org%2Fshibboleth HTTP/1.1" 404 6025 "" "ShibbolethIdp/3.4.6 OpenSAML/3.4.5"
INFO:cherrypy.access.139934034605280:10.0.3.101 - - [08/Feb/2020:11:55:50] "GET /entities/https:%2F%2Fcoco.release-check.edugain.org%2Fshibboleth HTTP/1.1" 404 6025 "" "ShibbolethIdp/3.4.6 OpenSAML/3.4.5"

I resolved this 404 problem by adding this to the pipeline:

- select
- store:
     directory: ./garr

So, that's the log:

10.0.3.101 - - [08/Feb/2020:12:17:19] "GET /entities/https:%2F%2Fcoco.release-check.edugain.org%2Fshibboleth HTTP/1.1" 200 - "" "ShibbolethIdp/3.4.6 OpenSAML/3.4.5"
INFO:cherrypy.access.140163049642080:10.0.3.101 - - [08/Feb/2020:12:17:19] "GET /entities/https:%2F%2Fcoco.release-check.edugain.org%2Fshibboleth HTTP/1.1" 200 - "" "ShibbolethIdp/3.4.6 OpenSAML/3.4.5"

but the content is not a valid SAML XML response, it's more of an error ...

peppelinux commented 4 years ago

The random exceptions are caused by typos in the requested entity name (a lookup that matches nothing apparently ends up as an empty EntitiesDescriptor, which then fails schema validation).

This produces the following in the pyFF logs:

<string>:0:0:ERROR:SCHEMASV:SCHEMAV_ELEMENT_CONTENT: Element '{urn:oasis:names:tc:SAML:2.0:metadata}EntitiesDescriptor': Missing child element(s). Expected is one of ( {http://www.w3.org/2000/09/xmldsig#}Signature, {urn:oasis:names:tc:SAML:2.0:metadata}Extensions, {urn:oasis:names:tc:SAML:2.0:metadata}EntityDescriptor, {urn:oasis:names:tc:SAML:2.0:metadata}EntitiesDescriptor ).
XML schema validation failed

and this in the HTTP response to the client:

  <h1>500 Internal Server Error</h1>
  The server has either erred or is incapable of performing the requested operation.<br/><br/>
peppelinux commented 4 years ago

That's the result with Django-MDQ: the first request from Shibboleth IdP and the second from PySAML2.

(screenshot: Selezione_802)

It's really easy to test, just

git clone https://github.com/UniversitaDellaCalabria/Django-MDQ.git
cd  Django-MDQ
pip install -r requirements.txt

# configure the pyff store path and rename it to settingslocal.py 
nano django_mdq/settingslocal.py.example

./manage.py runserver 0.0.0.0:8001

If you have time to take a look at this, we could develop this kind of decoupled service. It works out of the box. If you can give me a working and safe example, ready for production and with adequate performance, I'll admit that Django-MDQ is quite useless.

Thank you for your time and never give up :)

leifj commented 4 years ago

Batch mode is not for pyffd!

peppelinux commented 4 years ago

Batch mode is not for pyffd!

My apologies, I just need an MDQ service with responses available within 2 seconds. You're the master, I'll follow you, let me see.

leifj commented 4 years ago

The problem is with your pipeline (garr.fd) which forces a reload every time the server starts. Try using the "when update" pattern from examples/mdq.fd. That way the update will be run in the update process only as needed (when you run with a persistent backing store) and you should not see any delay until your server can start to answer queries.

peppelinux commented 4 years ago

Thank you leifj, ack.

Correct me if I'm wrong. For a production-ready and safe environment with particular attention to performance, we should adopt the following:

  1. A scheduled pyFF batch that populates a shared storage/path. It's only a downloader/validator, no publishing tasks.
  2. One or more gunicorn daemons, hopefully in HA, that don't download or validate anything and just publish the MDX service using the entities shared through the network storage.

Could this be the production setup?

peppelinux commented 4 years ago

The problem is with your pipeline (garr.fd) which forces a reload every time the server starts. Try using the "when update" pattern from examples/mdq.fd. That way the update will be run in the update process only as needed (when you run with a persistent backing store) and you should not see any delay until your server can start to answer queries.

Using this command

gunicorn --reload --reload-extra-file pipelines/garr.fd --preload --bind 0.0.0.0:8001 -t 600 -e PYFF_PIPELINE=pipelines/garr.fd -e PYFF_STORE_CLASS=pyff.store:RedisWhooshStore -e PYFF_UPDATE_FREQUENCY=600 -e PYFF_PORT=8001 --threads 4 --worker-tmp-dir=/dev/shm --worker-class=gthread pyff.wsgi:app

With this pipeline:

- when update:
    - load xrd garr-loaded.xrd:
      - ./pipelines/garr.xrd
    - select
    - store:
         directory: ./garr
    - publish:
         output: ./garr/garr-loaded.xml
    - stats
    - break
# MDX server
- when request:
    - select
    - pipe:
        - when accept application/xml:
            - xslt:
                stylesheet: tidy.xsl
            - first
            - finalize:
               cacheDuration: PT5H
               validUntil: P10D
            - sign:
                key: certificates/private.key
                cert: certificates/public.cert
            - emit application/xml
            - break
        - when accept application/json:
            - xslt:
                stylesheet: discojson.xsl
            - emit application/json:
            - break

The returned metadata appears as if it were not signed ... even after removing RedisWhooshStore ...

org.opensaml.saml.metadata.resolver.filter.FilterException: Metadata root element was unsigned and signatures are required.

Using this command instead, the metadata is signed and returned in 1.360 s (good enough for this topic!):

pyffd -p pyff.pid -f -a --dir=`pwd` --loglevel=DEBUG -H 0.0.0.0 -P 8001  pipelines/garr.fd 

I'd prefer to go to production with gunicorn, you know. I'm sure we can solve this mystery of the missing signature with gunicorn.

leifj commented 4 years ago

Thank you leifj, ack.

Correct me if I'm wrong. For a production-ready and safe environment with particular attention to performance, we should adopt the following:

  1. A scheduled pyFF batch that populates a shared storage/path. It's only a downloader/validator, no publishing tasks.

There are several ways of doing this, as I explained in an earlier comment: run a batch, or run pyffd and export with mirror-mdq.sh. I have had the latter in production, producing md.swedenconnect.se, for a couple of years now.

  2. One or more gunicorn daemons, hopefully in HA, that don't download or validate anything and just publish the MDX service using the entities shared through the network storage.

Could this be the production setup?

As I said ... I have both versions in various stages of production.

leifj commented 4 years ago

The returned metadata appears as if it were not signed ... even after removing RedisWhooshStore ...

Which metadata was not signed exactly? The garr-loaded.xml or the metadata you get by talking to the MDQ endpoint?

peppelinux commented 4 years ago

Which metadata was not signed exactly? The garr-loaded.xml or the metadata you get by talking to the MDQ endpoint?

The latter; the root element is not signed, as Shibboleth IdP tells me. This only happens if I run with gunicorn; with pyffd it's fine.

leifj commented 4 years ago

Do you get signed results if you run examples/edugain-mdq.fd using pyffd or gunicorn (if you modify the key and cert files, that is)?

peppelinux commented 4 years ago

edugain-mdq.fd

Sorry for the delay. Using this pipeline:

- when update:
  - load:
    - https://mds.edugain.org
- when request:
  - select:
  - pipe:
    - when accept application/xml:
      - first
      - finalize:
          cacheDuration: PT12H
          validUntil: P10D
      - sign:
          key: certificates/private.key
          cert: certificates/public.cert
      - emit application/xml
      - break
    - when accept application/json:
      - discojson
      - emit application/json
      - break

With this command

gunicorn --reload --reload-extra-file pipelines/edugain-mdq.fd --preload --bind 0.0.0.0:8001 -t 600 -e PYFF_PIPELINE=pipelines/edugain-mdq.fd  -e PYFF_STORE_CLASS=pyff.store:RedisWhooshStore -e PYFF_UPDATE_FREQUENCY=600 -e PYFF_PORT=8001 --threads 4 --worker-tmp-dir=/dev/shm --worker-class=gthread pyff.wsgi:app --log-config debug.ini

I have the same result on Shibboleth IdP:

2020-02-13 09:49:25,090 - 127.0.0.1 - ERROR [org.opensaml.saml.metadata.resolver.impl.AbstractDynamicMetadataResolver:882] - Metadata Resolver FunctionDrivenDynamicHTTPMetadataResolver DynamicEntityMetadata: Metadata filtering problem processing new metadata
org.opensaml.saml.metadata.resolver.filter.FilterException: Metadata root element was unsigned and signatures are required.
    at org.opensaml.saml.metadata.resolver.filter.impl.SignatureValidationFilter.filter(SignatureValidationFilter.java:242)
2020-02-13 09:49:25,092 - 127.0.0.1 - INFO [org.opensaml.saml.common.binding.impl.SAMLMetadataLookupHandler:129] - Message Handler:  No metadata returned for https://coco.release-check.edugain.org/shibboleth in role {urn:oasis:names:tc:SAML:2.0:metadata}SPSSODescriptor with protocol urn:oasis:names:tc:SAML:2.0:protocol

With this command

/opt/shibboleth-idp/bin/mdquery.sh -e https://coco.release-check.edugain.org/shibboleth --saml2 -u http://localhost:8080/idp

Using pyffd instead:

pyffd -p pyff.pid -f -a --dir=`pwd` --loglevel=DEBUG -H 0.0.0.0 -P 8001  pipelines/edugain-mdq.fd 

I got this:

10.0.3.101 - - [13/Feb/2020:10:53:59] "GET /entities/https:%2F%2Fcoco.release-check.edugain.org%2Fshibboleth HTTP/1.1" 503 812 "" "ShibbolethIdp/3.4.6 OpenSAML/3.4.5"
INFO:cherrypy.access.139986651399280:10.0.3.101 - - [13/Feb/2020:10:53:59] "GET /entities/https:%2F%2Fcoco.release-check.edugain.org%2Fshibboleth HTTP/1.1" 503 812 "" "ShibbolethIdp/3.4.6 OpenSAML/3.4.5"
[13/Feb/2020:10:54:41] TOOLS.CACHING Reading response from cache

Modifying the pipeline this way:

- when update:
    - load:
        - https://mds.edugain.org
    - select
    - store:
         directory: ./garr
    - publish:
         output: ./garr/garr-loaded.xml
    - stats
- when request:
  - select:
  - pipe:
    - when accept application/xml:
      - first
      - finalize:
          cacheDuration: PT12H
          validUntil: P10D
      - sign:
          key: certificates/private.key
          cert: certificates/public.cert
      - emit application/xml
      - break
    - when accept application/json:
      - discojson
      - emit application/json
      - break

I get the metadata, signed, with pyffd. But not with gunicorn. Hope this helps.

leifj commented 4 years ago

I believe the issue is related to MIME types. Can you find a way to figure out which MIME-type is used when shibd is requesting metadata?

peppelinux commented 4 years ago

These are the Shibboleth IdP HTTP request headers:

HTTP_ACCEPT 'application/samlmetadata+xml, application/xml, text/xml'
PATH_INFO /entities/https://coco.release-check.edugain.org/shibboleth
HTTP_USER_AGENT ShibbolethIdp/3.4.6 OpenSAML/3.4.5
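A scripted way to replay that request against the gunicorn instance and inspect what comes back (hypothetical snippet using python-requests; host and port taken from the commands above):

import requests

resp = requests.get(
    "http://localhost:8001/entities/https%3A%2F%2Fcoco.release-check.edugain.org%2Fshibboleth",
    headers={"Accept": "application/samlmetadata+xml, application/xml, text/xml"},
)
print(resp.status_code, resp.headers.get("Content-Type"), len(resp.content))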

The problem is that gunicorn doesn't return any content at all ... checked with a simple HTTP request from a web browser.

A mystery that will be solved sooner or later.

leifj commented 4 years ago


Thx! I will investigate. I'm pretty sure we are looking at an issue related to MIME type negotiation.

peppelinux commented 4 years ago

OK, I got satisfaction from this thread and also from emails with Scott and Rainer, so I think we can close this issue if you agree.

I get the same performance with pyffd that I had with Django-MDQ. The key was the 'when update' statement in the pipeline.

I have some LXC containers for testing (shib and slapd). I can share these with you to speed up tests. You have my email.