Corriger l’erreur 500 à la première utilisation des review apps

n-b commented 2 years ago

À la première pose de Rdv sur une review app, on obtient une erreur 500. En redémarrant le container web sur scalingo, ça se met à fonctionner. C’est systématique ou presque; il ne semble pas y avoir d’autre problème de ce type sur d’autres vues; et ça fonctionne aussi systématiquement après un redémarrage.

Sur sentry.

Les logs:

2022-02-09 18:41:10.368699520 +0100 CET [web-1] I, [2022-02-09T17:41:10.368120 #16] INFO -- : [7302470a-0fd5-4c39-97b8-e849350bf6f9] Started GET "/admin/organisations/1/rdv_wizard_step/new?starts_at=2022-02-10T09%3A30%3A00%2B01%3A00&agent_ids%5B%5D=1&plage_ouverture_location=126+Avenue+Simon+Bolivar%2C+75019%2C+Paris" for 93.2.146.155 at 2022-02-09 17:41:10 +0000
2022-02-09 18:41:10.377421469 +0100 CET [web-1] I, [2022-02-09T17:41:10.376907 #16] INFO -- : [7302470a-0fd5-4c39-97b8-e849350bf6f9] Processing by Admin::RdvWizardStepsController#new as HTML
2022-02-09 18:41:10.377428563 +0100 CET [web-1] I, [2022-02-09T17:41:10.377009 #16] INFO -- : [7302470a-0fd5-4c39-97b8-e849350bf6f9] Parameters: {"starts_at"=>"2022-02-10T09:30:00+01:00", "agent_ids"=>["1"], "plage_ouverture_location"=>"126 Avenue Simon Bolivar, 75019, Paris", "organisation_id"=>"1"}
2022-02-09 18:41:10.399650816 +0100 CET [web-1] I, [2022-02-09T17:41:10.399147 #16] INFO -- : [7302470a-0fd5-4c39-97b8-e849350bf6f9] Completed 500 Internal Server Error in 22ms (ActiveRecord: 3.2ms | Allocations: 6041)
2022-02-09 18:41:10.401826991 +0100 CET [web-1] F, [2022-02-09T17:41:10.401436 #16] FATAL -- : [7302470a-0fd5-4c39-97b8-e849350bf6f9]
2022-02-09 18:41:10.401834759 +0100 CET [web-1] [7302470a-0fd5-4c39-97b8-e849350bf6f9] NoMethodError (super: no superclass method `starts_at=' for #<Rdv >):
2022-02-09 18:41:10.401835812 +0100 CET [web-1] [7302470a-0fd5-4c39-97b8-e849350bf6f9]
2022-02-09 18:41:10.401836831 +0100 CET [web-1] [7302470a-0fd5-4c39-97b8-e849350bf6f9] app/models/rdv.rb:97:in `starts_at='
2022-02-09 18:41:10.401853442 +0100 CET [web-1] [7302470a-0fd5-4c39-97b8-e849350bf6f9] app/form_models/concerns/admin/rdv_wizard_form_concern.rb:30:in `initialize'
2022-02-09 18:41:10.401856546 +0100 CET [web-1] [7302470a-0fd5-4c39-97b8-e849350bf6f9] app/controllers/admin/rdv_wizard_steps_controller.rb:49:in `new'
2022-02-09 18:41:10.401857396 +0100 CET [web-1] [7302470a-0fd5-4c39-97b8-e849350bf6f9] app/controllers/admin/rdv_wizard_steps_controller.rb:49:in `rdv_wizard_for'
2022-02-09 18:41:10.401858070 +0100 CET [web-1] [7302470a-0fd5-4c39-97b8-e849350bf6f9] app/controllers/admin/rdv_wizard_steps_controller.rb:13:in `new'

n-b commented 2 years ago

Je soupçonnais un problème d’accès à la base, une race condition dans laquelle le serveur PG ne serait pas encore prêt, mais en fait on est déjà connecté en tant qu’Agent à ce moment. On a donc déjà accédé à la base.

yaf commented 2 years ago

Est-ce qu'on tente un échange avec le support de Scalingo ou bien ce n'est pas nécessaire ? :thinking:

Est-ce que c'est lié à des PR où il y a des « migrations » ? Est-ce que ça passerait si on mettait dans le webhook un délai avant l'exécution ?

Je pose toutes ces questions ici pour les partager, mais je me dis que je pourrais aussi tenter des trucs... Ça mériterait de regarder comment reproduire à coup sûr.

n-b commented 2 years ago

Je crois que c’est sytématique. Dans #2131, il n’y a pas de migrations.

victormours commented 2 years ago

J'ai fait des essais sur cette histoire de restart, et j'ai l'impression que la piste la plus crédible serait de lancer un job de ScalingoAppRestarter de manière asynchrone via le first-script du fichier scalingo.json. Un truc de ce genre, mais avec un delayed job : https://github.com/betagouv/rdv-solidarites.fr/pull/2258/files#diff-1c5d110c7373f4586bcacf73aadb17fde97e9a247de07167f7732d24f152d360R23

yaf commented 1 year ago

J'ai l'impression que ça ne se produit plus. Je vais fermer le ticket.

adipasquale commented 3 weeks ago

je réouvre car ça se produit malheureusement encore :/

j’ai relu https://doc.scalingo.com/platform/app/postdeploy-hook et https://doc.scalingo.com/languages/ruby/rails/start#database-migration

notamment

The Ruby on Rails application starts up before the schema has changed, but does not receive a query. The way ActiveRecord works on Rails is that the schema is loaded into memory on the first query on a given table.

dans les logs je vois

 Postdeploy hook detected, starting 'bundle exec rake db:schema:load db:seed'
2024-06-17 09:49:27.640489574 +0200 CEST [manager] container [postdeploy-5069] (666fea86affe5702b72dcb91) started with the command 'bundle exec rake db:schema:load db:seed'
2024-06-17 09:49:49.343697528 +0200 CEST [postdeploy-5069] I, [2024-06-17T07:49:31.806613 #18]  INFO -- : [SKYLIGHT] [6.0.1] Unable to start, see the Skylight logs for more details
2024-06-17 09:49:50.826124972 +0200 CEST [manager] container [postdeploy-5069] (666fea86affe5702b72dcb91) has stopped
 Postdeploy hook succeeded, accepting deployment... 
 <-- https://demo-rdv-solidarites-pr4336.osc-secnum-fr1.scalingo.io -->

on voit que Skylight pense que le container postdeploy n’a pas démarré correctement 🤔

ma suspicion c’est qu’il y a une requête qui est fait aux containers web principaux avant que le postdeploy hook n’ait pu s’éxecuter

betagouv / rdv-service-public

Corriger l’erreur 500 à la première utilisation des review apps #2139