AyOK-Code / oscn_scraper

MIT License

Throw error if scraper returns a captcha form #21

Open sgelbart opened 2 years ago

sgelbart commented 2 years ago

Given that too many requests have been made to OSCN and a captcha (or other invalid content) is shown, when the scraper gets the HTML of the page, then it should raise a custom exception so that the HTML is not saved or parsed.
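A rough sketch of the behaviour described here, not the gem's actual code. The names `CaptchaError` and `ensure_not_captcha!` are hypothetical, and the `g-recaptcha` check assumes the lockout page looks like the sample HTML posted in this issue.

```ruby
# Hypothetical error class; the name is illustrative, not taken from the gem.
class CaptchaError < StandardError
  def initialize(msg = 'OSCN returned a captcha form instead of the requested page')
    super
  end
end

# Raises before the caller gets a chance to save or parse the HTML.
# Assumption: a lockout page always contains the 'g-recaptcha' marker.
def ensure_not_captcha!(html)
  raise CaptchaError if html.include?('g-recaptcha')

  html
end

# Usage: the save step is never reached for a captcha page.
saved = []
begin
  saved << ensure_not_captcha!('<div class="g-recaptcha"></div>')
rescue CaptchaError => e
  warn e.message
end
```

Raising a dedicated exception class (rather than returning `nil`) should let the Sidekiq job fail with a recognisable error message, which is what step 3 of the end to end test checks for.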

You will need to update this project as well as the oscn project to complete end to end testing.

End to end testing:

  1. Ensure there is a lockout by hitting 15 or more pages within a minute (if you have already filled out the captcha, it may take up to 1000 pages to trigger it again)
  2. Run a scraper in sidekiq
  3. Confirm that the job fails in sidekiq with the correct error message
  4. Confirm that the html is not saved to an html table

Notes:

sgelbart commented 2 years ago

Sample html


<html>
  <head>
    <meta name="viewport" content="width=device-width, minimum-scale=1, initial-scale=1">  
    <title>reCAPTCHA demo: Simple page</title>
    <script src="https://www.google.com/recaptcha/api.js" async defer></script> 
    <style>
        .form_container {
            padding-top: 2rem;
            width: 304px;
            text-align: center;
            margin: auto;
        }   
        .form_container input[type='submit'] {
            padding: 0.75rem;
            text-transform: uppercase;
        }
    </style>
  </head>
  <body>
    <div class="form_container">
    <form action="/recaptcha/recaptcha.aspx" method="POST">
      <div class="g-recaptcha" data-sitekey="6Ldu8X0UAAAAAMhAy59I5kHOnLqZ6xlCipROyOZE"></div>
      <br/>
      <input type="submit" value="Submit">
      <input type='hidden' id='source_uri' name='source_uri' value='L2RvY2tldHMvR2V0Q2FzZUluZm9ybWF0aW9uLmFzcHg/ZGI9dHVsc2EmbnVtYmVyPVRSLTIwMjItNzQzOQ=='>
      </script>
    </form>
    </div>
  </body>
</html>
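One possible way to detect this page, sketched with plain string checks. The helper names `captcha_page?` and `blocked_path` are hypothetical, and the markers are taken from the sample above; whether OSCN always serves this exact markup is an assumption.

```ruby
# Markers copied from the sample page; treating them as stable is an assumption.
CAPTCHA_MARKERS = ['class="g-recaptcha"', '/recaptcha/recaptcha.aspx'].freeze

# True when the HTML contains any of the captcha markers.
def captcha_page?(html)
  CAPTCHA_MARKERS.any? { |marker| html.include?(marker) }
end

# Returns the Base64-decoded source_uri field, or nil when the field is absent.
# Useful for naming the blocked page in the error message.
def blocked_path(html)
  match = html.match(/name=['"]source_uri['"]\s+value=['"]([^'"]+)['"]/)
  match && match[1].unpack1('m')
end

# Trimmed copy of the sample page, keeping only the parts the helpers look at.
sample = <<~HTML
  <form action="/recaptcha/recaptcha.aspx" method="POST">
    <div class="g-recaptcha"></div>
    <input type='hidden' id='source_uri' name='source_uri' value='L2RvY2tldHMvR2V0Q2FzZUluZm9ybWF0aW9uLmFzcHg/ZGI9dHVsc2EmbnVtYmVyPVRSLTIwMjItNzQzOQ=='>
  </form>
HTML

puts captcha_page?(sample)
puts blocked_path(sample)
```

For the sample, the hidden `source_uri` field decodes to `/dockets/GetCaseInformation.aspx?db=tulsa&number=TR-2022-7439`, so the exception message could include the page that was being requested when the lockout hit.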