GSA / srt-fbo-scraper

Using machine learning to predict Federal IT procurement compliance with Section 508 Accessibility Standards

BUG: Attachment scraper fails when links direct to FedConnect #88

Closed csmcallister closed 5 years ago

csmcallister commented 5 years ago

FedConnect is a non-government complement to fbo.gov and grants.gov. Some agencies apparently point to it when linking their solicitation docs on fbo.gov.

Expected Behavior

Script detects a FedConnect URL, handles the redirect, and then scrapes the attachment URLs.

Current Behavior

Script currently detects the redirect in get_fbo_attachments.FboAttachments.size_check() but then fails to handle it. The problem is that the first condition shouldn't return: a 302 is also a non-200 status, so that branch logs an error and returns False before the 302 case is ever reached:

if h.status_code != 200:
    logging.error(f"Non-200 status code ({h.status_code}) getting file size with HEAD request from {url}. \
                              This means the file wasn't downloaded.")
    return False
elif h.status_code == 302:

But even if it handled the redirect, it would be redirected to a FedConnect page that will require some scrape logic to get the attachment(s).

Possible Solution
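
One possible shape for the fix, as a rough sketch only: check for redirect status codes before the generic non-200 check, and flag FedConnect hosts so they can be routed to dedicated scrape logic. The helper names is_fedconnect and classify_head_status are hypothetical and do not exist in the repo; the set of redirect codes is also an assumption.

```python
from urllib.parse import urlparse

# Assumption: treat all of these as redirects, not only the 302 seen in the logs.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def is_fedconnect(url):
    """Return True if the URL's host is fedconnect.net or a subdomain of it."""
    host = urlparse(url).hostname or ""
    return host == "fedconnect.net" or host.endswith(".fedconnect.net")


def classify_head_status(status_code, url):
    """Decide what to do with a HEAD response (hypothetical helper).

    Redirects are tested first so they are not swallowed by the
    generic non-200 branch, which is the bug described above.
    """
    if status_code in REDIRECT_CODES:
        # FedConnect pages would need their own scrape logic.
        return "fedconnect" if is_fedconnect(url) else "redirect"
    if status_code != 200:
        return "error"
    return "ok"
```

With this ordering, the 302 from the FedConnect URL in the logs would be classified as "fedconnect" instead of being logged as a failed download.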

Steps to Reproduce (for bugs)

From the logs:

time="2018-12-14T04:02:31Z" level=info msg="[ERROR] Non-200 status code (302) getting file size
 with HEAD request from https://www.fedconnect.net/FedConnect/?
doc=28321319RI0000024&agency=SSA.                               This means the file wasn't downloaded." 
channel=stderr iteration=0 job.command="/usr/bin/env python3 /home/vcap/app/fbo.py" 
job.position=0 job.schedule="0 4 * * *" 

Context

It's unclear what proportion of FBO solicitations use FedConnect to host their docs, so the effects of not making this fix might be negligible.

csmcallister commented 5 years ago

These SO posts could be helpful: https://stackoverflow.com/questions/13147914/how-to-simulate-http-post-request-using-python-requests-module

https://stackoverflow.com/questions/31436679/how-to-download-a-file-from-a-link-which-is-javascript-enabled-in-python

csmcallister commented 5 years ago

Also, got this error from the logs:

[ERROR] Exception occurred getting file size with redirected HEAD request from 
https://www.fedconnect.net/FedConnect/?doc=28321319RI0000037&agency=SSA:
Invalid URL '/FedConnect/default.aspx?
ReturnUrl=%2fFedConnect%2f%3fdoc%3d28321319RI0000037%26agency%3dSSA&doc=28321319RI0000037&agency=SSA': 
No schema supplied. Perhaps you meant 
http:///FedConnect/default.aspx?ReturnUrl=%2fFedConnect%2f%3fdoc%3d28321319RI0000037%26agency%3dSSA&doc=28321319RI0000037&agency=SSA?

csmcallister commented 5 years ago

@sbchrist Note that aca9a11 fixed the invalid HEAD request error mentioned above.
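
For context on the "No schema supplied" error: the Location header FedConnect returns is a relative path, so passing it straight into a new request fails. A minimal sketch of one way to resolve it against the original URL with the standard library follows. This is not necessarily how aca9a11 fixed it, and resolve_redirect is a hypothetical helper name.

```python
from urllib.parse import urljoin


def resolve_redirect(request_url, location):
    """Turn a possibly relative Location header into an absolute URL.

    urljoin leaves an already absolute location unchanged and fills in
    the scheme and host from request_url when the location is relative.
    """
    return urljoin(request_url, location)


# URLs taken from the log excerpt above.
original = "https://www.fedconnect.net/FedConnect/?doc=28321319RI0000037&agency=SSA"
location = (
    "/FedConnect/default.aspx?"
    "ReturnUrl=%2fFedConnect%2f%3fdoc%3d28321319RI0000037%26agency%3dSSA"
    "&doc=28321319RI0000037&agency=SSA"
)
absolute = resolve_redirect(original, location)
```

The resolved URL keeps the https scheme and the www.fedconnect.net host from the original request, so a follow-up HEAD or GET has a complete URL to work with.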

csmcallister commented 5 years ago

closed by af1d69c 🎉