DanNBullock / USG_grants_crawl

An exploration of federal (and some non-federal) grant programs and awards relating to Open Science and Open Science Infrastructure, using data from grants.gov, NIH, NSF, and DOE. Implemented in Python and Jupyter notebooks.

downloading attachment files programmatically #2

Open DanNBullock opened 1 year ago

DanNBullock commented 1 year ago

The full text of a grant on grants.gov is often available under the "Related Documents" tab of its listing. However, that tab is rendered inside an ARIA / JavaScript element, which seems to be inaccessible to conventional crawling and scraping methods.

An alternative would be to work out how the documents themselves are stored and request them directly. Inspection of an example "Related Documents" page reveals the following.

From the page source of an attachments page:

    function downloadAttachment( attId ) {
        var tag = "downloadAttachment() ";
        try {
            var url = '/grantsws/rest/opportunity/att/download/' + attId;
            //alert( "url: " + url );
            downloadFile( url, 'attachmentDownload' );
        } catch ( error ) {
            alert( tag + error );
        }
    }

This suggests that

'/grantsws/rest/opportunity/att/download/' + attId

is a valid REST API target.

Judging from the API documentation, the full URL of a call (in the documented case) would be:

https://www.grants.gov/grantsws/rest/opportunities/search/cfda/totals

So perhaps in this case:

'https://www.grants.gov/grantsws/rest/opportunity/att/download/' + attId

Unfortunately, there is no list linking attIds to specific opportunity IDs, so it is unclear how to link specific documents to specific grants. It may simply be necessary to brute-force iterate through the document IDs. The document ID is a six-digit number (e.g. '324381'), which gives a rough order of magnitude for the range of IDs that may need to be brute-forced.
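A rough sketch of what that brute-force pass could look like, assuming the download URL above is correct and that a missing ID answers with an HTTP error rather than an empty 200. The function names, the one-second delay, and the "200 means it exists" rule are all assumptions for illustration, not anything confirmed by grants.gov.

```python
import time
import urllib.error
import urllib.request

BASE_URL = "https://www.grants.gov/grantsws/rest/opportunity/att/download/"


def attachment_url(att_id):
    """Build the candidate download URL for one attachment ID."""
    return BASE_URL + str(att_id)


def fetch_status(url, timeout=30):
    """Open the URL and return its HTTP status code (the body is not read)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx / 5xx responses arrive as exceptions; report their code instead.
        return err.code


def scan_ids(start, stop, fetch=fetch_status, delay=1.0):
    """Yield attachment IDs in [start, stop) whose URL answers with HTTP 200."""
    for att_id in range(start, stop):
        if fetch(attachment_url(att_id)) == 200:
            yield att_id
        time.sleep(delay)  # assumed politeness delay between requests
```

Something like `list(scan_ids(324000, 325000))` would then walk one block of a thousand IDs. The `fetch` argument is injectable so the loop can be tried against a stub before pointing it at the real server.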

DanNBullock commented 1 year ago

https://www.grants.gov/grantsws/rest/opportunity/att/download/324381

Tested and working.
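For reference, a minimal sketch of saving one attachment to disk with the standard library. It assumes the endpoint returns the file bytes directly, and it only guesses that the server might send a Content-Disposition filename; if it does not, the file is named after the ID. `pick_filename`, `download_attachment`, and the `attachments` folder are hypothetical names.

```python
import shutil
import urllib.request
from email.message import Message
from pathlib import Path

DOWNLOAD_URL = "https://www.grants.gov/grantsws/rest/opportunity/att/download/{att_id}"


def pick_filename(headers, att_id):
    """Use the Content-Disposition filename if one was sent, else fall back to the ID."""
    name = headers.get_filename()
    # Keep only the final path component so a header cannot point outside out_dir.
    return Path(name).name if name else f"attachment_{att_id}.bin"


def download_attachment(att_id, out_dir="attachments", timeout=60):
    """Stream one attachment to disk and return the path it was written to."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    url = DOWNLOAD_URL.format(att_id=att_id)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        target = out_path / pick_filename(response.headers, att_id)
        with open(target, "wb") as handle:
            shutil.copyfileobj(response, handle)
    return target
```

Calling `download_attachment(324381)` would then write the document tested above into `attachments/`.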

DanNBullock commented 1 year ago

Possibly the relevant page code for getting a listing of the documents:

    function displayRelatedDocumentsTables() {
        var tag = "displayRelatedDocumentsTables() ";
        log( tag );
        var htmlStr = '';
        var hasDocURLs = false;
        var hasChangeComments = false;

        if ( OPP_DETAILS.hasOwnProperty( "synAttChangeComments" ) ) {
            var comments = OPP_DETAILS.synAttChangeComments;

            if ( comments.length > 0 ) {
                hasChangeComments = true;
                if ( PRINT ) {
                    htmlStr = '<p><strong>Notification History:</strong></p>';
                    setStrValue( 'changeCommentsMsg', htmlStr );
                    document.getElementById( "changeCommentsMsg" ).style.display = "block";
                    htmlStr = createChangeCommentsTableForPrint( comments );
                } else {
                    htmlStr = createChangeCommentsTable( comments );
                }
                setStrValue( 'changeCommentsTable', htmlStr );
            }
        }

        htmlStr = '';// clear
        if ( hasDocURLs ) { htmlStr = '<br /><br />'; }// if

        if ( OPP_DETAILS.hasOwnProperty( "synopsisDocumentURLs" ) ) {
            var urls = OPP_DETAILS.synopsisDocumentURLs;

            if ( urls.length > 0 ) {
                hasDocURLs = true;
                if ( PRINT ) {
                    htmlStr = '<p><strong>Link(s):</strong></p>';
                    setStrValue( 'synopsisDocumentURLsMsg', htmlStr );
                    document.getElementById( "synopsisDocumentURLsMsg" ).style.display = "block";
                }// if

                htmlStr = createSynopsisDocURLsTable( urls );
                setStrValue( 'synopsisDocumentURLsTable', htmlStr );
            }// if

        }// if

        htmlStr = '';// clear
        if ( hasDocURLs ) { htmlStr = '<br /><br />'; }// if

        if ( oppDetailsHasAtts() ) {
            var folders = OPP_DETAILS.synopsisAttachmentFolders;

            if ( PRINT ) {
                htmlStr += '<p><strong>Attachment(s):</strong></p>';
                setStrValue( 'relatedDocumentsMsg', htmlStr );
                document.getElementById( "relatedDocumentsMsg" ).style.display = "block";
            }// if

            htmlStr = createSynopsisAttFoldersTable( folders );
            setStrValue( 'relatedDocumentsTable', htmlStr );

        }// if

    }// displayRelatedDocumentsTables
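The page script reads the attachments from `OPP_DETAILS.synopsisAttachmentFolders`, so if that object can be obtained as JSON, the attachment IDs for one opportunity could be pulled out of it directly instead of being brute-forced. A sketch of that step follows. Only `synopsisAttachmentFolders` comes from the script above; the nested `synopsisAttachments`, `id`, and `fileName` keys are guesses about the structure and would need checking against a real response.

```python
def attachment_ids(opp_details):
    """Collect (attachment id, file name) pairs from an opportunity-details dict."""
    found = []
    # 'synopsisAttachmentFolders' appears in the page script; the nested
    # 'synopsisAttachments', 'id' and 'fileName' keys are assumptions.
    for folder in opp_details.get("synopsisAttachmentFolders", []):
        for att in folder.get("synopsisAttachments", []):
            if "id" in att:
                found.append((att["id"], att.get("fileName")))
    return found
```

Each ID returned this way could then be appended to the download URL, which would give the missing link between opportunities and documents.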

DanNBullock commented 1 year ago

https://www.grants.gov/web/grants/s2s/grantor/web-services/get-related-document-details.html

DanNBullock commented 1 year ago

https://www.grants.gov/grantsws/rest/opportunity/details?oppId=262149&dataType=json

results in an HTTP 405 (Method Not Allowed) response.
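A 405 means the server rejected the HTTP method rather than the URL, so one thing worth trying is sending the same request as a POST. The sketch below does that with `oppId` as form data. Whether the endpoint accepts a form-encoded POST, and whether it answers with JSON, are both assumptions here and not something confirmed above.

```python
import json
import urllib.parse
import urllib.request

DETAILS_URL = "https://www.grants.gov/grantsws/rest/opportunity/details"


def build_details_request(opp_id):
    """Build a POST request carrying oppId as form data (assumed parameter placement)."""
    body = urllib.parse.urlencode({"oppId": opp_id}).encode("ascii")
    return urllib.request.Request(DETAILS_URL, data=body, method="POST")


def fetch_details(opp_id, timeout=30):
    """Send the POST and decode the response, assuming the body is JSON."""
    with urllib.request.urlopen(build_details_request(opp_id), timeout=timeout) as response:
        return json.load(response)
```

`fetch_details(262149)` would be the equivalent of the request above with the method switched.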

DanNBullock commented 1 year ago

solved: https://github.com/analyticsbot/Python-Code---Part-8/blob/031f432044902be694858b4f871a1f503c60fb94/grants.gov/open/get_data.py