bikeactuary / repoRter.nih

R Interface to the National Institutes of Health's (NIH) RePORTER API v2

search for grants with null project start dates #1

Closed jssdenton closed 2 years ago

jssdenton commented 2 years ago

I'm interested in pulling records via repoRter.nih for which project_start_date is null or NA. This occurs in a not-insignificant number of grants over the timeframe I am analyzing. I've tried entering "NA", "null", and "" in the range for project start; the first two error out, and the third appears to return all grants for a given fiscal year. Is there a way to specify a project start date criterion that returns only entries for which this value is empty?

bikeactuary commented 2 years ago

@jssdenton I looked into the RePORTER side and see that it is not possible to specify NULL values for most (maybe all?) fields - haven't identified any exceptions.

You must be analyzing fairly recent data. I can say that if you look back far enough, there are no missing project_start_dates and no missing award_notice_dates. If you describe what you are trying to do in more detail, maybe I can suggest a path for you. In the worst case, you would pull every record in the recent fiscal year(s) of interest (note: currently NIH has awards up to FY 2023) and then subset to the NULL/NA rows. Of course, pulling an entire FY can mean result sets ~100K rows in size. I've found award_notice_date to be a more useful field for splitting large requests into individual result sets under 10K (the API maximum that can be retrieved from any single result set). I don't believe any calendar month in history has had more than 10K awards, so incrementing over calendar months works for now, though we are getting close to needing sub-month intervals to stay below 10K.

I have code at the end of the vignette demonstrating this hack using award_amount - maybe tweak that to utilize award_notice_date...
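Something like the following sketch is what I have in mind - iterating over calendar months of award_notice_date and binding the results. The criterion name and its from_date/to_date fields here are my assumption based on the API's 'Complete Payload' schema (linked further down in this thread), so double-check them before relying on this:

library(repoRter.nih)
library(dplyr)

## one request per calendar month of award_notice_date (2021 shown as an example)
month_starts <- seq(as.Date("2021-01-01"), as.Date("2021-12-01"), by = "month")
month_ends   <- seq(as.Date("2021-02-01"), as.Date("2022-01-01"), by = "month") - 1

monthly_results <- lapply(seq_along(month_starts), function(i) {
  req <- make_req(criteria = list(
    ## assumed criterion name and fields - verify against the payload schema
    award_notice_date = list(
      from_date = as.character(month_starts[i]),
      to_date   = as.character(month_ends[i])
    )
  ))
  get_nih_data(req, flatten_result = TRUE)
})

df <- bind_rows(monthly_results)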

jssdenton commented 2 years ago

Thanks for checking on this. The data I had been working with for comparison with the API output was the preassembled grant info (CSVs by month/year) that I had been pulling from the static URL. In my last pull from March, I had ~530K rows, for all grants assembled from the monthly/FY-based CSVs via bind_rows() %>% distinct().

Filtering these to grants for which project_start is NA/empty produced ~80K rows. Taking this full CSV data and running daff::diff_data() keyed on application_id showed that there were application_ids missed by the API call versus the prepackaged CSVs, so it does appear that there are still records with NA project_starts and other fields.

What I have done for my API search is a call that runs project_starts by month, 1970-present, with an additional fiscal_years filter over the date range of interest, 2015-2022. This appears to produce individual chunks no larger than 5K.
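For reference, a single month's request under that approach looks roughly like this (the project_start_date criterion with from_date/to_date is my assumption from the API payload schema, not something I have verified in the package docs):

library(repoRter.nih)

## hypothetical: one month of project_start_date plus a fiscal_years filter
req <- make_req(criteria = list(
  project_start_date = list(from_date = "2015-01-01", to_date = "2015-01-31"),
  fiscal_years = 2015:2022
))
df <- get_nih_data(req, flatten_result = TRUE)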

bikeactuary commented 2 years ago

> preassembled grant info (CSVs by month/year) that I had been pulling from the static URL. In my last pull from March, I had ~530K rows, for all grants assembled from the monthly/FY-based CSVs via bind_rows() %>% distinct().

Can you share code (plus a link to the specific CSV you compared against) so I can investigate? Also, where you say "monthly/FY-based", do you mean the full fiscal year CSV files? https://reporter.nih.gov/exporter

jssdenton commented 2 years ago

Basically yes, I think, re: the URL you mention. The link you reference is I think the newest version of the URL I had been using, which I incorporated from the old nihexporter R package by jayhesselberth: "https://exporter.nih.gov/CSVs/final/". Based on recent attempts that return empty files when the old URL is queried, I think the old static archive might have been entirely replaced with the new URL, which appears to be dynamic.

The pull on the old URL was done with a variation of the nihexporter shell script provided in that package. The bound-rows CSV is very large, even gzipped, but I have attached the application_ids corresponding to empty project_starts in the bound CSV. Some of these also have empty project_ends.

When I tried assembling a query list on these ids using make_req(), I got a "if you receive a non-200" warning, and noticed the appl_ids were contained inside boxes, e.g. "application_ids":[], ... . Do these need unbox()?

empty_starts.csv

bikeactuary commented 2 years ago

Just one point of validation while I wait for your MRE: if you use the front end today (2022-06-28, 10 AM EST) and request all projects for fiscal year 2021 (leaving the rest of the arguments at their defaults), you get back 82,661 project results.

Reproducing this in repoRter.nih obtains the same result:

> req2 <- make_req(criteria = list(fiscal_years = 2021))
This is your JSON payload:
{
    "criteria": {
        "fiscal_years": [
            2021
        ],
        "use_relevance": false,
        "include_active_projects": false,
        "exclude_subprojects": false,
        "multi_pi_only": false,
        "newly_added_projects_only": false,
        "sub_project_only": false
    },
    "offset": 0,
    "limit": 500
}

If you receive a non-200 API response, compare this formatting (boxes, braces, quotes, etc.) to the 'Complete Payload' schema provided here:
https://api.reporter.nih.gov/?urls.primaryName=V2.0#/Search/post_v2_projects_search 

> df2 <- get_nih_data(req2, flatten_result = TRUE, max_pages = 2)
Retrieving first page of results (up to 500 records)
max_pages set to 2 by user. Result set contains 166 pages. Only partial results will be retrieved.
Retrieving results 501 to 1000 of 82661

bikeactuary commented 2 years ago

I got a "if you receive a non-200" warning,

This is not a warning; it's a message that is printed with every call to make_req().
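If that repeated payload printout is noisy when building many requests in a loop, and assuming it is emitted via message() as described above (rather than cat()), it can likely be silenced with suppressMessages():

## hedged example: silences the informational printout, assuming message() is used
req <- suppressMessages(make_req(criteria = list(fiscal_years = 2021)))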

> When I tried assembling a query list on these ids using make_req()

I don't understand what you're attempting here - the API and this package do not allow you to search for/pull specific appl IDs.

Can you share an MRE (minimal reproducible example) - code I can run that reproduces the unexpected behavior you are describing? It sounds like the issue may be on the CSV side you are comparing against.

bikeactuary commented 2 years ago

> The link you reference is I think the newest version of the URL I had been using, which I incorporated from the old nihexporter R package by jayhesselberth: "https://exporter.nih.gov/CSVs/final/". Based on recent attempts that return empty files when the old URL is queried, I think the old static archive might have been entirely replaced with the new URL, which appears to be dynamic.

Also, the old URL you shared simply redirects to the current URL I shared. You are pulling the files I pointed to - the full fiscal year lists.

jssdenton commented 2 years ago

Perhaps we are talking about slightly different things. I'm interested in filling in missing values after a query that uses specified starts and ends, which excludes the NA starts/ends. Searching by FY appears to pick up everything, as you say, at the cost of rapidly approaching the API limit. Here is an example of the NAs in project_start using the CSV I provided:

library(repoRter.nih)
library(tidyverse)
library(lubridate)

empties <- read_csv("~/empty_starts.csv")

## first 100 application_ids with empty project_start in the bound CSV
e_test <- empties$application_id[1:100]

## one request per id
queries_empty <- lapply(seq_along(e_test), function(x) {
  make_req(criteria = list(appl_ids = as.numeric(e_test[x])))
})

test_list <- lapply(queries_empty, get_nih_data)

test_bind <- bind_rows(test_list) ## All project_starts are NA

jssdenton commented 2 years ago

So the above would not be picked up by my current search approach that specifies project_start explicitly by month.

Re: your comment about the redirect - the last time I tried to run that shell script was in March, and at that time it produced a ~4 KB "empty" file, and it looked like the static URLs for each year were gone.

bikeactuary commented 2 years ago

I see I missed that appl_ids is in fact a parameter in the API and in my package lol. Never used it myself...

Anyway, you're attempting to lapply over a vector of appl_ids, meaning you are sending a separate query for each individual ID. No iteration is necessary - pass the whole numeric vector of IDs to the appl_ids criterion...

req <- make_req(criteria = list(appl_ids = c(9157375, 9160924)))
df <- get_nih_data(req, flatten_result = TRUE)

If you have more than 10K appl_ids, then send requests in blocks of 10K at a time, e.g. for (i in 1:ceiling(length(appl_ids)/1e4)).
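Roughly like this, assuming appl_ids already holds the full numeric vector of application IDs (only make_req()/get_nih_data() are used as above; everything else is illustrative):

library(repoRter.nih)
library(dplyr)

batch_size <- 1e4
n_batches  <- ceiling(length(appl_ids) / batch_size)

## one request per block of up to 10K IDs
batches <- lapply(seq_len(n_batches), function(i) {
  idx <- ((i - 1) * batch_size + 1):min(i * batch_size, length(appl_ids))
  req <- make_req(criteria = list(appl_ids = appl_ids[idx]))
  get_nih_data(req, flatten_result = TRUE)
})

df <- bind_rows(batches)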

jssdenton commented 2 years ago

Ah great point. Yes, I was trying to cobble it together and was not thinking about efficiency...

This appl_id list is obviously missing everything since March or so, and I'm still wondering how to balance the top-down approach (FY picks up everything but returns too many results) and the bottom-up approach (explicit project_starts misses the NAs, which may not be queryable directly). It seems odd to me that the API does not allow the empties/NAs to be included as a search possibility, but I must admit I have limited experience with the requirements of API infrastructure.

bikeactuary commented 2 years ago

Ok I understand now. And I would not say this is a requirement of API infrastructure generally, it just seems to be how NIH/RePORTER is set up.

I think we need to start with the domain info: What is the meaning of a missing project_start_date? When does a record appear in the API back-end? Can it appear with NULL project start, and if so have the field filled in later? When is it filled?

I went down this rabbit hole a few months back when building this but don't recall the answer here. You may want to dig through the API docs and other info online. I recall there were some specific agencies/codes that were treated differently in some respects...

It may be that particular values in some other field(s) deterministically associate with a missing project start date. If so, and you can find them, then specifying those values in the corresponding API parameters might return your NAs. Perhaps take that larger spreadsheet, encode the features, and run a quick recursive partitioning/random forest on a missingness indicator target (obviously dropping the start date itself as a feature). If you find any association rules which perfectly or near-perfectly return the missing records, let me know and I will encode the rules into the library and expose them via a simple "return_na_only" argument...
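A rough sketch of that screening step, assuming the bound CSV has been read into a data frame with a project_start column (the file name and the dropped columns below are placeholders, not package names):

library(dplyr)
library(rpart)

grants <- read.csv("all_grants_bound.csv", stringsAsFactors = FALSE)  ## hypothetical file

model_df <- grants %>%
  mutate(start_missing = factor(is.na(project_start) | project_start == "")) %>%
  select(-project_start, -project_end) %>%                   ## drop the dates themselves
  select(-any_of(c("application_id", "project_title"))) %>%  ## drop ID / free-text fields
  mutate(across(where(is.character), as.factor))

fit <- rpart(start_missing ~ ., data = model_df, method = "class",
             control = rpart.control(cp = 0.001))

printcp(fit)  ## which fields split off the missing-start records?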

bikeactuary commented 2 years ago

> This appl_id list is obviously missing everything since March or so,

Are you saying that the NULL project_start_dates all correspond to award_notice_dates > ~March 2022? I don't understand "everything since March or so" - values since March in what field?

jssdenton commented 2 years ago

I like the idea of examining possible correlations. I'll look into this and let you know.

Sorry for not being clear, re: "since March." What I meant was that the list of application_ids having project_start == NA that I provided as the CSV was derived from the last full CSV bind I had done in March, and that there were likely now more application_ids that also had NA project_starts.

bikeactuary commented 2 years ago

Oh OK. Well, as to this inference:

> CSV bind I had done in March, and that there were likely now more application_ids that also had NA project_starts.

I think we can reject that out of hand, since the last time the file was updated was in February. Also, from the header on that CSV page:

> Note: The consolidated FY2021 project file is planned for release by end of January annually. This time is needed for the incorporation of the latest grant information as well as contracts, intramural projects, and interagency agreements. The consolidated FY2021 files will be updated again following the release of the President’s budget for FY2023, typically expected in February each year, to include the NIH Spending Categories data (RCDC). Once announced, it can take up to three weeks to be posted.

So they do an update in late Feb/early March (it was late Feb this year). I understand that is where you get your new records from. I'm not sure of the meaning of the missing dates etc., but I'm going to close this issue now since we have confirmed that the package returns the same results as the API's web interface. Please open a new issue if you do find anything interesting relating to the API missing records that are reported via the CSVs, though.

thx