Closed: davidem00 closed this issue 6 years ago
Hi @davidem00, this project creates the JSON-formatted data and hosts it on the web through an API service (as seen on https://api.fda.gov).
Regarding the underlying data, it's pulled in off the web from a handful of sources, which you can find in the various pipeline folders. Looking at one of the simpler examples, the Device Premarket Approval pipeline, you can see the data source on line 32: a web-hosted zip file managed by the FDA. The pipeline scrapes the website to download the zip file (lines 34-43), unzips and cleans the data (lines 45-59), maps the data and translates it into JSON format (lines 61-83), and harmonizes the data with the other device endpoints (lines 85-113). Finally, lines 116-122 load the data into Elasticsearch. All of these tasks are managed by Luigi, as called on lines 125-126.
Let me know if you have any further questions. I'll leave this issue open for a few days.
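To make the stages above concrete, here is a minimal sketch of the same flow in plain Python. It is not the project's code: the real pipeline runs these steps as Luigi tasks and loads the result into Elasticsearch, while this version chains ordinary functions and stops at JSON lines. The file name `pma.txt`, the pipe delimiter, the column names, and the `lookup` table used for harmonization are all assumptions made for illustration, and the zip is built in memory instead of being downloaded.

```python
import csv
import io
import json
import zipfile


def make_sample_zip():
    """Build an in-memory zip standing in for the downloaded FDA file (hypothetical content)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("pma.txt", "PMANUMBER|APPLICANT\nP000001| Acme Devices \n")
    return buf.getvalue()


def extract_rows(zip_bytes):
    """Unzip the first member and clean it: parse delimited text, strip whitespace."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        name = zf.namelist()[0]
        text = zf.read(name).decode("utf-8")
    reader = csv.DictReader(io.StringIO(text), delimiter="|")
    return [{key: value.strip() for key, value in row.items()} for row in reader]


def to_json_docs(rows):
    """Map source columns to the field names used in the output documents (assumed names)."""
    return [{"pma_number": r["PMANUMBER"], "applicant": r["APPLICANT"]} for r in rows]


def harmonize(docs, lookup):
    """Attach shared fields from a lookup keyed by PMA number (a stand-in for harmonization)."""
    return [dict(doc, openfda=lookup.get(doc["pma_number"], {})) for doc in docs]


def run_pipeline(zip_bytes, lookup):
    """Chain the stages and return one JSON string per document."""
    docs = harmonize(to_json_docs(extract_rows(zip_bytes)), lookup)
    return [json.dumps(doc, sort_keys=True) for doc in docs]


if __name__ == "__main__":
    for line in run_pipeline(make_sample_zip(), {"P000001": {"device_class": "3"}}):
        print(line)
```

In the actual repository each stage would be a separate Luigi task whose output feeds the next one, which is what lets a failed run resume from the last completed step.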
Ok, thanks. I'm specifically interested in data from the output of the /label endpoint.
Since I don't see a single pipeline for that, I'm guessing it's assembled from multiple sources to be presented in the form it is.
Which set of pipelines corresponds to that endpoint? (I don't need Events at this stage.)
While I realize it is peeking under the hood, I'm trying to see if understanding the underlying data will give me a better sense of how to parse out the results of the API that are clumped together into single values.
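For context, this is roughly how the clumped values look from the API side. The sketch below builds a query URL for the label endpoint and joins a list-valued section into one string. The helper names `label_query_url` and `section_text` are hypothetical, and the sample result is a made-up record shaped the way I understand label results to look (section fields as lists of strings), not real API output.

```python
from urllib.parse import urlencode

LABEL_ENDPOINT = "https://api.fda.gov/drug/label.json"


def label_query_url(search, limit=1):
    """Build a query URL for the label endpoint (search and limit are query parameters)."""
    return LABEL_ENDPOINT + "?" + urlencode({"search": search, "limit": limit})


def section_text(result, field):
    """Return one section as a single string; assumes sections arrive as lists of strings."""
    value = result.get(field, [])
    if isinstance(value, str):
        return value
    return "\n".join(value)


if __name__ == "__main__":
    # Hypothetical record for illustration only.
    sample = {"indications_and_usage": ["Uses", "temporarily relieves minor aches"]}
    print(label_query_url("openfda.brand_name:aspirin"))
    print(section_text(sample, "indications_and_usage"))
```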
The pipeline that corresponds to the Drug Label endpoint is here: https://github.com/FDA/openfda/blob/master/openfda/spl/pipeline.py
Does this project just obtain the current JSON downloads from the live openFDA api, and then do something new with them?
I saw another issue that makes mention of some "underlying" XML data. Where does that reside?