GSA / data.gov

Main repository for the data.gov service
https://data.gov

🍀 Document the FOUR Data Pipelines #4433

Closed nickumia-reisys closed 1 year ago

nickumia-reisys commented 1 year ago

User Story

In order to inform existing and new harvesting processes and procedures, the Data.gov Architect Team wants to document the FOUR pipelines that all harvesting travels through. These pipelines will either be (1) optimized in the current system or (2) used as input for building a better new system from the start.

Acceptance Criteria

Background

Security Considerations (required)

...

Sketch

nickumia-reisys commented 1 year ago

Note: TWO harvesting pipelines have been deprecated (I believe both of these are FGDC/ISO, but not sure):

nickumia-reisys commented 1 year ago

Comment is in history. I deleted it to make the ticket cleaner. See the diagrams below for the most up-to-date information.

btylerburton commented 1 year ago

Link to MD Translator spike: https://github.com/GSA/data.gov/issues/4200

nickumia-reisys commented 1 year ago
DCAT Pipeline initial pass complete.
![dcat](https://github.com/GSA/data.gov/assets/85196563/a149c4a1-190f-4525-8ba0-3247fc1afe40)

... Moving on to file xml FGDC/ISO

Just as a random note, our DCAT code is much less organized than the upstream spatial code...
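
For reference alongside the diagram, here is a minimal sketch of the gather logic it covers, under the assumption that the DCAT pipeline follows the usual data.json pattern (download the source catalog, hash each record, and diff against the previous run to decide create/update/delete). Function and parameter names are illustrative, not the actual catalog code:

```python
# Illustrative sketch only -- not the actual ckanext-datajson code.
# Assumes the common DCAT/data.json gather pattern: download the catalog,
# hash each record, and compare against the previous run's state.
import hashlib
import json
import urllib.request


def gather_dcat(source_url: str, existing: dict[str, str]) -> dict[str, list[str]]:
    """Return dataset identifiers to create, update, or delete.

    `existing` maps dataset identifier -> record hash from the previous run.
    """
    with urllib.request.urlopen(source_url) as resp:
        catalog = json.load(resp)

    # Hash each record so unchanged datasets can be skipped on import.
    seen: dict[str, str] = {}
    for record in catalog.get("dataset", []):
        identifier = record["identifier"]
        seen[identifier] = hashlib.sha256(
            json.dumps(record, sort_keys=True).encode()
        ).hexdigest()

    return {
        "create": [i for i in seen if i not in existing],
        "update": [i for i in seen if i in existing and seen[i] != existing[i]],
        "delete": [i for i in existing if i not in seen],
    }
```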

nickumia-reisys commented 1 year ago
Single XML Pipeline initial pass complete.
![single_xml](https://github.com/GSA/data.gov/assets/85196563/aefb436c-3f99-4aa9-bad9-79ab1325863f)

... Moving on to file xml waf FGDC/ISO tomorrow

nickumia-reisys commented 1 year ago
WAF XML Pipeline initial pass complete.
![waf_xml](https://github.com/GSA/data.gov/assets/85196563/eeb25e72-5ca3-4d40-9277-f0fc0c7794f6)

... Moving on to api json ARCGIS next

nickumia-reisys commented 1 year ago
ArcGIS Pipeline initial pass complete.
![arcgis](https://github.com/GSA/data.gov/assets/85196563/e977564f-c274-410e-941b-01e0d181cfd9)

... I'm done? 🎉

nickumia-reisys commented 1 year ago

The diagrams in the comments above represent the core of the harvesting optimization problem: What happens when...? What errors are not being captured...? What assumptions are made that fail to hold...? The next step is reviewing the code, abstracting it into meaningful chunks, testing the functionality, preserving the best parts and fixing the broken parts. One of the cornerstones of implementing a new version of this code is the following requirement:

1.2.3 Data.gov should be able to adapt to new data formats not originally accounted for in its design.

The controller diagram highlights some high-level abstractions for input/output definitions. For example, within the extract component we want to be able to support multiple file types (and possibly new ones, e.g. RDF), so creating an abstraction between "data download" and "data parsing/input" would allow us to hook in a new file type; a sketch of that idea follows below. Further discussion will follow and should answer many of the open questions. From there, it will be easier to start implementing and executing on this work.
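
To make that concrete, here is a minimal, hypothetical sketch of such a seam: a parser registry keyed by file type that sits between the download step and the parsing step. The names (`register_parser`, `extract`, the file-type strings) are illustrative assumptions, not existing catalog code:

```python
# Hypothetical sketch of the abstraction discussed above: a registry that sits
# between "data download" and "data parsing/input", so a new format (e.g. RDF)
# only needs a new parser registered -- download and the downstream
# transform/load stages stay unchanged.
import json
import xml.etree.ElementTree as ET
from typing import Callable

# file-type name -> function that turns raw downloaded bytes into parsed records
PARSERS: dict[str, Callable[[bytes], list[dict]]] = {}


def register_parser(file_type: str):
    """Decorator that registers a parser for one file type."""
    def decorator(fn: Callable[[bytes], list[dict]]):
        PARSERS[file_type] = fn
        return fn
    return decorator


@register_parser("dcat-json")
def parse_dcat(raw: bytes) -> list[dict]:
    return json.loads(raw).get("dataset", [])


@register_parser("fgdc-xml")
def parse_fgdc(raw: bytes) -> list[dict]:
    root = ET.fromstring(raw)
    return [{"title": (root.findtext(".//title") or "").strip()}]


def extract(raw: bytes, file_type: str) -> list[dict]:
    """The seam: download happens before this call, parsing is dispatched here."""
    try:
        parser = PARSERS[file_type]
    except KeyError:
        raise ValueError(f"No parser registered for {file_type!r}")
    return parser(raw)
```

Under this kind of design, supporting a new format such as RDF would only require registering one more parser; the download, transform, and load stages would not need to change.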