Closed nickumia-reisys closed 1 year ago
Note: TWO harvesting pipelines have been deprecated (I believe both of these are FGDC/ISO, but not sure):
api csw ??? ???
api cms ???
Comment is in history. I deleted it to make the ticket cleaner. See diagrams below for the most up to date information.
Link to MD Translator spike: https://github.com/GSA/data.gov/issues/4200
... Moving on to file xml FGDC/ISO
Just as a random note, our DCAT code is much less organized than the spatial upstream code...
... Moving on to file xml waf FGDC/ISO
tomorrow
... Moving on to api json ARCGIS
next
... I'm done? 🎉
The diagrams in the comments above represent the core of the harvesting optimization problem: what happens when, what errors are not being captured, and what assumptions are made that fail to hold. The next step is reviewing the code, abstracting it into meaningful chunks, testing the functionality, preserving the best parts, and fixing the broken parts. One of the cornerstones of implementing a new version of this code is the following requirement:
1.2.3 Data.gov should be able to adapt to new data formats not originally accounted for in its design.
The controller diagram highlights some high-level abstractions for input/output definitions. However, within the extract component, for example, we want to be able to support multiple file types (and possibly new ones, e.g. rdf), so creating an abstraction between "data download" and "data parsing/input" would allow us to hook in a new file type. Further discussion will ensue and should answer many of the open questions. From there, it will be easier to start implementing and executing on this work.
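To make the "data download" vs. "data parsing/input" split concrete, here is a minimal sketch of one way it could look. It is not taken from the existing harvester code: the names `PARSERS`, `register_parser`, `extract`, and `fake_download` are all hypothetical, and the XML parser is a placeholder rather than real FGDC/ISO handling.

```python
import json
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List

# Hypothetical registry mapping a format name to a parser function.
# Nothing here exists in the current harvesters; it only illustrates the idea.
Parser = Callable[[bytes], List[dict]]
PARSERS: Dict[str, Parser] = {}


def register_parser(fmt: str) -> Callable[[Parser], Parser]:
    """Decorator that hooks a parser into the registry under `fmt`."""
    def decorator(func: Parser) -> Parser:
        PARSERS[fmt] = func
        return func
    return decorator


@register_parser("json")
def parse_dcat_json(raw: bytes) -> List[dict]:
    """Return the records in a data.json catalog's top-level "dataset" array."""
    return json.loads(raw).get("dataset", [])


@register_parser("xml")
def parse_xml(raw: bytes) -> List[dict]:
    """Placeholder: treat one XML document as one record, keeping the root tag."""
    root = ET.fromstring(raw)
    return [{"root_tag": root.tag, "raw": raw.decode("utf-8")}]


def extract(source: str, fmt: str, download: Callable[[str], bytes]) -> List[dict]:
    """Download bytes with `download`, then parse them with the parser for `fmt`.

    The download step knows nothing about formats, and the parsers know
    nothing about where the bytes came from.
    """
    if fmt not in PARSERS:
        raise ValueError(f"no parser registered for format {fmt!r}")
    return PARSERS[fmt](download(source))


# Usage with a stand-in download function (no network access needed).
def fake_download(source: str) -> bytes:
    return b'{"dataset": [{"title": "Example"}]}'


records = extract("https://example.invalid/data.json", "json", fake_download)
```

Under this assumption, supporting a new file type such as rdf would mean registering one more parser, with no change to the download step or to `extract` itself.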
User Story
In order to inform existing and new harvesting processes and procedures, the Data.gov Architect Team wants to document the FOUR pipelines that all harvesting travels through. These pipelines will either be (1) optimized in the current system or (2) fed into building a better new system from the start.
Acceptance Criteria
Background
Security Considerations (required)
...
Sketch
file json DCAT
DataJsonHarvester, DatasetHarvesterBase, HarvesterBase
file xml FGDC/ISO
GeoDataGovDocHarvester, DocHarvester, GeoDataGovHarvester, SpatialHarvester, HarvesterBase
file xml waf FGDC
GeoDataGovWAFHarvester, WAFHarvester, GeoDataGovHarvester, SpatialHarvester, HarvesterBase
api json ARCGIS
ArcGISHarvester, SpatialHarvester, HarvesterBase
api json DCAT
for large DCAT data.json files that are unwieldy when processed as a single entity.