[MVP] Initialize Flask app

btylerburton commented 1 year ago

User Story

In order to begin work on the MVP for Harvesting 2.0, datagovteam would like to initialize a Flask application.

Related to:

https://github.com/GSA/data.gov/issues/4317

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

[ ] GIVEN [a contextual precondition] \ [AND optionally another precondition] \ WHEN [a triggering event] happens \ THEN [a verifiable outcome] \ [AND optionally another verifiable outcome]

Background

[Any helpful contextual notes or links to artifacts/evidence, if needed]

Security Considerations (required)

[Any security concerns that might be implicated in the change. "None" is OK, just be explicit here!]

Sketch

[ ] Create new application in Cloud.gov development space
[ ] Create new repo / directory, manifest, and requirements that will initialize a flask app on demand when pushed to that new Cloud.gov application

nickumia-reisys commented 1 year ago

Posting this here so that others can see where @Jin-Sun-tts got inspiration from to do some of the work for this ticket...

dataset view route | /dataset (similar to /api/action/package_search)
- read all datasets from db
- return json view of datasets
```
datasets = query_dataset_table()
json_view = [tojson(dataset) for dataset in datasets]
return json_view
```
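As a rough, framework-free sketch of the `/dataset` idea: the snippet below assumes a SQLite table named `dataset` with `id` and `title` columns. Both the table and the columns are hypothetical, since the database design is still pending. In the Flask app, the body of `dataset_view` would sit under a route decorator.

```python
import json
import sqlite3


def query_dataset_table(conn):
    """Read every row from the (hypothetical) dataset table as dicts."""
    conn.row_factory = sqlite3.Row
    rows = conn.execute("SELECT id, title FROM dataset ORDER BY id").fetchall()
    return [dict(row) for row in rows]


def dataset_view(conn):
    """Return the JSON body a /dataset route could send back."""
    return json.dumps(query_dataset_table(conn))


# Demo with an in-memory database and made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataset (id TEXT PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO dataset VALUES (?, ?)",
    [("a1", "Example A"), ("b2", "Example B")],
)
body = dataset_view(conn)
```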

harvest source create route | /harvest/create

- pass data as arguments
- returns success/error information + source id

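def create_route(name, url):
    # both arguments must pass validation
    if valid(name) and valid(url):
        try:
            source_id = generate_uuid()
            result = create_dataset_record(source_id, name, url)
            return json(source_id, result)
        except Exception:
            # db error?
            return json(error="could not create source")
    else:
        # respond with whether name or url was invalid
        return json(error="invalid name or url")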
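The route above leaves `valid` and `generate_uuid` undefined. One possible way to fill them in is sketched below, using only the standard library. The validation rules (non-empty name, http(s) URL with a host) and the helper names `valid_name`, `valid_url`, `check_source_args`, and `new_source_id` are assumptions for illustration, not something decided in this ticket.

```python
import uuid
from urllib.parse import urlparse


def valid_name(name):
    """Hypothetical rule: a non-empty string after trimming whitespace."""
    return isinstance(name, str) and bool(name.strip())


def valid_url(url):
    """Hypothetical rule: http(s) scheme plus a host."""
    if not isinstance(url, str):
        return False
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def check_source_args(name, url):
    """Return the names of the arguments that failed validation."""
    errors = []
    if not valid_name(name):
        errors.append("name")
    if not valid_url(url):
        errors.append("url")
    return errors


def new_source_id():
    """Generate a random UUID string for a new harvest source."""
    return str(uuid.uuid4())
```

Returning the list of failing argument names lets the route respond with exactly which of name or url was invalid.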

harvest source view route | /harvest/source/

- pass id of source
- return source information

def harvest_view(source_id):
    try:
        return json(query_source_table(source_id))
    except Exception:
        # not a valid source
        return json(error="not a valid source")

Is harvest job creation different from running?

harvest job create route | /harvest/create/

- manual runs only
- return job id + summary

import threading

class Job:
    def __init__(self, source_id):
        self.name = ""
        self.state = ""
        self.source_id = source_id

    def run(self):
        try:
            success, s3_paths = extract()
            working_datasets = compare(s3_paths, self.source_id)
        except Exception:
            # job failed: extract and compare errors are fatal
            self.state = "error"
            return

        threads = []
        for wd in working_datasets:
            # args must be a tuple, hence the trailing comma
            wip = threading.Thread(target=self.process_dataset, args=(wd,))
            wip.start()
            threads.append(wip)

        # wait for every dataset thread to finish
        for thread in threads:
            thread.join()

        # controller creates job summary

    def process_dataset(self, dataset):
        if validate(dataset):
            new_dataset = transform(dataset)
            success = load(new_dataset)

harvest job run route | /harvest/run/
- pass job id
- return summary

harvest job summary route | /harvest/status/

- pass job id and/or source id
- return list of jobs and statuses

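def job_summary(job_id=''):
    if job_id and not valid(job_id):
        # respond with job id is invalid
        return json(error="invalid job id")
    try:
        if job_id:
            summary = query_source_table(job_id)
        else:
            # no job id given: list all jobs
            summary = query_source_table(all=True)
    except Exception:
        # db error
        return json(error="db error")
    return json(summary)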

dcat-us extract | /extract/???

- (component done?)
- this is the controller route that kicks off the component
- pass url
- return reference to list of datasets (stored in s3)
- kick off compare
- for each job, only one process
- mark job as success/error on extract
  - fatal if error (stop job)

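import harvester.extract as he

def extract(source_id, job_id, url):
    if not valid(source_id):
        # respond accordingly
        return json(error="invalid source id")
    if not valid(job_id):
        # respond accordingly
        return json(error="invalid job id")

    success, s3_paths = he.main({"job_id": job_id, "source_id": source_id, "url": url})
    if not success:
        # update job status to error (fatal, stop job)
        return json(error="extract failed")
    # reference to the list of datasets stored in s3
    return json(s3_paths)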

dcat-us compare | /compare/???
- (???)
- pass list of datasets
- return reference to changes/new/deletions (stored in s3)
- kick off validate for each dataset
- for each job, only one process
- mark job as success/error on compare
  - fatal if error (stop job)
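The compare step is still an open question here, so the following is only one way it could work: hash each harvested dataset and diff against the hashes stored from the previous harvest. It assumes every dataset dict carries an `identifier` key; `dataset_hash` and the `{identifier: hash}` storage shape are made up for this sketch.

```python
import hashlib
import json


def dataset_hash(dataset):
    """Stable hash of a dataset dict (key order does not matter)."""
    encoded = json.dumps(dataset, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def compare(harvested, stored_hashes):
    """Split identifiers into new, changed, and deleted.

    harvested: list of dataset dicts, each with an "identifier" key.
    stored_hashes: {identifier: hash} from the previous harvest.
    """
    current = {d["identifier"]: dataset_hash(d) for d in harvested}
    return {
        "new": sorted(set(current) - set(stored_hashes)),
        "changed": sorted(
            i for i in current
            if i in stored_hashes and current[i] != stored_hashes[i]
        ),
        "deleted": sorted(set(stored_hashes) - set(current)),
    }
```

Unchanged datasets drop out of all three lists, so only real work is handed to validate.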

dcat-us validate | /validate/???
- (component done?)
- this is the controller route that kicks off the component
- pass dataset
- return valid/invalid
- stops + logs error if invalid
- kicks off transform if valid
- for each job, there could be any number of parallel processes
  - if there's 100 datasets,
    - we could run 10 processes and queue them to complete
    - we could run 100 processes and just have them run a specific dataset
- mark job as in "validation"
  - non-fatal if error (aggregate errors)
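To make the "10 workers with a queue" option concrete, here is a small sketch using a bounded thread pool from the standard library (threads rather than processes, matching the `threading` usage earlier in this comment). The `validate` rule is a placeholder, and `validate_all` is a hypothetical name. Errors are collected rather than raised, in line with the non-fatal, aggregate-errors note above.

```python
from concurrent.futures import ThreadPoolExecutor


def validate(dataset):
    """Placeholder check: a dataset needs a non-empty title."""
    if not dataset.get("title"):
        raise ValueError("missing title")
    return dataset


def validate_all(datasets, max_workers=10):
    """Validate datasets on a bounded pool; collect errors instead of stopping."""
    valid, errors = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submissions beyond max_workers wait in the pool's internal queue.
        futures = [(d, pool.submit(validate, d)) for d in datasets]
        for dataset, future in futures:
            try:
                valid.append(future.result())
            except ValueError as exc:
                errors.append(
                    {"identifier": dataset.get("identifier"), "error": str(exc)}
                )
    return valid, errors
```

Setting `max_workers` to the number of datasets gives the "one worker per dataset" variant without changing the rest of the code.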

dcat-us transform | /transform/???
- no-op/empty

dcat-us load | /load/???
- (ideally into ckan?)
- practically speaking... for the MVP, add/update/delete to/from db?
- pass final record
- return success/error
- aggregate status for job completion
- for each job, there could be any number of parallel processes
  - if there's 100 datasets,
    - we could run 10 processes and queue them to complete
    - we could run 100 processes and just have them run a specific dataset
- mark job as in "load"
  - non-fatal if error (aggregate errors)

interact with s3
- (done?)
- track number of files in each component prefix
- inspect (read) any given file in s3
- upload new files to s3

insert/update/delete dataset from db
- (database design pending?)
- table for datasets
  - what minimum columns/data do we want to track?
- table for harvest sources
  - url
  - source id
- table for harvest jobs (link with harvest source)
  - job id
  - source id
  - number of datasets harvested (in s3?)
  - errors
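As a starting point for the pending database design, the tables listed above could look something like this in SQLite. The table names, the `name` column on sources (taken from the create route's arguments), and the minimal `dataset` columns are all assumptions; the question of which dataset columns to track is still open.

```python
import sqlite3

SCHEMA = """
CREATE TABLE harvest_source (
    id   TEXT PRIMARY KEY,
    name TEXT NOT NULL,
    url  TEXT NOT NULL
);
CREATE TABLE harvest_job (
    id                 TEXT PRIMARY KEY,
    source_id          TEXT NOT NULL REFERENCES harvest_source(id),
    datasets_harvested INTEGER NOT NULL DEFAULT 0,
    errors             TEXT
);
CREATE TABLE dataset (
    id        TEXT PRIMARY KEY,
    source_id TEXT NOT NULL REFERENCES harvest_source(id)
);
"""


def init_db(conn):
    """Create the three tables and turn on foreign key enforcement."""
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
init_db(conn)
conn.execute(
    "INSERT INTO harvest_source VALUES (?, ?, ?)",
    ("s1", "example", "https://example.gov/data.json"),
)
conn.execute("INSERT INTO harvest_job (id, source_id) VALUES (?, ?)", ("j1", "s1"))
```

The foreign key on `harvest_job.source_id` is what links a job to its harvest source; inserting a job for an unknown source fails.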

a job can have multiple states associated with it
- can be in validate/transform/load at the same time (depending on which dataset is processing)
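One way to represent that, sketched under the assumption that each dataset records its own current stage: the job's state is then just a count of datasets per stage. The `done` and `error` stage names and the `job_states` helper are hypothetical.

```python
from collections import Counter

# Stage names; "done" and "error" are assumed additions to validate/transform/load.
STAGES = ("validate", "transform", "load", "done", "error")


def job_states(dataset_stages):
    """Summarize which stages a job is in, given {dataset_id: stage}."""
    counts = Counter(dataset_stages.values())
    return {stage: counts[stage] for stage in STAGES if counts[stage]}
```

For example, `job_states({"a": "validate", "b": "load", "c": "load"})` reports one dataset in validate and two in load.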

no search functionality yet?
no database version control yet (alembic)
no frontend ui pages (only data view)

Jin-Sun-tts commented 1 year ago

https://github.com/GSA/datagov-harvesting-logic/pull/14
