Refactoring of Load_Data

avrohomgottlieb commented 6 months ago

Context

This epic tracks the refactoring of the load_data management command and its related functionality. The goal of this refactoring is to better separate concerns between metadata extraction and file computation, and to prepare us for the upcoming integration of AWS Batch (for file computation).

Goal

We want the changes implemented during this epic to result in the following:

Break up code along behavior
Move similar code together
Allow for different workflows for different environments / command arguments
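
To make the third goal concrete, here is a minimal sketch of how command arguments could select a workflow. The flag names (--purge, --environment) and the step names are hypothetical and chosen only for illustration; they are not the command's actual interface.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="load_data")
    # Both flags are hypothetical, used only to illustrate branching.
    parser.add_argument("--purge", action="store_true",
                        help="remove the existing project before loading")
    parser.add_argument("--environment", choices=["local", "batch"],
                        default="local", help="where file computation runs")
    return parser


def plan_workflow(args: argparse.Namespace) -> list:
    """Return the ordered step names for the given arguments."""
    steps = []
    if args.purge:
        steps.append("purge_project")
    steps += ["download_metadata", "combine_metadata", "create_project"]
    if args.environment == "batch":
        steps.append("submit_batch_job")   # file computation handed off
    else:
        steps.append("create_zip_files")   # file computation done in-process
    return steps
```

Keeping the planning step separate from execution means each environment's workflow can be inspected or tested without touching the database or S3.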

Notes on Approaches

Realistically this will be a combination of two approaches.

Option 1 "Inheritance":

Ex:

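# models.py
from django.db import models


class ProjectLoader(models.Model):
    """Common attributes for Project, Sample, and ComputedFile models."""

    class Meta:
        abstract = True

    # classmethod, because the command calls it on the class itself
    @classmethod
    def load_data(cls):
        pass


class Project(ProjectLoader):
    pass


# load_data.py
class Command:
    def run(self):
        Project.load_data()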

Option 2 "Modularity":

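class Loader:

    @staticmethod
    @download_if_missing  # decorator not defined here; assumed to live elsewhere
    def load_project():
        pass

    # reload_project, reload_sample, load_sample, and purge_project
    # would follow the same pattern

# management command
# another class that does the actual thing
class Command:
    def run(self):
        # handle inputs
        Loader.reload_project()
        Loader.reload_sample()
        Loader.load_project()
        Loader.load_sample()
        Loader.purge_project()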
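
The download_if_missing decorator used in Option 2 is not defined anywhere in this issue. As a rough sketch of what it might do, assuming it should fetch an input file only when it is absent locally: fake_download below is a hypothetical stand-in for the real S3 download, and the path argument is an assumption.

```python
import functools
import tempfile
from pathlib import Path

downloaded = []  # records which files the stand-in downloader fetched


def fake_download(path: Path) -> None:
    """Hypothetical stand-in for an S3 download; writes an empty JSON file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("{}")
    downloaded.append(path.name)


def download_if_missing(func):
    """Ensure the input file exists locally before calling func."""
    @functools.wraps(func)
    def wrapper(path: Path, *args, **kwargs):
        if not path.exists():
            fake_download(path)
        return func(path, *args, **kwargs)
    return wrapper


class Loader:
    @staticmethod
    @download_if_missing
    def load_project(path: Path) -> str:
        return path.read_text()
```

Calling Loader.load_project twice with the same path triggers the download only on the first call, which is the behavior the decorator name suggests.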

Directory Structure

Services/
|------ FileSystem (setup work dir / clean up work dir)
|------ Downloader (downloading data from s3)
|------ Metadata (get downloaded files -> create new combined metadata)
|------ Factory (create and save projects / samples from metadata)
|------ Archiver (create zip files from a project / sample in the database)
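
As an illustration of what one of these services could look like, here is a minimal sketch of FileSystem. The input/output directory layout and the setup method name are assumptions; clean_inputs and clean_all are taken from the workflow in this issue.

```python
import shutil
import tempfile
from pathlib import Path


class FileSystem:
    """Sketch of the FileSystem service: set up and clean up a work dir."""

    def __init__(self, root: Path):
        # Hypothetical layout: downloaded inputs and generated outputs
        # live in sibling directories under one root.
        self.input_dir = root / "input"
        self.output_dir = root / "output"

    def setup(self) -> None:
        for directory in (self.input_dir, self.output_dir):
            directory.mkdir(parents=True, exist_ok=True)

    def clean_inputs(self) -> None:
        shutil.rmtree(self.input_dir, ignore_errors=True)

    def clean_all(self) -> None:
        self.clean_inputs()
        shutil.rmtree(self.output_dir, ignore_errors=True)
```

Each service owns a single behavior, so the commands can compose them without knowing how the work directory is laid out.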

Command Workflow

LoadDataCommand // Runs on API (fast) / updates database

Purge.delete_project_from_rds() / remove from db
- Purge.delete_project_from_s3() / remove from aws
Purge.purge_project() (remove from db / remove computed files on aws)
FileSystem.clean_inputs()
Downloader.download_metadata()
Metadata.combine_metadata()
Factory.create_project() / Factory.create_sample() / set status to 'pending'
FileSystem.clean_all()
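
The LoadDataCommand steps above can be sketched as a plain sequence of service calls. Every service here is a stub that only records that it ran; the method signatures, the dict-based project, and the example ID are assumptions for illustration.

```python
calls = []  # records step order, for illustration only


def step(name):
    """Build a stub service method that only records that it ran."""
    def run(*args):
        calls.append(name)
    return staticmethod(run)


class Purge:
    purge_project = step("Purge.purge_project")


class FileSystem:
    clean_inputs = step("FileSystem.clean_inputs")
    clean_all = step("FileSystem.clean_all")


class Downloader:
    download_metadata = step("Downloader.download_metadata")


class Metadata:
    @staticmethod
    def combine_metadata(project_id):
        calls.append("Metadata.combine_metadata")
        return {"project_id": project_id}


class Factory:
    @staticmethod
    def create_project(metadata):
        calls.append("Factory.create_project")
        # New projects start as 'pending' until the zip files are computed.
        return {**metadata, "status": "pending"}


def load_data(project_id):
    Purge.purge_project(project_id)
    FileSystem.clean_inputs()
    Downloader.download_metadata(project_id)
    metadata = Metadata.combine_metadata(project_id)
    project = Factory.create_project(metadata)
    FileSystem.clean_all()
    return project
```

Because the command only orchestrates, swapping a stub for a real service (or skipping a step in a given environment) does not change the other steps.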

CreateZipCommands // Runs on Batch (slow) / updates s3

FileSystem.clean_inputs()
Downloader.download_data()
Metadata.generate_metadata_files()
Readme.generate_project_readmes()
Archiver.zip_project() / generates zip on local disk
- Factory.create_computed_file() / creates representation for the computed file
- Uploader.upload_project() / uploads computed file to s3
- Uploader.attach_to_project() / project.computed_files.add(computed_file)
FileSystem.clean_project()
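
For the Archiver.zip_project() step, a minimal sketch using the standard library's zipfile module might look like the following. The function signature and the idea of zipping one directory per project are assumptions, not the actual implementation.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_project(input_dir: Path, zip_path: Path) -> Path:
    """Write every file under input_dir into a zip on local disk.

    zip_path should live outside input_dir so the archive does not
    try to include itself.
    """
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for file in sorted(input_dir.rglob("*")):
            if file.is_file():
                # Store paths relative to the project directory.
                zf.write(file, arcname=file.relative_to(input_dir).as_posix())
    return zip_path
```

The returned path is what a later step would hand to the uploader and use to create the computed file record.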

davidsmejia commented 6 months ago

Yeah this is good, just going to add that our high level approach should be:

Break up long functions
collect / combine new functions / objects
move logic into new files related by behavior

avrohomgottlieb commented 4 months ago

Addressed by merging feature/refactor-load-data in PR https://github.com/AlexsLemonade/scpca-portal/pull/713
