AlexsLemonade / scpca-portal

Single-cell Pediatric Cancer Atlas Portal is a growing database of uniformly processed single-cell data from pediatric cancer tumors and model systems
https://scpca.alexslemonade.org
BSD 3-Clause "New" or "Revised" License

Refactoring of Load_Data #634

Closed avrohomgottlieb closed 4 months ago

avrohomgottlieb commented 6 months ago

Context

This epic tracks the refactoring of the load_data management command and its related functionality. The goal is to better separate the concerns of metadata extraction and file computation, and to prepare for the upcoming integration of AWS Batch (for file computation).

Goal

When this epic is complete, the changes should result in the following:

Notes on Approaches

Realistically this will be a combination of two approaches.

Option 1 "Inheritance":

Ex:

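from django.core.management.base import BaseCommand
from django.db import models


class ProjectLoader(models.Model):
    """Common attributes for Project, Sample, and ComputedFile models."""

    class Meta:
        abstract = True

    @classmethod
    def load_data(cls):
        # Classmethod so it can be called on the model class itself.
        pass


class Project(ProjectLoader):
    pass


# load_data.py
class Command(BaseCommand):
    # Django management commands implement handle() as their entry point.
    def handle(self, *args, **options):
        Project.load_data()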
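
The Django snippet above needs a configured project to run, so here is a minimal plain-Python sketch of the same inheritance idea. `LoaderBase`, `build`, the `identifier` attribute, and the record layout are all hypothetical names chosen for illustration, not the portal's actual API. The point is only that `load_data` is written once on a shared base and each model supplies its own hook.

```python
class LoaderBase:
    """Hypothetical shared base; plays the role of the abstract ProjectLoader."""

    def __init__(self, identifier):
        self.identifier = identifier

    @classmethod
    def load_data(cls, records):
        # Shared loading logic lives here once; subclasses customize build().
        return [cls.build(record) for record in records]

    @classmethod
    def build(cls, record):
        raise NotImplementedError("subclasses decide how a record becomes an object")


class Project(LoaderBase):
    @classmethod
    def build(cls, record):
        # Assumed record layout: {"id": ...}
        return cls(record["id"])


class Sample(LoaderBase):
    @classmethod
    def build(cls, record):
        return cls(record["id"])
```

Calling `Project.load_data([...])` returns `Project` instances and `Sample.load_data([...])` returns `Sample` instances, even though neither subclass defines `load_data` itself.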

Option 2 "Modularity":

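from django.core.management.base import BaseCommand


class Loader:
    @staticmethod
    @download_if_missing  # decorator is not defined in this sketch
    def load_project():
        pass

    # reload_project, reload_sample, load_sample, and purge_project
    # would be defined on Loader in the same way.


# management command: delegates to another class (Loader) that does the actual thing
class Command(BaseCommand):
    def handle(self, *args, **options):
        # handle inputs
        Loader.reload_project()
        Loader.reload_sample()
        Loader.load_project()
        Loader.load_sample()
        Loader.purge_project()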
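
The `download_if_missing` decorator is only named in the sketch above, so the following is one possible shape for it, not the implementation that was merged. It assumes the decorated function takes the input path as its first argument. `fetch_inputs` is a made-up stand-in for the S3 download and simply writes a placeholder file.

```python
import functools
from pathlib import Path


def fetch_inputs(path: Path) -> None:
    """Hypothetical stand-in for an S3 download: writes a placeholder file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("placeholder metadata\n")


def download_if_missing(func):
    """Fetch the input file before calling func if it is not on disk yet."""

    @functools.wraps(func)
    def wrapper(path, *args, **kwargs):
        path = Path(path)
        if not path.exists():
            fetch_inputs(path)
        return func(path, *args, **kwargs)

    return wrapper


class Loader:
    @staticmethod
    @download_if_missing
    def load_project(path):
        # By the time this runs, the decorator guarantees the file exists.
        return path.read_text()
```

Keeping the download check in a decorator means each `Loader` method can assume its inputs are present, and the command itself does not need to know where the files come from.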

Directory Structure

Services/
|------ FileSystem (setup work dir / clean up work dir)
|------ Downloader (downloading data from s3)
|------ Metadata (get downloaded files -> create new combined metadata)
|------ Factory (create and save projects / samples from metadata)
|------ Archiver (create zip files from a project / sample in the database)
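
To make the division of responsibilities concrete, here is a rough sketch of how the five services could hand work to one another. Only the service names come from the list above; every method name, the CSV layout, and the sample and project IDs are assumptions made up for illustration. Local files stand in for S3 and a dict stands in for the database.

```python
import csv
import shutil
import tempfile
import zipfile
from pathlib import Path


class FileSystem:
    """Set up and clean up the working directory."""

    @staticmethod
    def setup():
        return Path(tempfile.mkdtemp(prefix="load_data_"))

    @staticmethod
    def cleanup(work_dir):
        shutil.rmtree(work_dir, ignore_errors=True)


class Downloader:
    """Stand-in for the S3 download: writes placeholder CSV files locally."""

    @staticmethod
    def download(work_dir):
        paths = []
        for sample_id in ("S1", "S2"):  # made-up IDs
            path = work_dir / f"{sample_id}.csv"
            path.write_text(f"sample_id,project_id\n{sample_id},P1\n")
            paths.append(path)
        return paths


class Metadata:
    """Combine the downloaded files into one list of metadata rows."""

    @staticmethod
    def combine(paths):
        rows = []
        for path in sorted(paths):
            with path.open(newline="") as handle:
                rows.extend(csv.DictReader(handle))
        return rows


class Factory:
    """Stand-in for creating and saving projects / samples (dict, not a database)."""

    @staticmethod
    def create(rows):
        projects = {}
        for row in rows:
            projects.setdefault(row["project_id"], []).append(row["sample_id"])
        return projects


class Archiver:
    """Create one zip file per project from what the Factory saved."""

    @staticmethod
    def archive(projects, out_dir):
        archives = []
        for project_id, sample_ids in projects.items():
            archive_path = out_dir / f"{project_id}.zip"
            with zipfile.ZipFile(archive_path, "w") as zf:
                for sample_id in sample_ids:
                    zf.writestr(f"{sample_id}.txt", f"sample {sample_id}\n")
            archives.append(archive_path)
        return archives


def run_pipeline(out_dir):
    """Chain the services; the work dir is removed even if a step fails."""
    work_dir = FileSystem.setup()
    try:
        rows = Metadata.combine(Downloader.download(work_dir))
        projects = Factory.create(rows)
        return projects, Archiver.archive(projects, out_dir)
    finally:
        FileSystem.cleanup(work_dir)
```

Each service only sees the output of the one before it, which is what would let the Archiver step later move to a different runtime without touching the others.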

Command Workflow

LoadDataCommand // Runs on API (fast) / updates database
CreateZipCommands // Runs on Batch (slow) / updates s3
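
A toy sketch of this split, assuming the fast command only records metadata and queues work, while the slow command is run once per queued project. The in-memory `database`, `zip_jobs`, and `bucket` objects are hypothetical stand-ins for the portal database, a Batch job queue, and S3; none of these names come from the codebase.

```python
from collections import deque

database = {}       # stands in for the portal database
zip_jobs = deque()  # stands in for a Batch job queue
bucket = {}         # stands in for S3


def load_data_command(metadata):
    """Fast step: update the database and queue the slow work for later."""
    for project_id, sample_ids in metadata.items():
        database[project_id] = list(sample_ids)
        zip_jobs.append(project_id)


def create_zip_command(project_id):
    """Slow step: build the archive for one project and upload it."""
    # Placeholder payload; a real job would write an actual zip file.
    payload = "\n".join(database[project_id]).encode()
    bucket[f"{project_id}.zip"] = payload
```

After `load_data_command` returns, the database is already up to date and nothing has been written to the bucket. Draining `zip_jobs` through `create_zip_command` can then happen later and elsewhere.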

davidsmejia commented 6 months ago

Yeah this is good, just going to add that our high-level approach should be:

avrohomgottlieb commented 4 months ago

Addressed by merging feature/refactor-load-data in PR https://github.com/AlexsLemonade/scpca-portal/pull/713