AlexsLemonade / scpca-portal

Single-cell Pediatric Cancer Atlas Portal is a growing database of uniformly processed single-cell data from pediatric cancer tumors and model systems
https://scpca.alexslemonade.org
BSD 3-Clause "New" or "Revised" License

Management command for portal wide metadata download #797

Closed nozomione closed 1 month ago

nozomione commented 2 months ago

Context

Parent issue: #708

Feature branch: feature/portal-metadata-command

We'll be adding the portal-wide metadata-only download feature to the portal. This allows users to download and review all the metadata available on the portal without having to download the projects' computed files.

The downloadable portal-wide metadata zip file will contain the following files:

- `README.md`
- `metadata.tsv`

Problem or idea

To accomplish this, we should do the following:

Breakdown of the expected steps
The expected steps are as follows:

1. In the `ComputedFile` model, add the following constants and a model field. (**NOTE:** The readme's output data path is omitted here to demonstrate the use of a buffer, which is subject to change. See the temporary comment added at the bottom of this issue, which will be deleted later.)

   - A new boolean field `portal_metadata_only` for FE:

     ```py
     portal_metadata_only = models.BooleanField(default=False)
     ```

   - The data path of the output TSV file that is used to write and generate the `metadata.tsv` contained in the zip file:

     ```py
     OUTPUT_PORTAL_METADATA_FILE_PATH = common.OUTPUT_DATA_PATH / "portal_metadata.tsv"
     ```

   - The name of the computed file zip that will be uploaded to the S3 bucket for a file download:

     ```py
     OUTPUT_PORTAL_METADATA_COMPUTED_FILE_NAME = "portal_metadata.zip"
     ```

2. Create a new class method `ComputedFile::get_portal_metadata_file`, which will be executed from the management command. This method does the following:

   - Query all library metadata from the `Library` model
   - Perform the following steps to generate the output TSV file that is used to write `metadata.tsv`:
     - `Library::get_combined_library_metadata` to aggregate the combined portal-wide metadata
     - `common.METADATA_COLUMN_SORT_ORDER` and `utils.filter_dict_list_by_keys` to remove unneeded keys
     - `metadata_file.write_metadata_dicts` to write the output TSV file to the data path
   - Create the downloadable zip file containing `README.md` and `metadata.tsv` for the S3 bucket
   - Create the portal-wide metadata `computed_file` instance for the database

   e.g.)

   ```py
   # models/computed_file.py
   @classmethod
   def get_portal_metadata_file(cls, workflow_versions):
       """
       Queries all libraries to aggregate the combined metadata,
       writes the aggregated combined metadata to an output TSV file,
       computes a zip archive using the output TSV and the readme buffer,
       and instantiates and returns a computed file object.
       """
       # Query all libraries
       libraries = Library.objects.all()

       # Return early if there are no libraries
       if not libraries.exists():
           return

       # Aggregate the combined metadata
       libraries_metadata = [
           lib for library in libraries for lib in library.get_combined_library_metadata()
       ]
       # Remove unnecessary metadata keys based on the common sort order
       filtered_libraries_metadata = utils.filter_dict_list_by_keys(
           libraries_metadata, common.METADATA_COLUMN_SORT_ORDER
       )

       # Write the filtered metadata to the output TSV file
       metadata_file.write_metadata_dicts(
           filtered_libraries_metadata, cls.OUTPUT_PORTAL_METADATA_FILE_PATH
       )

       # Create the portal-wide metadata computed file instance
       computed_file = cls(
           portal_metadata_only=True,  # pass the flag for FE
           s3_bucket=settings.AWS_S3_BUCKET_NAME,
           s3_key=cls.OUTPUT_PORTAL_METADATA_COMPUTED_FILE_NAME,
           workflow_version=utils.join_workflow_versions(workflow_versions),
       )

       # Create the zip file for a file download
       # NOTE: This will change once the buffer implementation is available
       with ZipFile(computed_file.zip_file_path, "w") as zip_file:
           # For README.md
           zip_file.writestr(
               cls.OUTPUT_README_FILE_NAME,
               cls.get_portal_metadata_readme(),  # using the buffer value
           )
           # For metadata.tsv
           zip_file.write(
               cls.OUTPUT_PORTAL_METADATA_FILE_PATH,  # using the output file
               computed_file.metadata_file_name,
           )

       # Record the zip size only after the archive has been written
       computed_file.size_in_bytes = computed_file.zip_file_path.stat().st_size

       return computed_file
   ```

3. Add a new management command file `create_portal_metadata.py` to `management/commands/`.

   This management command does the following:

   - Call `ComputedFile::get_portal_metadata_file` to:
     - Create the portal-wide computed file instance `computed_file`
     - Create a zip file `portal_metadata.zip` for a file download
   - Perform the following steps by calling the instance method `process_computed_file` (*which will be deleted in upcoming changes; see the sketch after this list*):
     - Save `computed_file` to the database
     - Upload `portal_metadata.zip` to the S3 bucket
     - Clean up the output data file (optional)

   e.g.)

   ```py
   # management/commands/create_portal_metadata.py
   import logging
   import os
   from argparse import BooleanOptionalAction

   from django.conf import settings
   from django.core.management.base import BaseCommand

   from scpca_portal.models import ComputedFile, Library

   logger = logging.getLogger()
   logger.setLevel(logging.INFO)


   class Command(BaseCommand):
       help = """
       Creates the computed file instance and a zip for portal-wide metadata,
       saves the instance to the database, and uploads the zip to S3 bucket."""

       @staticmethod
       def clean_up_output_data():
           """Cleans up the output TSV file after processing the computed file"""
           file_path = ComputedFile.OUTPUT_PORTAL_METADATA_FILE_PATH
           if os.path.exists(file_path):
               logger.info("Cleaning up output data")
               os.remove(file_path)
           else:
               logger.info(f"No '{file_path}' exists")

       def add_arguments(self, parser):
           parser.add_argument(
               "--clean-up-output-data", action=BooleanOptionalAction, default=settings.PRODUCTION
           )

       def handle(self, *args, **kwargs):
           self.create_portal_metadata_file(**kwargs)

       def create_portal_metadata_file(self, **kwargs):
           logger.info("Creating the portal-wide metadata computed file")
           # Gather the workflow versions for the computed file
           # (where these values come from is an assumption and may change)
           workflow_versions = Library.objects.values_list(
               "workflow_version", flat=True
           ).distinct()
           computed_file = ComputedFile.get_portal_metadata_file(workflow_versions)
           # Make sure it exists before processing
           if computed_file:
               logger.info("Saving to the database and uploading to S3")
               computed_file.process_computed_file(True, True)
               if kwargs["clean_up_output_data"]:
                   self.clean_up_output_data()
   ```

   (**NOTE:** Using buffers for both the readme and the TSV eliminates the need for `clean_up_output_data`, so it may be removed later.)

4. Register the management command as `create_portal_metadata` in `scpca-portal/bin/sportal`.

   e.g.)

   ```sh
   # List of available commands.
   commands = {
       "create-portal-metadata": run_api.format("./manage.py create_portal_metadata {}")
   }
   ```
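For context on the `process_computed_file` call in step 3, here is a minimal sketch of the behavior the command relies on: saving the instance to the database and uploading the zip to the S3 bucket. This is not the actual implementation (which is slated to be broken up, per the note below); the parameter names and the boto3-based upload are assumptions for illustration:

```py
# Hypothetical sketch of ComputedFile.process_computed_file (not the actual code);
# the `update_s3` and `clean_up_local` parameter names are assumptions.
import boto3


def process_computed_file(self, update_s3, clean_up_local):
    """Saves the computed file to the database and optionally uploads its zip to S3."""
    # Persist the instance so the portal can serve the download metadata
    self.save()

    if update_s3:
        # Upload the local zip archive to the configured S3 bucket
        s3 = boto3.client("s3")
        s3.upload_file(str(self.zip_file_path), self.s3_bucket, self.s3_key)

    if clean_up_local:
        # Remove the local zip once it has been uploaded
        self.zip_file_path.unlink(missing_ok=True)
```

Once registered in `scpca-portal/bin/sportal`, the end-to-end flow can then be exercised locally with something like `./bin/sportal create-portal-metadata` (exact invocation assumed).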

Solution or next step

We should break up the expected steps above into three stacked PRs.

For the data saving/purging workflow, we'll file separate issues.

nozomione commented 2 months ago

> [!NOTE]
> This temporary comment will be deleted later:

Current implementation details match this commit (`ComputedFile::get_project_file` in `feature/remove-file-mappings`) but will be adjusted later based on upcoming updates (e.g., using buffers instead of output data files to write and generate downloadable files, breaking up the `ComputedFile::process_computed_file` method).

e.g.) By using a buffer, the content of the portal-wide metadata readme file may be generated as follows (this is not a concrete implementation, just the concept):

```py
from io import StringIO

from django.template.loader import render_to_string

from scpca_portal import utils
from scpca_portal.models import ComputedFile, Project


def get_portal_metadata_readme():
    """Generates the portal-wide metadata only README content for zipping"""
    readme_buffer = StringIO()
    readme_buffer.write(
        render_to_string(
            ComputedFile.README_TEMPLATE_METADATA_PATH,
            context={
                "date": utils.get_today_string(),
                "projects": Project.objects.filter(additional_restrictions__isnull=False),
            },
        ).strip()
    )
    # Reset the stream position (only needed when reading via read();
    # getvalue() below returns the full contents regardless of position)
    readme_buffer.seek(0)
    # Return the content of the readme file
    return readme_buffer.getvalue()
```

The `ComputedFile::get_portal_metadata_file` above temporarily writes the content of the readme file using this example method to illustrate the use of a buffer.
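Similarly, once buffers are used for both files, the TSV content could be generated in memory as well, which is what would make `clean_up_output_data` unnecessary. A minimal sketch, assuming the filtered metadata rows are plain dicts keyed by the common sort-order columns (the helper name `get_portal_metadata_tsv` is hypothetical):

```py
import csv
from io import StringIO


def get_portal_metadata_tsv(metadata_dicts, field_names):
    """Generates the portal-wide metadata TSV content in memory for zipping.

    NOTE: hypothetical helper; mirrors what `metadata_file.write_metadata_dicts`
    writes to disk, but returns the content as a string instead.
    """
    tsv_buffer = StringIO()
    writer = csv.DictWriter(tsv_buffer, fieldnames=field_names, delimiter="\t")
    writer.writeheader()
    writer.writerows(metadata_dicts)
    # getvalue() returns the full buffer contents regardless of stream position
    return tsv_buffer.getvalue()
```

With both `README.md` and `metadata.tsv` available as strings, the zip archive can be assembled with two `zip_file.writestr` calls, leaving no output data files on disk to clean up.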

nozomione commented 1 month ago

All stacked PRs are merged into the feature branch, so closing this.