riverma commented 1 year ago

Checked for duplicates

Yes - I've already checked

Describe the need

We have a need for recommendations for choice of packaging host (GH Packages, DockerHub, etc.), including automation architecture / solutions for inclusion of dependencies into builds. (+1'd by @mike-gangl). One of the things that would be great here is specific choice of, and details of how to interact with packaging hosts / managers.

riverma commented 1 year ago

@jpl-jengelke recommends incorporating this into the current CI guide

riverma commented 1 year ago

+1'd by @ramesh-maddegoda, @drewm-jpl, @nttoole, @kgrimes2, @hookhua, @carlynlee

riverma commented 1 year ago

Recommendations

Packages (Python)

Suggested Repository:
- PyPI
Suggested Naming convention
- nasa-[project org]-[module name] [semantic version ID]
Known benefits
- Free
Known constraints
- 60MB file size limit, subject to increase by PyPi-support
  Packages (Java)
Suggested Repository:
- Maven Central
Suggested Naming convention
- gov.nasa.[project org].[module name]
Known benefits
- Free
- No specific constraints to package size or volume
Known constraints
- N/A
  Packages (NodeJS)
Suggested Repository:
- NPM Registry
Suggested Naming convention
- @nasa-[project org]/[module name]
Known benefits
- Free
- No specific constraints to package size or volume
Known constraints
- Current NASA-branded packages vary in terms of account ownership and naming convention, potentially causing confusion
  Packages (Miscellaneous)
Repository:
- Store in you Cloud-based VCS of choice (e.g. GitHub Releases, GitLab Package Registry)
Suggested Naming convention
- Semantic versioning: https://semver.org
Known benefits
- Free
- Unlimited number of packages
Known constraints
- Typically < 2GB individual package limit

Infrastructure Deployments (Terraform)

Repository:
- Terraform Registry
Suggested Naming convention
- terraform-nasa-[project org]-modules/[module-name]
Known benefits
- Free
Known constraints
- No officially sponsored NASA namespace currently exists, potentially causing confusion
- 1000 document limit per account
- 500KB max file size per document

Test Data (Small < 2GB)

Repository:
- Create a new repository in you Cloud-based VCS of choice (e.g. GitHub Releases, GitLab Package Registry)
Suggested Naming convention
- [project org]-[project module]-test-dataset
Known benefits
- Free
Known constraints
- Typically < 2GB dataset limit

Test Data (Large: 2GB - 100GB)

Repository:
- Amazon S3 Pre-signed URL
Suggested Naming convention
- N/A
Known benefits
- Scalable storage
- Authentication to rate-limit bandwidth usage
Known constraints
- Non-free

Test Data (Large > 100GB)

Repository:
- Earth-centric data: EOSDIS Distributed Active Archive Centers (DAAC)
- Planetary data: Planetary Data System
Suggested Naming convention
- Determined by DAAC
Known benefits
- Support for extremely large data volumes
Known constraints
- Requires project agreement and sponsorship

Containers (Archival / Public)

Repository:
- Store in you Cloud-based VCS of choice (e.g. GitHub Packages, GitLab Package Registry)
Suggested Naming convention
- nasa-[project org]-[project module]:[tag]
Known benefits
- Free
- No size or bandwidth limits known for public repositories
Known constraints
- Usage limitations on private repositories
- High-latency for on-demand, runtime applications

Containers (Runtime / Private)

Repository:
- Amazon Elastic Container Registry (ECR)
Suggested Naming convention
- nasa-[project org]-[project module]:[tag]
Known benefits
- Private repositories
- Low-latency pulls for runtime usage, especially in Amazon Web Services (AWS)
Known constraints
- Subject to pricing

drewm-jpl commented 1 year ago

Hi @riverma,

Regarding repositories for test data, it might be worth looking at the data repository guidance provided by Scientific Data - Nature (https://www.nature.com/sdata/policies/repositories).

In particular, their list of recommended generalist data repositories may be pertinent:

galenatjpl commented 1 year ago

@riverma it looks like you have done a great job defining the repositories and formats that I would expect here. I'm mostly familiar with Maven central, and PyPI from building things in the past.

I think one thing to consider (which may be tangential to this ticket), is how/when do we push artifacts to these places? We have sort of thought about some notional methodologies related to this (see the blue part of this diagram).

My thoughts about test data are that: 1). We will be hopefully centralizing on a single representative "golden dataset" that exercises the capabilities we care to test. 2). As such, we should probably just store that dataset in S3, and be done with it. We aren't going to be storing gobs and gobs of data, but we just need that representative "starter" data. Any data produced as a result of SPS runs can be transitory, and deleted relatively quickly after verification. In other words, we aren't an actual mission, and won't had the Life Of Mission data requirements and associated costs. If we need to store several gigabytes of data on S3, it's not going to break the bank.

That being said, I haven't taken a look at the repositories @drewm-jpl mentioned. I do know that we are all familiar with AWS/S3 though..

galenatjpl commented 1 year ago

Also, you might want to take a quick look at AWS CodeArtifact, but perhaps that wouldn't be the best solution to have work with a fully open-source building process. Or maybe it would work? Other public services like maven and pypi might be better, but just pointing out CodeArtifact, in case it wasn't looked at as part of this eval.

riverma commented 1 year ago

From @mike-gangl: see https://blog.pypi.org/posts/2023-04-23-introducing-pypi-organizations/

NASA-AMMOS / slim

[New Process Improvement Need]: Artifact packaging hosting and dependency management #69

Checked for duplicates

Category

Describe the need

Recommendations

Packages (Python)

Packages (Java)

Packages (NodeJS)

Packages (Miscellaneous)

Infrastructure Deployments (Terraform)

Test Data (Small < 2GB)

Test Data (Large: 2GB - 100GB)

Test Data (Large > 100GB)

Containers (Archival / Public)

Containers (Runtime / Private)