OpenConceptLab / ocl_issues

Issues for all OCL repos. NOTE: Install ZenHub Browser Extension and request access to the OCL Roadmap board to view all issues and to contribute

Add support for importing NPM packages #1791

Open rkorytkowski opened 8 months ago

rkorytkowski commented 8 months ago

Deliverables:

  1. Endpoint for uploading and importing a package for authenticated users under a selected namespace (org or user) (API only) (must for MVP)
  2. Endpoint for fetching a package from the NPM registry and importing it for authenticated users under a selected namespace (org or user) (API only) (must for MVP)
  3. UI for uploading or fetching a package and importing (UI only) (must for MVP)
  4. Import all currently supported resources i.e. CodeSystem, ValueSet, ConceptMap (must for MVP)
  5. Overwrite existing resources if both id and version match in the same namespace (must for MVP)
  6. Create new versions of resources if only id matches (must for MVP)
  7. A summary of imported packages and created/updated resources (nice to have for MVP)
  8. A confirmation dialog, shown to the user before importing, that lists all package dependencies that need to be imported (nice to have for MVP)
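The overwrite and versioning rules in deliverables 5 and 6 can be sketched as a small decision function. This is only an illustration, not the OCL implementation: `ResourceKey`, `decide_action`, and the action names are hypothetical, and the `"create"` branch for a resource with no matching id is an assumption the deliverables do not spell out.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class ResourceKey:
    """Hypothetical identity of an imported resource."""
    namespace: str            # org or user that owns the resource
    resource_id: str          # e.g. the CodeSystem, ValueSet, or ConceptMap id
    version: Optional[str]    # version declared by the resource, if any


def decide_action(existing: Set[ResourceKey], incoming: ResourceKey) -> str:
    """Return 'overwrite', 'new_version', or 'create' for an incoming resource."""
    # Deliverable 5: same namespace, id, and version -> overwrite in place.
    if incoming in existing:
        return "overwrite"
    # Deliverable 6: same namespace and id, different version -> new version.
    same_id = any(
        key.namespace == incoming.namespace
        and key.resource_id == incoming.resource_id
        for key in existing
    )
    # Assumption: a resource with no match at all is simply created.
    return "new_version" if same_id else "create"


existing = {ResourceKey("org-a", "cs-1", "1.0")}
actions = [
    decide_action(existing, ResourceKey("org-a", "cs-1", "1.0")),
    decide_action(existing, ResourceKey("org-a", "cs-1", "2.0")),
    decide_action(existing, ResourceKey("org-b", "cs-1", "1.0")),
]
```

Note that a matching id in a different namespace does not count as a match, since both rules are scoped to the same namespace.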

paynejd commented 3 months ago

@rkorytkowski to add list of work completed to this ticket (since NPM import ended up requiring many improvements across OCL)

rkorytkowski commented 3 months ago

As part of this issue I did an overhaul of the import backend to support big imports and resolve import issues:

  1. No longer store import files in RAM; keep them in temporary files on disk instead. This includes having Celery workers fetch files from the remote location themselves instead of distributing them via Redis.
  2. Support tar, JSON and zip files. Read tar and zip files with on-the-fly extraction, with no separate extraction to disk, to save space on workers.
  3. Read JSON files using a streaming library so that only small portions of a file are loaded into memory.
  4. Distribute a single JSON file with multiple resources between parallel workers by using start and end index pointers into the file.
  5. Use Celery chains and groups for processing import files in order and in parallel. Order resources by type and import resources of the same type in parallel. Respect ordering for NPM package dependencies.
  6. Get rid of the main Celery task that monitored all parallel workers. The main task now completes as soon as all chains and groups are sent to Celery. This releases the memory used by the main worker and eliminates the issue of the main worker being terminated upon deployment/upgrade and losing the ability to continue tracking import progress and gathering results. It also makes it easy to continue an import after redeployments.
  7. Introduce a final task that gathers results from individual workers and stores them in the DB so they can be quickly retrieved.
  8. Introduce progress tracking that gathers progress from individual workers as they work through batches.
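
To make items 2 and 4 more concrete, here is a minimal sketch using only the Python standard library. It is not the OCL code: the function names, the demo archive layout, and the batch size are hypothetical. It also uses `json.load` on each archive member for brevity, whereas the actual import reads JSON with a streaming library as described in item 3.

```python
import io
import json
import tarfile
from itertools import islice


def iter_json_members(fileobj):
    """Yield (name, parsed JSON) for each .json member without extracting to disk."""
    # Mode "r|*" reads the archive as a forward-only stream and detects
    # compression automatically, so members are decoded on the fly.
    with tarfile.open(fileobj=fileobj, mode="r|*") as archive:
        for member in archive:
            if member.isfile() and member.name.endswith(".json"):
                yield member.name, json.load(archive.extractfile(member))


def index_ranges(total, batch_size):
    """Split `total` resources into half-open (start, end) ranges, one per worker."""
    return [(start, min(start + batch_size, total))
            for start in range(0, total, batch_size)]


def read_range(resources, start, end):
    """Return only the resources in [start, end) from an iterator of resources."""
    return list(islice(resources, start, end))


def build_demo_archive():
    """Create a small gzipped tar in memory (hypothetical package layout)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, body in [
            ("package/CodeSystem-a.json", {"resourceType": "CodeSystem", "id": "a"}),
            ("package/ValueSet-b.json", {"resourceType": "ValueSet", "id": "b"}),
        ]:
            data = json.dumps(body).encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    return buffer


members = list(iter_json_members(build_demo_archive()))
ranges = index_ranges(total=10, batch_size=4)
```

Each `(start, end)` pair could then be handed to a separate worker, which reads the same file and processes only its own slice, for example `read_range(iter(resources), 4, 8)`.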