SACGF / variantgrid

VariantGrid public repo

Seqauto API #76

Open davmlaw opened 3 years ago

davmlaw commented 3 years ago

VariantGrid Sequencing API

Goal: The TAU diagnostic team makes API calls to upload VCFs and other sequencing-related data to VariantGrid at the end of their pipelines.

Current behavior: VariantGrid runs scheduled (and manually triggered) scans that use “find” to search the TAU disks for sequencing files and load them into the system.

Motivating problems w/current:

  - “Find” causes high server/network/disk usage
  - Lag between pipeline finishing and VG finding new files (requires manual button press or waiting for next scheduled scan)
  - VG needs to have filesystem access (cloud systems have no sequencing file automation)
  - VG developers need to know TAU details (eg paths to different VCF files used by different kits/panels) and update code

----------

Plan:

  1. DML creates idempotent[1] API to do everything sequencing related (add sequencing runs, sample sheets, bams, link VCFs to sequencing runs etc).
  2. DML writes “variantgrid_api” client library (pip package) that simplifies making API calls [2]
  3. DML re-implements the “sequencing scanning” as a script external to VariantGrid (running as a scheduled task on the VM). This uses my existing code implementations of “how to find a VCF file from a haem sequencing run”.
  4. DML turns off scheduled VG sequencing scans (will still leave the manual button there). At the end of this step, there should be no visible behavioural change to users (stuff still loads as it did).
  5. DML sends code from (3) above to TAU diagnostic team
  6. TAU team takes the example code and modifies it to fit in their pipeline. Eg as the very last step of a successful pipeline, they call the code to upload files and link them etc. This can use TAU's own code for eg “how to find a VCF file from a haem sequencing run”, meaning they now have responsibility for that.
  7. TAU team tests their code against the test server
  8. TAU team uses their code in prod, and the auto scan script from (3) is disabled. If anything doesn’t work we can just manually re-scan or turn (3) back on. VG internal scan code is removed from the latest code (VG4); it will be left in VG3 but disabled. This reduces complexity in our system (we don’t need to keep track of directory structure any more etc).

[1] An idempotent API is one where the operation will have the same result no matter how many times it's applied. This is so we can leave the scanning on, and have pipelines run multiple times, without having to write special purpose client side code (eg to handle errors if sequencing run already added by auto scan or previous pipeline runs)
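
A minimal sketch of that idempotent behaviour, using an in-memory dictionary in place of the real database. `SequencingRunStore`, `add_sequencing_run`, and the run name and path are hypothetical, made up for illustration only; they are not taken from VariantGrid. The point is that a repeated call is a no-op and does not raise an error.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class SequencingRunStore:
    """Hypothetical in-memory stand-in for the server-side store."""
    runs: Dict[str, dict] = field(default_factory=dict)

    def add_sequencing_run(self, name: str, path: str) -> Tuple[dict, bool]:
        """Get-or-create keyed on the run name.

        Returns (record, created). Repeating the call with the same
        arguments leaves the store unchanged and is not an error.
        """
        existing = self.runs.get(name)
        if existing is not None:
            return existing, False
        record = {"name": name, "path": path}
        self.runs[name] = record
        return record, True


store = SequencingRunStore()
# First call (eg from the auto scan) creates the run...
first, created_first = store.add_sequencing_run("run_001", "/data/run_001")
# ...a second call (eg from a re-run pipeline) returns the same record.
second, created_second = store.add_sequencing_run("run_001", "/data/run_001")
```

With this shape the client never needs an "already exists" error path, which is what lets the scan and the pipelines overlap safely.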

[2] API example:

            from variantgrid_api import VariantGridAPI

            vg_api = VariantGridAPI()
            # Upload the VCF, then attach it to its sequencing run
            uploaded_vcf_id = vg_api.upload_vcf(combined_vcf_filename)
            vg_api.link_vcf_to_sequencing_run(sequencing_run, sample_sheet_hash, uploaded_vcf_id)

We already have a PyPI package: https://pypi.org/project/variantgrid_api/0.2.0/

The only thing that uses this is Maptastic, which is an obsolete version of omni-importer

We have an old repo: https://github.com/SACGF/vg_api

Created new repo: https://github.com/SACGF/variantgrid_api

The PyPI variantgrid_api should be minimalistic - the details of loading SA Path stuff will be done in other code and put in SA Path repo

davmlaw commented 1 month ago

Talked to James, and API keys are probably the way to go. Docs to read about this:

https://www.django-rest-framework.org/api-guide/authentication/
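
If the API ends up using Django REST framework's built-in `TokenAuthentication` (one of the schemes on that page), the client sends the key in an `Authorization: Token <key>` header. A minimal sketch using only the standard library is below. The server URL, endpoint path, payload, and key are made-up placeholders, and other API-key schemes use a different header format.

```python
import json
import urllib.request


def build_authenticated_request(url: str, api_token: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST carrying a DRF token header."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # Header format expected by DRF's TokenAuthentication
            "Authorization": f"Token {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder URL and token, for illustration only
request = build_authenticated_request(
    "https://variantgrid.example.org/api/sequencing_run/",
    api_token="not-a-real-token",
    payload={"name": "run_001"},
)
# urllib.request.urlopen(request) would send it; omitted here.
```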

~working in branch "seqauto_api"~ - pushed phase 1 to master as I may have to do other stuff for a while

Plan for step 1 - build API:

Plan for step 2 and 3 - write client code

The client calling code and libraries to load objects will be kept in SA Path repo

davmlaw commented 1 week ago

OK been working out how to best do things:

davmlaw commented 1 week ago

Can use SlugRelatedField to lookup via natural keys

davmlaw commented 1 week ago

SampleGeneList needs a sample, and VCF may not be loaded by the time we load the QC Gene list. We just try QCGeneList.create_and_assign_sample_gene_list then handle the SampleFromSequencingSample.DoesNotExist and carry on if VCF not loaded yet

Then in VCF import we run: upload.vcf.vcf_import.link_samples_and_vcfs_to_sequencing which attempts it again

This should probably be a signal that we handle in seqauto app but can do that later

For the API - we'll just make ActiveSampleGeneList the latest one
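
Rough sketch of the "try now, retry at VCF import" flow described in this comment, with plain Python sets/dicts standing in for the Django models. The function names echo `create_and_assign_sample_gene_list` and `link_samples_and_vcfs_to_sequencing`, but the bodies, the `SampleDoesNotExist` exception, `load_qc_gene_list`, and the sample/gene names are hypothetical stand-ins, not the real implementation.

```python
class SampleDoesNotExist(Exception):
    """Stand-in for SampleFromSequencingSample.DoesNotExist."""


# Hypothetical in-memory state replacing the database
loaded_samples = set()      # samples whose VCF has been imported
pending_gene_lists = {}     # sample name -> QC gene list received via the API
sample_gene_lists = {}      # sample name -> gene list linked to the sample


def create_and_assign_sample_gene_list(sample_name, gene_list):
    """Link a gene list to its sample, or raise if the sample is not loaded yet."""
    if sample_name not in loaded_samples:
        raise SampleDoesNotExist(sample_name)
    sample_gene_lists[sample_name] = gene_list


def load_qc_gene_list(sample_name, gene_list):
    """API path: store the gene list and try to link it, tolerating a missing sample."""
    pending_gene_lists[sample_name] = gene_list
    try:
        create_and_assign_sample_gene_list(sample_name, gene_list)
    except SampleDoesNotExist:
        pass  # VCF not loaded yet; the VCF import will attempt it again


def link_samples_and_vcfs_to_sequencing(sample_names):
    """VCF import path: register the samples, then retry any stored gene lists."""
    loaded_samples.update(sample_names)
    for sample_name in sample_names:
        gene_list = pending_gene_lists.get(sample_name)
        if gene_list is not None:
            create_and_assign_sample_gene_list(sample_name, gene_list)


load_qc_gene_list("sample_A", ["GENE1", "GENE2"])   # before the VCF: nothing linked
linked_before = "sample_A" in sample_gene_lists
link_samples_and_vcfs_to_sequencing(["sample_A"])   # VCF import retries the link
linked_after = "sample_A" in sample_gene_lists
```

Either arrival order ends in the same linked state, which keeps the API call order-independent for the pipeline.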


TODO: