byu-dnasc / proto-smrtlink-share


Design functionality for identifying completed jobs associated with shared datasets #20

adknaupp opened this issue 6 months ago (status: Open)

adknaupp commented 6 months ago

Method

Identify analyses associated with shared datasets

1. Every 5-10 seconds, get all RUNNING jobs.

Use SmrtLinkClient.get_analysis_jobs_by_state() to get (presumably) only those jobs whose state is RUNNING. This should capture every job of interest, on the assumption that any job the app cares about will run for at least 5-10 seconds, i.e. at least one polling interval.
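
For illustration, a minimal sketch of this polling loop, assuming get_analysis_jobs_by_state() accepts a state string and returns a list of job records (its real signature may differ):

```python
import time

POLL_INTERVAL_SECONDS = 10  # within the proposed 5-10 second window

def watch_running_jobs(client, handle_jobs):
    """Periodically fetch all RUNNING analysis jobs from SMRT Link.

    `handle_jobs` is a callback implementing steps 2-4 below.
    """
    while True:
        # Assumption: the method takes the desired state as an argument.
        running_jobs = client.get_analysis_jobs_by_state("RUNNING")
        handle_jobs(running_jobs)
        time.sleep(POLL_INTERVAL_SECONDS)
```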

2. Ignore all but the "new" jobs

Each time the active jobs are fetched, most will already have been handled. To identify the new jobs, the app needs some way of determining which jobs are currently being polled: the 'new' jobs are exactly those not already being polled.

How to keep track of which jobs are being polled

???
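
The issue leaves this open. For the record, one low-tech possibility would be an in-memory set of job ids, diffed against each batch of RUNNING jobs (all names here are illustrative):

```python
polled_job_ids = set()  # ids of jobs already being polled

def find_new_jobs(running_jobs):
    """Return jobs not already being polled, and mark them as polled."""
    # Assumption: each job record exposes an `id` attribute.
    new_jobs = [job for job in running_jobs if job.id not in polled_job_ids]
    polled_job_ids.update(job.id for job in new_jobs)
    return new_jobs
```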

3. Further, ignore any job not associated with a shared dataset

A new table needs to be added to the peewee database, and the project table should be modified to remove its datasets column. The new table will instead keep track of shared datasets and the project each is associated with: one column stores a dataset UUID and the other a project id (see the sketch below).
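
A sketch of what that table might look like in peewee, assuming an existing Project model and database object (both shown here with placeholder definitions):

```python
from peewee import Model, SqliteDatabase, CharField, UUIDField, ForeignKeyField

db = SqliteDatabase("app.db")  # placeholder for the app's database object

class Project(Model):
    # existing project table, minus the removed `datasets` column
    name = CharField()

    class Meta:
        database = db

class ProjectDataset(Model):
    """Tracks shared datasets and the project each is associated with."""
    dataset_uuid = UUIDField()                               # shared dataset's UUID
    project = ForeignKeyField(Project, backref="datasets")  # stores the project id

    class Meta:
        database = db
```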

4. Start polling each remaining job until it changes state.

Use SmrtLinkClient.poll_for_successful_job() to poll until the job changes state. Once the function returns, check whether the final state was SUCCESSFUL.
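
A hedged sketch of that last step, assuming poll_for_successful_job() blocks until the job leaves RUNNING and then returns the final job record (the real method may instead raise on failure):

```python
def handle_finished_job(client, job_id, stage_job_files):
    """Block until `job_id` changes state, then stage its files if SUCCESSFUL.

    `stage_job_files` is a hypothetical callback that performs the staging.
    """
    # Assumption: returns the final job record once the state changes.
    job = client.poll_for_successful_job(job_id)
    if job.state == "SUCCESSFUL":
        stage_job_files(job)
    # otherwise the job failed or was aborted; there is nothing to stage
```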

adknaupp commented 5 months ago

Reconsider

I'm no longer planning to implement the functionality above, i.e. polling for jobs to stage. Instead, jobs will be identified while handling project requests. This makes it much easier to identify which jobs are part of a project. From the user's perspective, the only change is that they will either have to wait until all analyses are complete before adding a dataset to the project, or they will have to "save" an existing project to trigger a request that causes any new jobs to be identified.

Staging folder assignment

Job files should be staged in their own subfolder of the associated dataset's folder. This means that even if some files generated by a job associated with a parent dataset relate only to a given child dataset, those files will be found within the parent dataset's folder, not the child dataset's.
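
Purely to illustrate the proposed layout (the staging root and naming scheme here are made up):

```python
from pathlib import Path

STAGING_ROOT = Path("/staging")  # assumed root of the staging area

def job_staging_dir(parent_dataset_uuid: str, job_id: int) -> Path:
    """Each job gets its own subfolder of its parent dataset's folder."""
    return STAGING_ROOT / parent_dataset_uuid / f"job-{job_id}"
```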