AlexsLemonade / refinebio

Refine.bio harmonizes petabytes of publicly available biological data into ready-to-use datasets for cancer researchers and AI/ML scientists.
https://www.refine.bio/

Given a ComputedFile, is there a way to know what processor job generated it? #1868

Open arielsvn opened 4 years ago

arielsvn commented 4 years ago

Context

Samples can have many ComputedFiles, and each ComputedFile is tied to a ComputationalResult.

Processor jobs create instances of ComputationalResult and ComputedFile when samples are processed.

Problem or idea

We have some samples with multiple computed files, but for each computed file it's not obvious which original files were used to generate it. There's also no way to know which processor job generated it.

For example, sample GSM248431 has multiple computed files.

data_refinery=> select * from computed_files where id in (select computed_file_id from sample_computed_file_associations where sample_id in (select id from samples where accession_code='GSM248431'));
   id    |         filename         |                          absolute_file_path                          | size_in_bytes |                   sha1                   | is_smashable | is_qc | is_qn_target |           s3_bucket            |                      s3_key                       | is_public |          created_at           |         last_modified         | result_id | compendia_organism_id | compendia_version | is_compendia | quant_sf_only | svd_algorithm 
---------+--------------------------+----------------------------------------------------------------------+---------------+------------------------------------------+--------------+-------+--------------+--------------------------------+---------------------------------------------------+-----------+-------------------------------+-------------------------------+-----------+-----------------------+-------------------+--------------+---------------+---------------
 1043270 | GSM248431_GE1002_2_2.PCL | /home/user/data_store/processor_job_1615745/GSM248431_GE1002_2_2.PCL |        143873 | 619f543f7b5ed39180e0e71a69f37b9ad6e11bd8 | t            | f     | f            | data-refinery-s3-circleci-prod | y3wchb8nev4iyz4c6s9ukv0w_GSM248431_GE1002_2_2.PCL | t         | 2018-12-20 14:44:10.890171+00 | 2018-12-20 14:44:11.139879+00 |    927601 |                       |                   | f            | f             | NONE
 1043658 | GSM248431_GE1001_3.PCL   | /home/user/data_store/processor_job_1616335/GSM248431_GE1001_3.PCL   |        143950 | 2688b7951ec72ab49540fcfb4da44d2683715c68 | t            | f     | f            | data-refinery-s3-circleci-prod | kk01w6jrhdp9o0f31mt1pnje_GSM248431_GE1001_3.PCL   | t         | 2018-12-20 14:47:08.301983+00 | 2018-12-20 14:47:12.167675+00 |    927989 |                       |                   | f            | f             | NONE
 1043633 | GSM248431_GE1003_3.PCL   | /home/user/data_store/processor_job_1616313/GSM248431_GE1003_3.PCL   |        143963 | 925810e5b65bdb8e10079e52da93c0c588833589 | t            | f     | f            | data-refinery-s3-circleci-prod | puu08rbp3hoq9aspikktocmd_GSM248431_GE1003_3.PCL   | t         | 2018-12-20 14:47:01.282186+00 | 2018-12-20 14:47:01.845971+00 |    927964 |                       |                   | f            | f             | NONE
 1043207 | GSM248431_GE1004_3.PCL   | /home/user/data_store/processor_job_1615679/GSM248431_GE1004_3.PCL   |        143930 | 4cf20fad148a0632a02df608b3246db9f0992797 | t            | f     | f            | data-refinery-s3-circleci-prod | cxcqnjnuorgarw3kx6www1i0_GSM248431_GE1004_3.PCL   | t         | 2018-12-20 14:43:32.654072+00 | 2018-12-20 14:43:46.704009+00 |    927538 |                       |                   | f            | f             | NONE
 1043592 | GSM248431_GE1002_2_1.PCL | /home/user/data_store/processor_job_1616233/GSM248431_GE1002_2_1.PCL |        143841 | 7fe7f2702e1dca767345906d81a33a07b9031484 | t            | f     | f            | data-refinery-s3-circleci-prod | rcou7jcua1kb9kp2a7ryqzu7_GSM248431_GE1002_2_1.PCL | t         | 2018-12-20 14:46:34.442139+00 | 2018-12-20 14:46:36.158067+00 |    927923 |                       |                   | f            | f             | NONE
 1043205 | GSM248431_GE1003_1.PCL   | /home/user/data_store/processor_job_1615657/GSM248431_GE1003_1.PCL   |        143799 | ad9a3ce7580aca4acb284c531587d8b4ecc25f7a | t            | f     | f            | data-refinery-s3-circleci-prod | 5xnifq0jqtfkxpb9k2mds8ux_GSM248431_GE1003_1.PCL   | t         | 2018-12-20 14:43:32.579096+00 | 2018-12-20 14:43:46.693395+00 |    927536 |                       |                   | f            | f             | NONE
 1043449 | GSM248431_GE1002_1.PCL   | /home/user/data_store/processor_job_1615986/GSM248431_GE1002_1.PCL   |        143796 | 68314b156905068ef1649ed78bd5f7ed14f5ed14 | t            | f     | f            | data-refinery-s3-circleci-prod | ffsdc752f6gyfh6dn31ibow9_GSM248431_GE1002_1.PCL   | t         | 2018-12-20 14:45:31.363883+00 | 2018-12-20 14:45:40.448563+00 |    927780 |                       |                   | f            | f             | NONE
 1044554 | GSM248431_GE1004_1.PCL   | /home/user/data_store/processor_job_1617375/GSM248431_GE1004_1.PCL   |        143816 | 3e72fc3a4f1b4818bc212abc443758511c183b6c | t            | f     | f            | data-refinery-s3-circleci-prod | e1b6si5cgpi02r9n0nyjuwry_GSM248431_GE1004_1.PCL   | t         | 2018-12-20 15:03:08.943553+00 | 2018-12-20 15:03:21.513281+00 |    928887 |                       |                   | f            | f             | NONE
 1043195 | GSM248431_GE1002_3.PCL   | /home/user/data_store/processor_job_1615673/GSM248431_GE1002_3.PCL   |        143916 | 0df30bf5d76da1a2af1a5d4b43381f220a784fc6 | t            | f     | f            | data-refinery-s3-circleci-prod | bm4x03ivvrxt4cd7dzjfsasu_GSM248431_GE1002_3.PCL   | t         | 2018-12-20 14:43:26.870718+00 | 2018-12-20 14:43:42.702274+00 |    927526 |                       |                   | f            | f             | NONE
(9 rows)

And multiple original files:

data_refinery=> select id, filename, is_archive, source_filename from original_files where id in (select original_file_id from original_file_sample_associations where sample_id in (select id from samples where accession_code='GSM248431'));
   id    |         filename         | is_archive |       source_filename       
---------+--------------------------+------------+-----------------------------
 1482345 | GSM248431_GE1002_3.CEL   | f          | GSM248431_GE1002_3.CEL.gz
 1419569 |                          | t          | GSM248431_GE1002_2_1.CEL.gz
 1419302 |                          | t          | GSM248431_GE1002_1.CEL.gz
 1482325 | GSM248431_GE1003_1.CEL   | f          | GSM248431_GE1003_1.CEL.gz
 1482352 | GSM248431_GE1004_3.CEL   | f          | GSM248431_GE1004_3.CEL.gz
 1420599 |                          | t          | GSM248431_GE1003_3.CEL.gz
 1483092 | GSM248431_GE1003_3.CEL   | f          | GSM248431_GE1003_3.CEL.gz
 1483111 | GSM248431_GE1001_3.CEL   | f          | GSM248431_GE1001_3.CEL.gz
 1419866 |                          | t          | GSM248431_GE1002_2_2.CEL.gz
 1484300 | GSM248431_GE1004_1.CEL   | f          | GSM248431_GE1004_1.CEL.gz
 1420090 |                          | t          | GSM248431_GE1002_3.CEL.gz
 1419042 |                          | t          | GSM248431_GE1001_3.CEL.gz
 1483015 | GSM248431_GE1002_2_1.CEL | f          | GSM248431_GE1002_2_1.CEL.gz
 1482723 | GSM248431_GE1002_1.CEL   | f          | GSM248431_GE1002_1.CEL.gz
 1420814 |                          | t          | GSM248431_GE1004_1.CEL.gz
 1482431 | GSM248431_GE1002_2_2.CEL | f          | GSM248431_GE1002_2_2.CEL.gz
 1421034 |                          | t          | GSM248431_GE1004_3.CEL.gz
 1420360 |                          | t          | GSM248431_GE1003_1.CEL.gz
(18 rows)
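
Until such a link exists, the two result sets above suggest a stopgap. This is a minimal sketch, assuming that the `processor_job_<id>` directory in `absolute_file_path` really is the id of the job that wrote the file, and that a computed file shares its filename stem with the original file it came from. Neither assumption is confirmed in this thread, and the helper names `job_id_from_path` and `likely_originals` are hypothetical.

```python
import re
from pathlib import PurePosixPath
from typing import List, Optional

# Assumption: job working directories are named "processor_job_<id>",
# as in the absolute_file_path values shown in the first query.
JOB_DIR_PATTERN = re.compile(r"processor_job_(\d+)")


def job_id_from_path(absolute_file_path: str) -> Optional[int]:
    """Return the job id embedded in a path component, or None if absent."""
    for part in PurePosixPath(absolute_file_path).parts:
        match = JOB_DIR_PATTERN.fullmatch(part)
        if match:
            return int(match.group(1))
    return None


def likely_originals(computed_filename: str, original_filenames: List[str]) -> List[str]:
    """Return original files whose name stem matches the computed file's stem."""
    stem = PurePosixPath(computed_filename).stem
    # Archive rows have an empty filename, so skip empty strings.
    return [
        name
        for name in original_filenames
        if name and PurePosixPath(name).stem == stem
    ]


path = "/home/user/data_store/processor_job_1615745/GSM248431_GE1002_2_2.PCL"
originals = ["GSM248431_GE1002_2_2.CEL", "GSM248431_GE1002_2_1.CEL", ""]

print(job_id_from_path(path))
print(likely_originals("GSM248431_GE1002_2_2.PCL", originals))
```

Parsing paths like this is fragile: it breaks as soon as the directory layout or file naming changes, which is part of why an explicit relation is preferable.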

Solution or next step

I think it makes sense to add a new relation between ComputationalResult and ProcessorJob.

Tagging @kurtwheeler for further discussion.

kurtwheeler commented 4 years ago

I think you're right! I think we should be able to tell which ProcessorJob generated a ComputationalResult. A ComputationalResult will never have more than one ProcessorJob associated with it, so it should just be a processor_job_id property on the ComputationalResult model, one that we'll probably want to not expose via the API? (I think at the moment we aren't exposing anything about jobs via the API.)
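
To illustrate the shape of that relation, here is a rough sketch using an in-memory SQLite database rather than the project's actual models. Only `computed_files` and its `result_id` column come from the query output in this issue. The table names `computational_results` and `processor_jobs`, the helper `processor_job_for`, and the pairing of result 927601 with job 1615745 (inferred from the file path) are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Assumed table names, except computed_files and its result_id column.
    CREATE TABLE processor_jobs (id INTEGER PRIMARY KEY);
    CREATE TABLE computational_results (
        id INTEGER PRIMARY KEY,
        -- Proposed column: at most one job per result, NULL for older rows.
        processor_job_id INTEGER REFERENCES processor_jobs (id)
    );
    CREATE TABLE computed_files (
        id INTEGER PRIMARY KEY,
        filename TEXT NOT NULL,
        result_id INTEGER REFERENCES computational_results (id)
    );
    """
)

# Sample rows; the result-to-job pairing is inferred from the file path.
conn.execute("INSERT INTO processor_jobs (id) VALUES (1615745)")
conn.execute(
    "INSERT INTO computational_results (id, processor_job_id) VALUES (927601, 1615745)"
)
conn.execute(
    "INSERT INTO computed_files (id, filename, result_id) "
    "VALUES (1043270, 'GSM248431_GE1002_2_2.PCL', 927601)"
)


def processor_job_for(computed_file_id):
    """Follow computed file -> result -> job; None if any link is missing."""
    row = conn.execute(
        """
        SELECT cr.processor_job_id
        FROM computed_files AS cf
        JOIN computational_results AS cr ON cr.id = cf.result_id
        WHERE cf.id = ?
        """,
        (computed_file_id,),
    ).fetchone()
    return row[0] if row else None


print(processor_job_for(1043270))
```

Because the column sits on the result rather than on the file, every computed file belonging to the same result resolves to the same job with a single join.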

arielsvn commented 4 years ago

> one that we'll probably want to not expose via the API? (I think at the moment we aren't exposing anything about jobs via the API.)

Actually, we already have endpoints that expose all the jobs: /jobs/downloader and /jobs/processor. I started using them to list the jobs for each sample at https://github.com/AlexsLemonade/refinebio-frontend/pull/784. Is there any reason not to expose processor_job_id in the API? It would be nice to be able to inspect the jobs associated with a ComputationalResult via the API.

kurtwheeler commented 4 years ago

Nope, no reason at all! I just thought we were trying to hide those deets from our users but honestly I was hoping that we'd eventually change that anyway :D