AlexsLemonade / refinebio

Refine.bio harmonizes petabytes of publicly available biological data into ready-to-use datasets for cancer researchers and AI/ML scientists.
https://www.refine.bio/

RNA-Seq experiments are being shown which haven't had tximport run at the experiment-level #965

Open kurtwheeler opened 5 years ago

kurtwheeler commented 5 years ago

Context

In #963 I said:

After discussing with @cansav09 we've identified that for experiment-level smash jobs we should be making sure to use files that came from the same tximport run.

Problem or idea

I noticed today that the number of Result objects in the database is fewer than the number shown in the frontend:

image

(Note the 422 total results)

data_refinery=> select count(*) from computed_files where filename='txi_out.RDS';
 count 
-------
   353
(1 row)

(Yes I know that's an odd query to get this count, but I'm currently working on a branch to make this a more explicit relationship.)

The fact that these numbers don't match doesn't worry me; in fact, I checked because I kinda expected it. In the frontend we display experiments that have at least 1 sample with is_processed=True. Therefore any experiment which has a sample belonging to more than one experiment can end up being displayed once one of those other experiments gets processed.

The problem comes in when someone adds these experiments to their dataset and then decides to smash at experiment level when tximport hasn't yet been run for that experiment. To make this clearer, here's an example:

Experiment A has Samples 1 and 2 and has had tximport run on it. Experiment B has Samples 3 and 4 and has had tximport run on it. Experiment C has Samples 1, 2, 3, 4, and 5 but Sample 5 hasn't been processed yet, so tximport has not yet been run for the full experiment.

The user decides he wants to download what data is available for Experiment C and chooses to smash at the experiment level.

Experiment C has samples 1, 2, 3, and 4 available, but they were produced in two different runs of tximport (two of them for Experiment A, two of them for Experiment B) and therefore it is probably not a good idea to combine them.
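
The A/B/C example can be sketched in a few lines of Python. This is only an illustration under assumed names: `experiment_samples`, `sample_tximport_run`, and both helper functions are hypothetical and are not part of the refinebio data model.

```python
# Toy data mirroring the example above; every name here is hypothetical.
experiment_samples = {
    "A": {1, 2},
    "B": {3, 4},
    "C": {1, 2, 3, 4, 5},
}

# Which experiment-level tximport run produced each processed sample.
# Sample 5 is absent because it has not been processed yet.
sample_tximport_run = {1: "A", 2: "A", 3: "B", 4: "B"}


def tximport_runs_for(experiment):
    """Return the tximport runs behind an experiment's available samples."""
    return {
        sample_tximport_run[sample]
        for sample in experiment_samples[experiment]
        if sample in sample_tximport_run
    }


def from_own_run_only(experiment):
    """True only if every available sample came from this experiment's own run."""
    return tximport_runs_for(experiment) == {experiment}


for accession in sorted(experiment_samples):
    print(accession, sorted(tximport_runs_for(accession)), from_own_run_only(accession))
```

In this toy setup, Experiments A and B each draw on a single run of their own, while Experiment C's available samples come from the runs for A and B, which is the situation described above.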

Solution or next step

I'm not 100% sure if this is the optimal way to handle or prevent this, but I have A solution. I am currently working on a PR to add ExperimentResultAssociations between experiments and the ComputationResult objects created for tximport. For experiments that are ONLY RNA-Seq, we could only show them on the frontend if they have one of these associations.
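
A minimal sketch of how that display rule might look, assuming a simplified in-memory stand-in for the real models. The `Experiment` dataclass, its fields, and `should_display` are hypothetical; `tximport_result_ids` only stands in for the proposed ExperimentResultAssociations.

```python
from dataclasses import dataclass, field


@dataclass
class Experiment:
    accession: str
    technologies: set            # e.g. {"RNA-SEQ"} or {"RNA-SEQ", "MICROARRAY"}
    has_processed_sample: bool   # at least 1 sample with is_processed=True
    # Hypothetical stand-in for associations to tximport ComputationResults.
    tximport_result_ids: list = field(default_factory=list)


def should_display(experiment):
    # Existing rule described in this issue: needs a processed sample.
    if not experiment.has_processed_sample:
        return False
    # Proposed extra rule: RNA-Seq-only experiments also need a tximport association.
    if experiment.technologies == {"RNA-SEQ"}:
        return bool(experiment.tximport_result_ids)
    return True


experiments = [
    Experiment("A", {"RNA-SEQ"}, True, [101]),
    Experiment("C", {"RNA-SEQ"}, True),                # tximport not yet run for C
    Experiment("D", {"RNA-SEQ", "MICROARRAY"}, True),  # not RNA-Seq only
]

for experiment in experiments:
    print(experiment.accession, should_display(experiment))
```

Under these assumptions Experiment C stays hidden until it gets its own tximport association, while experiments that are not RNA-Seq only are unaffected.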

New Issue Checklist

kurtwheeler commented 5 years ago

I'm gonna tag @jaclyn-taroni and @cansav09 because this relates to what is and isn't scientifically valid to smash together, @Miserlou because it involves the smasher and kinda a data model question, and @arielsvn because the currently proposed solution will involve changes to what we display in the frontend and the API.

cansavvy commented 5 years ago

Once you have the ExperimentResult associations, it might be good to get some numbers on how many samples this really affects and how much overlap there are in experiments that have the same samples. If you are able to get me the data, I'd be happy to get some graphs and stats on it. I think determining how much overlap there is will be the first step to telling us how to go about solving this.

kurtwheeler commented 5 years ago

I actually think that I can give you some of those numbers already. As far as how many samples this is affecting SO FAR, I think it should be the difference between:

data_refinery=> select count(*) from samples where is_processed='t' and technology='RNA-SEQ';
 count 
-------
  9951
(1 row)

And what the frontend is displaying (because the frontend is counting experiment-sample relations for processed samples, not number of processed samples). Per that earlier screenshot, that number is 10759.

So therefore the number of affected samples is 10759 - 9951 = 808. This appears to be about 8% of our currently processed samples.
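
To make the difference between the two counts concrete, here is a rough sketch using the toy A/B/C experiments from the issue description, followed by the same subtraction with the figures quoted above. The variable names are made up for illustration.

```python
# Toy data from the A/B/C example; names are hypothetical.
experiment_samples = {
    "A": {1, 2},
    "B": {3, 4},
    "C": {1, 2, 3, 4, 5},
}
processed = {1, 2, 3, 4}  # Sample 5 is not processed yet

# What the frontend counts: experiment-sample relations for processed samples.
relation_count = sum(len(samples & processed) for samples in experiment_samples.values())

# What the SQL query counts: distinct processed samples.
distinct_count = len(set().union(*experiment_samples.values()) & processed)

print(relation_count, distinct_count, relation_count - distinct_count)  # 8 4 4

# The same subtraction with the numbers quoted in this comment.
frontend_total = 10759
processed_rnaseq = 9951
difference = frontend_total - processed_rnaseq
print(difference, round(100 * difference / processed_rnaseq, 1))  # 808 8.1
```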

What exactly are you looking for with:

how much overlap there are in experiments that have the same samples

? Like a list of experiments that have at least one shared sample and the percentage of samples that are shared for each experiment? I could probably pull that with some effort, however I'm curious what you would do with that data. 10000 samples is probably ~0.5% of the total volume of RNA-Seq that we will successfully be able to process, so I think we should be wary of any solutions that come with caveats along the lines of "as long as we don't have any experiments that share more than X% of their samples". Basically if you're worried an edge case might exist, chances are that we'll probably run into it.
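
For reference, the kind of list described here (experiments with at least one shared sample, plus the percentage of each experiment's samples that are shared) could be computed roughly as follows. This is a sketch over toy data; `shared_sample_report` and the input mapping are hypothetical and do not reflect the actual database schema.

```python
from collections import defaultdict


def shared_sample_report(experiment_samples):
    """Map each experiment with a shared sample to its percentage of shared samples."""
    # Invert the mapping: which experiments does each sample belong to?
    sample_to_experiments = defaultdict(set)
    for accession, samples in experiment_samples.items():
        for sample in samples:
            sample_to_experiments[sample].add(accession)

    report = {}
    for accession, samples in experiment_samples.items():
        shared = {s for s in samples if len(sample_to_experiments[s]) > 1}
        if shared:  # skip experiments with no shared samples
            report[accession] = 100 * len(shared) / len(samples)
    return report


# Toy data; Experiment D shares nothing and is left out of the report.
toy = {
    "A": {1, 2},
    "B": {3, 4},
    "C": {1, 2, 3, 4, 5},
    "D": {6, 7},
}
print(shared_sample_report(toy))
```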