Is there a way to specify sample and controls in runs?

gagneurlab / drop

Pipeline to find aberrant events in RNA-Seq data, useful for diagnosis of rare disorders

MIT License


Is there a way to specify sample and controls in runs? #521

Closed dissakov closed 4 months ago

dissakov commented 4 months ago

Hello! I'm trying to compare one sample against ~200 controls using FRASER/DROP. The way I understood the setup, I made a tsv containing the paths to all the data and pointed to it in the config file. However, this results in it running an analysis for each of the 201 files (comparing them I assume to everything else), but I'm only interested in comparing the one sample to the 200 others - the other 200 comparisons are just extra time/memory.

Is there a way to specify this somewhere in the config or elsewhere? I haven't been able to find it in the documentation. Thank you!

vyepez88 commented 4 months ago

Hi, the analysis is done per junction, not per sample. All 201 samples are needed to compute the mean and dispersion parameters of each junction. You could edit the results script to export only your sample of interest, but that would only shorten the total run by <5 minutes.
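As a rough illustration of why all samples enter the fit, here is a toy sketch in Python. It is not FRASER's model, which is considerably more involved. It only estimates a per-junction mean and spread from the whole cohort and then scores one sample against them. The function name `junction_zscores`, the plain z-score, and the ratio values are all made up for illustration.

```python
from statistics import mean, stdev

def junction_zscores(ratios_by_sample, sample_of_interest):
    """Score one sample per junction against cohort-wide statistics.

    ratios_by_sample maps a sample ID to a list of per-junction ratios
    (same junction order for every sample). Hypothetical toy input,
    not a FRASER data structure.
    """
    target = ratios_by_sample[sample_of_interest]
    scores = []
    for j in range(len(target)):
        # Every sample contributes to the junction's mean and spread.
        values = [ratios[j] for ratios in ratios_by_sample.values()]
        mu = mean(values)
        sd = stdev(values)
        scores.append((target[j] - mu) / sd if sd > 0 else 0.0)
    return scores

# Made-up ratios for two junctions: sample1 deviates only at junction 0.
cohort = {
    "ctrl1": [0.90, 0.50],
    "ctrl2": [0.92, 0.52],
    "ctrl3": [0.88, 0.48],
    "ctrl4": [0.91, 0.51],
    "sample1": [0.40, 0.50],
}
z = junction_zscores(cohort, "sample1")
```

Reporting results for only one sample skips the final lookup, but the loop over all samples per junction is still needed, which is where the time goes.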

dissakov commented 4 months ago

Thank you for your response! We're running quite a few separate runs, and it's taking about 1.5 days to run the above on our computational cluster. Is there anything that can be done to make this faster or less computationally intensive? Thanks!

vyepez88 commented 4 months ago

1.5 days is a lot; for me, fitting ~200 samples takes around 3-4 h. Are you running FRASER or FRASER2? I usually provide at least 30 cores and 200 GB of memory. Be sure to add the --rerun-triggers mtime parameter.

dissakov commented 4 months ago

FRASER2. The runs are with different samples, so I don't believe anything can be reused between them? I tried giving more cores, but the runtime remained the same; the run seems to be using ~70GB of memory, and it's being given 100GB.

vyepez88 commented 4 months ago

The split counts can be shared. That will work if you do the different runs by giving multiple DROP groups within one project (instead of executing DROP in different repositories).

dissakov commented 4 months ago

Okay, so if my 200 controls are shared, should I have multiple groups of 201 specified in the input tsv (200 controls + 1 sample)? Is that what you mean? Or do I just specify the controls as one group so the split counts are shared? Thanks!

vyepez88 commented 4 months ago

Let's say you have 200 controls and 5 samples that you want to test independently. Then you could set up your sample annotation like this:

    RNA_ID     DROP_GROUP
    ctrl1      group_s1, group_s2, group_s3, group_s4, group_s5
    ...
    ctrl200    group_s1, group_s2, group_s3, group_s4, group_s5
    sample1    group_s1
    ...
    sample5    group_s5

And have only one project with one config file. That way you'll run 5 analyses, but you only have to do the split counts once for all 205 samples.
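Writing 200 near-identical control rows by hand is error-prone, so the layout described above could be generated with a short script. This is a minimal sketch, not part of DROP: the helper names (`build_sample_annotation`, `members_by_group`, `to_tsv`) are hypothetical, the IDs follow the `ctrl1`/`sample1`/`group_s1` pattern from the example, and only the two columns shown there are written. Any other columns your real sample annotation has would need to be merged in. Groups are joined with commas and no spaces, which is an assumption about what DROP expects rather than something stated in this thread.

```python
import csv
import io

def build_sample_annotation(controls, samples):
    """Return (RNA_ID, DROP_GROUP) rows: controls join every group,
    each test sample joins only its own group."""
    # Hypothetical group naming, mirroring group_s1 ... group_s5 above.
    groups = [f"group_s{i}" for i in range(1, len(samples) + 1)]
    all_groups = ",".join(groups)
    rows = [(ctrl, all_groups) for ctrl in controls]
    rows += list(zip(samples, groups))
    return rows

def members_by_group(rows):
    """Expand the comma-separated DROP_GROUP column into group -> RNA_IDs."""
    out = {}
    for rna_id, drop_group in rows:
        for group in drop_group.split(","):
            out.setdefault(group.strip(), []).append(rna_id)
    return out

def to_tsv(rows):
    """Serialize the rows as tab-separated text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["RNA_ID", "DROP_GROUP"])
    writer.writerows(rows)
    return buf.getvalue()

controls = [f"ctrl{i}" for i in range(1, 201)]
samples = [f"sample{i}" for i in range(1, 6)]
rows = build_sample_annotation(controls, samples)
groups = members_by_group(rows)
tsv_text = to_tsv(rows)
```

Checking `groups` before running is a cheap sanity test: each of the 5 groups should contain the 200 controls plus exactly one test sample.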

dissakov commented 4 months ago

Oh, I see - so one sample can be in multiple drop_groups?

vyepez88 commented 4 months ago

yes, comma separated. Some examples here: https://gagneurlab-drop.readthedocs.io/en/latest/prepare.html#external-count-examples

dissakov commented 4 months ago

Understood. Thanks!