bigbio / quantms

Quantitative mass spectrometry workflow. Currently supports proteomics experiments with complex experimental designs for DDA-LFQ, DDA-Isobaric and DIA-LFQ quantification.
https://quantms.org
MIT License

Improve the skip pre-analysis parameter logic #337

Closed: ypriverol closed this 1 month ago

ypriverol commented 8 months ago

Description of feature

In PR #335, @jspaezp introduced the possibility of performing the DIA analysis without the pre-analysis step. It would be great to use the SDRF information to sub-select a few RAW files for the pre-analysis, which yields better final results. Some of the SDRF columns that could be used to select files for the pre-analysis are:

jspaezp commented 8 months ago

In internal discussion we talked about automatically random-subsetting the files to generate an empirical library.

We favored subsetting to a 'maximum number of files' rather than to a 'percentage subset', since it offers:

  1. More reusable configurations (the sampling would simply not kick in on small runs).
  2. Better control over computational resources.

Suggested implementation: Config option

params {
   ...
   empirical_assembly_sample_n = 200
}

(not real code ... just meant to show where in the workflow it would happen)

// Sample only when there are more files than the configured cap
if (all_files.size() > params.empirical_assembly_sample_n) {
    empirical_assembly_files = all_files
        .randomSample( params.empirical_assembly_sample_n )
} else {
    empirical_assembly_files = all_files
}

first_search_results = FIRST_SEARCH(empirical_assembly_files)
empirical_assembly = EMPIRICAL_ASSEMBLY_FILES(first_search_results)

final_individual_results = FINAL_INDIV_SEARCH(all_files.mix(empirical_assembly))
ypriverol commented 8 months ago

As @jspaezp mentioned, we should refine the current proposal for skipping the assembly library and pre-analysis using the following logic:

This will let us run the pre-analysis on only a certain number of files, depending on the size of the cluster and the available resources; when the option is disabled, all files will always be used. For really large datasets, users can define the number of files they want to use.
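In the same spirit as the pseudocode above, the combined "skip/limit" logic could be sketched as follows. This is only an illustration, not real pipeline code: `skip_preanalysis` is a hypothetical parameter name, and treating `empirical_assembly_sample_n = null` as "always use all files" is an assumption.

```nextflow
params {
    // Hypothetical flag: skip the empirical-library pre-analysis entirely (cf. PR #335)
    skip_preanalysis = false
    // Cap on the number of files used for the pre-analysis; null means "use all files"
    empirical_assembly_sample_n = 200
}

// Sketch of the selection step inside the workflow (same style as the pseudocode above)
if (!params.skip_preanalysis) {
    def n = params.empirical_assembly_sample_n
    // Only subsample when a cap is set and the dataset exceeds it
    empirical_assembly_files = (n && all_files.size() > n)
        ? all_files.randomSample(n)
        : all_files
}
```

With this shape, small datasets are unaffected by the cap, and very large datasets can either tune the cap or disable the pre-analysis outright.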

My main question is how the previous PR #335 by @jspaezp overlaps with this idea. In the previous PR, does the user need to configure job1 and job2 manually?

ypriverol commented 1 month ago

This has been done.