NYU-Molecular-Pathology / snsxt

bioinformatics pipeline framework for data analysis
GNU General Public License v3.0

Revise analysis sample output file path retrieval #9

Closed: stevekm closed this issue 7 years ago

stevekm commented 7 years ago

Currently, the files for a given sample at a given step in the analysis pipeline are retrieved through filename pattern matching, as shown here:

# file matching pattern based on the sample's id
self.search_pattern = '{0}*'.format(self.id)

This pattern may not be specific enough to prevent matching the wrong file(s) when two or more samples in an analysis have similar names, in particular when one sample ID is a prefix of another, such as:

Sample1
Sample11

This needs to be revised to do a more exact search, to prevent the possibility of mismatches. Options to consider: create the exact 'expected' filename and search for an exact match; use the samples.*.csv files output by sns, with paths to the expected files; or record the output paths of files in snsxt analysis steps for later retrieval.
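As a rough illustration of the first option, here is a minimal sketch comparing the current prefix pattern with an exact-match lookup. The filenames, the '.bam' suffix, and the function names `match_by_prefix` and `match_exact` are hypothetical and are not taken from snsxt or sns.

```python
import fnmatch

# Hypothetical output filenames for one analysis step (assumed for illustration)
filenames = ['Sample1.bam', 'Sample11.bam', 'Sample1.bam.bai']


def match_by_prefix(sample_id, names):
    """Glob-style prefix match on the sample ID, as in the current search pattern."""
    pattern = '{0}*'.format(sample_id)
    return [name for name in names if fnmatch.fnmatchcase(name, pattern)]


def match_exact(sample_id, names, suffix):
    """Build the expected filename and keep only exact matches."""
    expected = '{0}{1}'.format(sample_id, suffix)
    return [name for name in names if name == expected]


# The prefix pattern for Sample1 also picks up the Sample11 file
prefix_matches = match_by_prefix('Sample1', filenames)

# The exact match returns only the file that belongs to Sample1
exact_matches = match_exact('Sample1', filenames, '.bam')
```

The exact-match version requires each step to know the suffix of the file it expects, which is the trade-off against the more permissive pattern.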

Also, for reference, see the file retrieval class method and the find module.

stevekm commented 7 years ago

Started documenting a more thorough sample filepath retrieval method here: https://github.com/NYU-Molecular-Pathology/snsxt/blob/961be0f27dfdfd3beed497dc9fcc477ebecd7f62/snsxt/sns_tasks/_DemoQsubSampleTask.py#L38

stevekm commented 7 years ago

Also need to reconsider the consistency between sample filepath retrieval methods, which return a list, and file-handling methods, which take a single character string as input. The same applies to situations where multiple files might be returned by, or needed from, a single step. It might be necessary to enforce filepath lists more globally throughout the program and its submodules.
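One possible way to reconcile the two conventions is a pair of small helpers at the boundary between retrieval and file handling. This is only a sketch: `as_path_list` and `require_single_path` are hypothetical names, not existing snsxt functions, and the `str` check assumes Python 3 (Python 2 code would check `basestring` instead).

```python
def as_path_list(paths):
    """Return file paths as a list, given either a single string or an iterable of strings."""
    if isinstance(paths, str):
        return [paths]
    return list(paths)


def require_single_path(paths):
    """Return the only path in `paths`; raise ValueError if there are zero or several."""
    path_list = as_path_list(paths)
    if len(path_list) != 1:
        raise ValueError(
            'expected exactly 1 file, found {0}: {1}'.format(len(path_list), path_list)
        )
    return path_list[0]
```

Steps that can handle several files would call `as_path_list`, while steps that need exactly one file would call `require_single_path` and fail loudly on an ambiguous match instead of silently using the wrong file.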

stevekm commented 7 years ago

Updated the sample filepath retrieval method here: https://github.com/NYU-Molecular-Pathology/snsxt/blob/529cac27822869127f36e4449bacd33e3232dfe6/snsxt/sns_tasks/_DemoQsubSampleTask.py#L37