naming scheme for the fastq files

josephhughes commented 2 years ago

Find an easy way to cope if the genomics team changes the fastq naming scheme, either within the Snakemake workflow itself or in a utility script run before the pipeline.

TriassicSalamander commented 2 years ago

It would depend on how the genomics team might change the format. Currently, it is assumed that demultiplexed fastqs will be of the format 'CVRXXXXX_SXX'. As we only wanted to keep the CVR number, we split by underscore and use the first field (the CVR number) for naming the output in subsequent steps. If the genomics team were to format sample names to include more underscores, this would cause a problem. For example, take two sample fastqs: 'CVR123_COND-A_SXX' and 'CVR123_COND-B_SXX'. Splitting by underscore and only taking the first field would result in both samples having identical names further on in the pipeline.
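To make the collision concrete, here is a minimal sketch. It assumes the short name is just the first underscore-delimited field, as described above. `first_field` is a hypothetical helper for illustration, not the code in the Snakefile.

```python
def first_field(fastq_stem: str) -> str:
    """Keep only the text before the first underscore (hypothetical helper)."""
    return fastq_stem.split("_")[0]


# The two example sample names from the comment above.
stems = ["CVR123_COND-A_SXX", "CVR123_COND-B_SXX"]
short_names = [first_field(stem) for stem in stems]

# Both stems reduce to the same short name, so downstream outputs would clash.
has_collision = len(set(short_names)) < len(short_names)
print(short_names, has_collision)
```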

When splitting by underscore, it may be possible to drop only the last field (SXX) and keep everything before it, rather than deciding whether to select the first one or two fields.
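A rough sketch of the "drop the last field" idea, assuming the trailing field is always the 'SXX' sample index added at demultiplexing. `drop_last_field` is a hypothetical name and may not match what the pipeline actually does.

```python
def drop_last_field(fastq_stem: str) -> str:
    """Remove the final underscore-delimited field (assumed to be 'SXX').

    rsplit with maxsplit=1 splits once from the right, so any extra
    underscores earlier in the name are kept. A stem with no underscore
    is returned unchanged.
    """
    return fastq_stem.rsplit("_", 1)[0]


# Names that collided when only the first field was kept now stay distinct.
examples = ["CVR12345_SXX", "CVR123_COND-A_SXX", "CVR123_COND-B_SXX"]
shortened = [drop_last_field(stem) for stem in examples]
print(shortened)
```

This only helps while the sample index stays the last field. If the genomics team appends further fields after it, the rule would need revisiting.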

TriassicSalamander commented 2 years ago

I've changed how the sample names are shortened in the Snakefile and postReadAlign.smk rules 'aggregate_consensus', 'aggregate_masked_consensus' and 'make_climb_dirs'. Now if the genomics team uses extra underscores, it shouldn't cause any errors.

TriassicSalamander / nimagen_snakemake