Relevance of input_regions parameter for WGS and effect of number of samples

npatel22526 commented 6 years ago

Hello,

First of all, a great concept for detecting STR from sort read sequencing data.

I was testing tool by providing a bed file with option "input_regions". The bed file is subset of " hg19.simpleRepeat_period1-6_dedup.sorted.bed" from five chromosomes that I was interested in. However, the resulting "STRs.tsv" and "Sample.STRs.tsv" files contain sites from all chromosomes, is it expected? If I have to run the analysis again with all STR sites should I provide with bed file with “input_regions” or not use the option at all? And also is there a way so that I can skip "median_cov" and "mosdepth.global.dist.txt" calculation as it's already been done once.

Initially I ran the analysis with 51 samples to test it but I would like to apply it to large set of 1000 WGS samples, what effect it would have if I ran all of them together Vs running in small batches. How many samples you would recommend in a batch ?

Thank You, Nick

hdashnow commented 6 years ago

Hi Nick,

Thanks for trying STRetch!

I was testing tool by providing a bed file with option "input_regions". The bed file is subset of " hg19.simpleRepeat_period1-6_dedup.sorted.bed" from five chromosomes that I was interested in.

input_regions should specify all the positions of all STRs in the genome, for example hg19.simpleRepeat_period1-6.dedup.sorted.bed. Using a bed file that only specifies some STRs may cause STRetch to miss expansions (see the documentation). This is because reads from STRs frequently mis-map during alignment to the standard reference genome. STRetch needs to know about all STR regions in order to retrieve mis-mapped reads. If you only want to look at a subset of STR loci, you can extract them from the final output file. If you remove the header it's a valid bed file, so you can then use a tools such as 'bedtools intersect' to filter.

And also is there a way so that I can skip "median_cov" and "mosdepth.global.dist.txt" calculation as it's already been done once.

If you rerun the the command in the same directory bpipe will automatically skip stages that have already been completed (as long as the data from a previous stage hasn't changed). See the bpipe docs for more info.

Initially I ran the analysis with 51 samples to test it but I would like to apply it to large set of 1000 WGS samples, what effect it would have if I ran all of them together Vs running in small batches. How many samples you would recommend in a batch ?

I must admit, the largest number of samples I've run in one batch is ~140. This depends a bit on your compute environment. If you have a large cluster you could use bpipe to submit jobs to the queue for you. So you could really run as many samples as you like at a time, and it will handle job submission etc. Each sample runs in parallel until the final stage. So really the only limitation is the final stage where all samples are combined to do the statistics. If you have mixture of affected and unaffected samples, you may want compute summary statistics from all the unaffected samples, then use these as controls for the affected samples (see here).

npatel22526 commented 6 years ago

Thank you very much Harriet for in detail response. This answers all my queries.

Best, Nick

Oshlack / STRetch

Relevance of input_regions parameter for WGS and effect of number of samples #32