Oshlack / STRetch

Method for detecting STR expansions from short-read sequencing data
MIT License
62 stars 15 forks source link

Error in align_bwa_bam stage leads to a pipeline failure #35

Closed nmmsv closed 6 years ago

nmmsv commented 6 years ago

Thank you for the great method! I'm trying to run STRetch on a whole genome bam file using the whole genome bed (hg19.simpleRepeat_period1-6_dedup.sorted.bed). In the pipeline, I get an error in the alignment step (seems like there is a missing value for --T argument when calling samtools sort). The pipeline continues with calculating median etc, but fails in the end because of error in the alignment step. The log file is attached. Am I doing something incorrect?
Thanks! 47976.log

hdashnow commented 6 years ago

Hi @nmmsv,

Thanks for trying STRetch! I think I see what's happening here. Your input file name starts with an underscore _EGAR00001587255_repeat_expansions_NA07542.bam. STRetch is trying to extract the sample name as everything before the first underscore, but obviously that is an empty string! Thanks for finding this bug. I'll push a fix for this (#36). Could you please pull the latest version of STRetch from master and try again?

Warm regards, Harriet

nmmsv commented 6 years ago

Hi Harriet, Thanks for your prompt response! The issue in the sort seems to be fixed. However my pipeline still fails, I get the following error:

ERROR: There were no reads mapping to chromosomes with names starting with "STR-" in _EGAR00001587249_repeat_expansions_NA13506.bam. Are you sure this data is mapped to a reference genome with STR decoy chromosomes? Does this mean that I my bam file should be aligned to the ref genome with decoys? Is there a procedure for using bams that are already aligned to hg19? Thanks again! Best, Nima

hdashnow commented 6 years ago

Hi @nmmsv Could you please take a look at the _EGAR00001587249_repeat_expansions_NA13506.bam that was generated by STRetch? Does it actually contain reads? The file size is usually around 6 GB for a 40X whole genome to give you an idea. What genome was your original input bam file mapped against? Check the bam header if you're unsure. The most common reason why the STRetch bam file doesn't contain reads is that the input bed file doesn't match up with the reference genome used to map the input bam. For example in hg19 the chromosomes are named "chr1", "chr2", while in GRCh37 they are named "1" "2" etc.

nmmsv commented 6 years ago

I don't think there is any new bam file generated by STRetch. I only have the original bam (~69 GB) in the directory that I'm running (ls -lh output attached below). Should I specify an output directory for the bam file? Or is it going to be generated wherever the original bam resides?

total 69G -rw-rw-r-- 1 nmmsv nmmsv 10K Sep 27 08:43 commandlog.txt -rw-rw-r-- 1 nmmsv nmmsv 69G Sep 26 12:33 _EGAR00001587249_repeat_expansions_NA13506.bam -rw-rw-r-- 1 nmmsv nmmsv 8.5M Sep 26 12:33 _EGAR00001587249_repeat_expansions_NA13506.bam.bai -rw-rw-r-- 1 nmmsv nmmsv 3 Sep 27 08:43 _EGAR00001587249_repeat_expansions_NA13506._EGAR00001587249_repeat_expansions_NA13506.median_cov -rw-rw-r-- 1 nmmsv nmmsv 860K Sep 27 08:43 _EGAR00001587249_repeat_expansions_NA13506._EGAR00001587249_repeat_expansions_NA13506.mosdepth.global.dist.txt -rw-rw-r-- 1 nmmsv nmmsv 9.6K Sep 27 08:43 _EGAR00001587249_repeat_expansions_NA13506.STR_counts -rw-rw-r-- 1 nmmsv nmmsv 255 Sep 26 12:34 run_STRetch.s

Thanks! Nima

hdashnow commented 6 years ago

Hi @nmmsv,

I'd written the code assuming the input bam file would be in a different directory to the working directory. So it's getting confused thinking that it has already processed that sample when in fact it is just seeing the input file.

I've just fixed it so it will name those two files differently. Could you please update and have another go?

Thanks for finding these bugs!

nmmsv commented 6 years ago

Thank you for this update! I think the pipeline finished successfully! Best, Nima

nmmsv commented 6 years ago

Thanks again for your help. It seems like I have STR results in _EGAR00001587249_repeat_expansions_NA13506.STRs.tsv. This file contains 1086 loci and estimates for each one. The original bed had +300k loci. Are the rest not called because they don't have an allele surpassing read length (they are too small to have a read mapped to decoy)?

hdashnow commented 6 years ago

Yep, STRetch only reports a locus if there is any evidence of expansion i.e. at least one read from that locus that maps to the STR decoys. However, of course you can't definitely say that the other loci are small, because there may be no coverage over them for example. But in generally, yes, for good coverage WGS the ones that are not reported are likely to be small.

hdashnow commented 6 years ago

I should also note, that running a single sample with no controls is likely to produce less useful results. If you haven't already, I suggest either editing the "CONTROL" line in pipeline_config.groovy to use the provided controls or running with a set of controls if you have some.

nmmsv commented 6 years ago

Thank you very much for helping me run this method! Just to be clear, are the STR length estimates (just the numbers, how many copies for each locus) also affected by control samples? Or control is only used to do stats test for expansion?

hdashnow commented 6 years ago

Correct, the STR length estimates do not depend on the controls. Controls are just used to generate p-values. But without p-values figuring out which expansion is interesting is pretty tricky. As you've seen, you can get 1000 or more results for sample, but only a small portion of these will actually be significant.

nmmsv commented 6 years ago

Great! Thanks for the information! :)