Oshlack / STRetch

Method for detecting STR expansions from short-read sequencing data
MIT License
62 stars 15 forks source link

Using different STR reference set #50

Closed nmmsv closed 5 years ago

nmmsv commented 5 years ago

Hi, I'm trying to use a different reference set (STR bed file) with STRetch, and I'm having some trouble. I've converted my ref to the format that I believe should be appropriate:

chr1 10000 10466 TAACCC 77 chr1 10485 10496 GCCC 2 chr1 14069 14080 CCTC 2 chr1 16619 16630 GCT 3 chr1 22811 22820 AGGAA 1 chr1 26453 26464 GT 5 chr1 30854 30954 TC 50 chr1 31555 31569 AAAAT 2

Then I update the pipeline config such that STR_BED points to this reference. When I'm running bpipe, I also set input_regions=path/to/new/ref.bed, because I wasn't sure which parameter should be set to use a different STR reference set.

With these setting, the pipeline fails with this error: ERROR: file /PATH/NA12878_S1.STRdecoy.locus_counts contained 0 loci.

I checked STRdecoy.STR_counts file, and I can see non-zero values. (So I imagine alignment should be fine)

STR-A 0 1999 1392 STR-AAAAAC 0 1999 0 STR-AAAAAG 0 1999 0 STR-AAAAAT 0 1999 5 STR-AAAAC 0 1999 5 STR-AAAACC 0 1999 0 STR-AAAACG 0 1999 0 STR-AAAACT 0 1999 0 STR-AAAAG 0 1999 125 STR-AAAAGC 0 1999 0

Thank you for your help! Nima

hdashnow commented 5 years ago

Hi Nima,

The format of that bed file looks fine. The last column, the number of repeat units, accepts floating point values, so rounding them off like that may make a small difference to the size estimates. But they'll only be off by less than a repeat unit, so nothing worth worrying about.

So just to confirm, you're using the STRetch_wgs_bam_pipeline.groovy pipeline?

The most common cause of the issue you're describing is that the input bam file was aligned to a different reference genome to the one you provide as input_regions. For example if your bam was aligned with hg38 but your bed file describes hg19, or more commonly both are on the same genome build but one has chromosomes named 'chr1', 'chr2' and the other uses '1', '2' etc.

Note that the bed files specified for input_regions and STR_BED are not necessarily the same.

Also, have you looked through the custom reference data tutorial? https://github.com/Oshlack/STRetch/wiki/Reference-Data In particular, have you confirmed that there are no duplicate loci, and that the file is correctly sorted? i.e. these two commands:

# Remove duplicate lines (lines where two different repeat units are annotated at the same locus):
Rscript STRetch/scripts/dedupSTRannotation.R -i hg38.trf_period1-6.bed -o hg38.trf_period1-6.dedup.bed

# Sort:
bedtools sort -i hg38.trf_period1-6.dedup.bed > hg38.trf_period1-6.dedup.sorted.bed

If those suggestions don't help and you'd like me to look into this further, please provide the full STRetch command and the part of the header from the input bam file that specifies the chromosome names.

Warm regards, Harriet

nmmsv commented 5 years ago

Hi Harriet, Thanks for detailed response. I deduped and sorted my reference, and I still get the same error. Here is some more information. Thank you for your help!

Command:

/storage/nmmsv/STRetch/tools/bin/bpipe run -p input_regions=/storage/nmmsv/analysis/GangSTR-analyses/genome_wide/STRetch/refs/gangstr_exp8_for_STRetch.dedup.sorted.bed /storage/nmmsv/STRetch/pipelines/STRetch_wg\ s_bam_pipeline.groovy /storage/nmmsv/plat_genome/NA12878_S1.bam

Bam header:

@SQ SN:chrM LN:16571 @SQ SN:chr1 LN:249250621 @SQ SN:chr2 LN:243199373 @SQ SN:chr3 LN:198022430 @SQ SN:chr4 LN:191154276 @SQ SN:chr5 LN:180915260 @SQ SN:chr6 LN:171115067 @SQ SN:chr7 LN:159138663 @SQ SN:chr8 LN:146364022 @SQ SN:chr9 LN:141213431 @SQ SN:chr10 LN:135534747 @SQ SN:chr11 LN:135006516 @SQ SN:chr12 LN:133851895 @SQ SN:chr13 LN:115169878 @SQ SN:chr14 LN:107349540 @SQ SN:chr15 LN:102531392 @SQ SN:chr16 LN:90354753 @SQ SN:chr17 LN:81195210 @SQ SN:chr18 LN:78077248 @SQ SN:chr19 LN:59128983 @SQ SN:chr20 LN:63025520 @SQ SN:chr21 LN:48129895 @SQ SN:chr22 LN:51304566 @SQ SN:chrX LN:155270560 @SQ SN:chrY LN:59373566 @RG ID:NA12878 SM:NA12878 @PG ID:bwa PN:bwa VN:0.6.1-r104-tpx

and here is part of the pipeline config file:

// For exome pipeline only Edit before running the exome pipeline EXOME_TARGET="/storage/nmmsv/analysis/GangSTR-analyses/new_sim/results/STRetch/loci/HTT.bed" // Uncomment the line below to run the STRetch installation test, or specify your own //EXOME_TARGET="SCA8_region.bed"

// For bam pipeline only Edit before running if using CRAM input format CRAM_REF="path/to/reference_genome_used_to_create_cram.fasta" // Path to reference data //refdir="/storage/nmmsv/STRetch/reference-data" refdir="/storage/nmmsv/analysis/GangSTR-analyses/genome_wide/STRetch/refs"

// Decoy reference assumed to have matching .genome file in the same directory REF="/storage/nmmsv/STRetch/reference-data/hg19.STRdecoys.sorted.fasta" //STR_BED="/storage/nmmsv/STRetch/reference-data/hg19.simpleRepeat_period1-6_dedup.sorted.bed" STR_BED="/storage/nmmsv/analysis/GangSTR-analyses/genome_wide/STRetch/refs/gangstr_exp8_for_STRetch.dedup.sorted.bed" DECOY_BED="/storage/nmmsv/STRetch/reference-data/STRdecoys.sorted.bed" // By default, uses other samples in the same batch as a control CONTROL="" // Uncomment the line below to use a set of WGS samples as controls, or specify your own CONTROL="/storage/nmmsv/STRetch/reference-data/hg19.PCRfreeWGS_143_STRetch_controls.tsv"

Thanks!

hdashnow commented 5 years ago

Hi Nima,

That is a puzzle. Is the bed file tab-delimited? If not, try with tabs.

And is everything working when you keep everything the same but use the STRetch-provided bed file? I just want to eliminate the possibility that there's something else going on.

Would you be willing to send me the bed file so I can test it out?

Warm regards, Harriet

nmmsv commented 5 years ago

I just tried setting STR_BED variable in pipeline config and input_regions in runtime commands back to the originial STRetch reference, and I got the same error. I'm running in the directory that already has the files from partial run using the other reference. Do I need to rerun from scratch and do alignment step all over to see if it still works with the original ref?

I made sure that my reference is tab delimited, and I've uploaded it here: https://drive.google.com/open?id=1LvBqENdk0d4ebiiS1S3CLCxdC9t7frQJ

Thanks for your help! Nima

hdashnow commented 5 years ago

Hi Nima,

Yes, you would need to delete intermediate files, as changing input_regions will change the contents of the intermediate bam file even if no error occurs at that point.

I ran through an exome sample with your bed file linked above, and it worked fine (correctly identified a known expansion). There are many fewer results as it lacks homopolymers, but that's not an issue if you're not interested in those.

$STRETCH/tools/bin/bpipe run -n 100 \
  -p EXOME_TARGET="target_region.flattened.bed" \
  -p input_regions=$STRETCH/reference-data/STRetch.bed \
  -p STR_BED=$STRETCH/reference-data/STRetch.bed \
  $STRETCH/pipelines/STRetch_exome_bam_pipeline.groovy \
  sample_SCA8.merge.bam

Warm regards, Harriet

nmmsv commented 5 years ago

Hi Harriet, Thanks for the help. I'll try running the wgs pipeline one more time with the correct reference. I'll let you know if I ran into a problem! Thanks. Nima

nmmsv commented 5 years ago

Hello, I just got a new error in the alignment step. Is this something familiar to you?

[M::process] 0 single-end sequences; 346 paired-end sequences [M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (8, 1, 0, 3) [M::mem_pestat] skip orientation FF as there are not enough pairs [M::mem_pestat] skip orientation FR as there are not enough pairs [M::mem_pestat] skip orientation RF as there are not enough pairs [M::mem_pestat] skip orientation RR as there are not enough pairs [M::mem_process_seqs] Processed 346 reads in 0.971 CPU sec, 0.419 real sec [main] Version: 0.7.15-r1140 [main] CMD: /storage/nmmsv/STRetch/tools/bin/bwa mem -p -M -t 7 -R @RG\tID:NA12878_S1\tPL:illumina\tPU:NA\tLB:NA\tSM:NA12878_S1 /storage/nmmsv/STRetch/reference-data/hg19.STRdecoys.sorted.fasta - [main] Real time: 47023.157 sec; CPU: 314309.855 sec [bam_sort_core] merging from 69 files and 1 in-memory blocks... bash: line 1: 114166 Killed java -Xmx16g -Dsamjdk.reference_fasta=path/to/reference_genome_used_to_create_cram.fasta -jar /storage/nmmsv/STRetch/tools/bin/bazam.jar -pad 5 -n 6 -L /storage/nmmsv/analysis/GangSTR-analyses/genome_wide/STRetch/refs/gangstr_exp8_for_STRetch.dedup.sorted.bed -bam /storage/nmmsv/plat_genome/NA12878_S1.bam 114167 Done | /storage/nmmsv/STRetch/tools/bin/bwa mem -p -M -t 7 -R "@RG\tID:NA12878_S1\tPL:illumina\tPU:NA\tLB:NA\tSM:NA12878_S1" /storage/nmmsv/STRetch/reference-data/hg19.STRdecoys.sorted.fasta - 114168 Done | /storage/nmmsv/STRetch/tools/bin/samtools view -bSuh - 114169 Done | /storage/nmmsv/STRetch/tools/bin/samtools sort -o NA12878_S1.STRdecoy.bam -T NA12878_S1.STRdecoy Cleaned up file NA12878_S1.STRdecoy.bam to .bpipe/trash/NA12878_S1.STRdecoy.bam.2 ERROR: stage align_bwa_bam failed: Command in stage align_bwa_bam failed with exit status = 137 :

set -o pipefail; java -Xmx16g -Dsamjdk.reference_fasta=path/to/reference_genome_used_to_create_cram.fasta -jar /storage/nmmsv/STRetch/tools/bin/bazam.jar -pad 5 -n 6 -L /storage/nmmsv/analysis/GangSTR-analyses/genome_wide/STRetch/refs/gangstr_exp8_for_STRetch.dedup.sorted.bed -bam /storage/nmmsv/plat_genome/NA12878_S1.bam | /storage/nmmsv/STRetch/tools/bin/bwa mem -p -M -t 7 -R "@RG\tID:NA12878_S1\tPL:illumina\tPU:NA\tLB:NA\tSM:NA12878_S1" /storage/nmmsv/STRetch/reference-data/hg19.STRdecoys.sorted.fasta - | /storage/nmmsv/STRetch/tools/bin/samtools view -bSuh - | /storage/nmmsv/STRetch/tools/bin/samtools sort -o NA12878_S1.STRdecoy.bam -T NA12878_S1.STRdecoy

========================================= Pipeline Failed ==========================================

One or more parallel stages aborted. The following messages were reported:

-------------------------------------- align_bwa_bam ( 1 ) ---------------------------------------

Command in stage align_bwa_bam failed with exit status = 137 :

set -o pipefail; java -Xmx16g -Dsamjdk.reference_fasta=path/to/reference_genome_used_to_create_cram.fasta -jar /storage/nmmsv/STRetch/tools/bin/bazam.jar -pad 5 -n 6 -L /storage/nmmsv/analysis/GangSTR-analyses/genome_wide/STRetch/refs/gangstr_exp8_for_STRetch.dedup.sorted.bed -bam /storage/nmmsv/plat_genome/NA12878_S1.bam | /storage/nmmsv/STRetch/tools/bin/bwa mem -p -M -t 7 -R "@RG\tID:NA12878_S1\tPL:illumina\tPU:NA\tLB:NA\tSM:NA12878_S1" /storage/nmmsv/STRetch/reference-data/hg19.STRdecoys.sorted.fasta - | /storage/nmmsv/STRetch/tools/bin/samtools view -bSuh - | /storage/nmmsv/STRetch/tools/bin/samtools sort -o NA12878_S1.STRdecoy.bam -T NA12878_S1.STRdecoy


Thanks! Nima

hdashnow commented 5 years ago

Hi Nima,

exit status = 137 may indicate that your cluster killed the process. Check memory requirements and walltime, as these are common reasons. There should be some way for your cluster to tell you what the error was. You'd have to talk to your IT as it's specific to the system.

Seems like this has morphed away from the original issue, so I'm going to close this for now. Let me know if there's anything else I can do to help.

Warm regards, Harriet