Oshlack / STRetch

Method for detecting STR expansions from short-read sequencing data
MIT License
62 stars 15 forks source link

Error in `Stage median_cov_target` #25

Closed ifiddes-10x-zz closed 6 years ago

ifiddes-10x-zz commented 6 years ago

I am trying to use the STRetch_wgs_bam_pipeline.groovy pipeline with a previously generated BAM against GRCh38, targeting a specific set of STRs I have identified in a BED file. After taking an hour to walk the BAM (why does it not use random access with the index?), it crashes with an error about the reference:

=========================== Stage median_cov_target (NA24143_longranger) ===========================
2018/01/31 15:01:40 NA24143.bam
chromosome: STR-A not found in NA24143.bam
chromosome: STR-AAAAAC not found in NA24143.bam
chromosome: STR-AAAAAG not found in NA24143.bam
chromosome: STR-AAAAAT not found in NA24143.bam
chromosome: STR-AAAAC not found in NA24143.bam

... <removed lots of lines>

panic: runtime error: index out of range
goroutine 1 [running]:
github.com/brentp/goleft/covmed.BamInsertSizes(0xc42009a320, 0x186a0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
        /home/brentp/go/src/github.com/brentp/goleft/covmed/main.go:106 +0x6c5
github.com/brentp/goleft/covmed.Main()
        /home/brentp/go/src/github.com/brentp/goleft/covmed/main.go:153 +0x5d6
main.main()
        /home/brentp/go/src/github.com/brentp/goleft/main.go:65 +0x191

Am I misunderstanding how to use this pipeline? Do I need to be mapped against a special reference?

hdashnow commented 6 years ago

Hi @ifiddes-10x,

It looks like the error might be occurring within the goleft tool.

Could you please tell me a little more about how you have set up the pipeline? What bed file did you provide to the pipeline? If your original bam was on hg38, it is expecting the hg38 positions of all STRs in the genome.

The messages "chromosome: STR-AAAAAC not found in NA24143.bam" etc don't necessarily indicate a problem, if for example there are no AAAAAC expansions in your genome. The message is generated by bedtools, but I could probably suppress it.

ifiddes-10x-zz commented 6 years ago

I am using a bed file of only a few specific STRs that we are interested in. Could that be the problem? Should I find a genome wide one?

Other than that, I am doing all default parameters, with my custom STR BED file, on group of hg38 BAMs like so, in a subfolder I made to reduce clutter in the main folder: ../tools/bin/bpipe run ../pipelines/STRetch_wgs_bam_pipeline.groovy ../../strs.bed ../NA*bam

hdashnow commented 6 years ago

STRetch works by remapping reads to a custom reference genome with STR decoys. It works on the premise that these reads were likely to have been mis-mapped to the standard reference genome, because the expanded sequence is no present there. The first stage in the pipeline is extracting all reads that are likely to contain STRs. So, all reads mapping to any STR locus, flanking region, and unmapped reads. So you need to provide a bed file containing the locations of all STRs for it to work effectively. STRetch will then report on all loci, but you could filter the results to loci of interest at the end. I've uploaded the hg19 bed file. I'll create the hg38 bed file for you and post the link here in a few hours.

ifiddes-10x-zz commented 6 years ago

I see, I misunderstood. Thanks for the STR BED file!

hdashnow commented 6 years ago

@ifiddes-10x The hg38 bed file is here: https://figshare.com/s/198f7a51bd95ce16b3db

Usage: bpipe run STRetch/pipelines/STRetch_wgs_bam_pipeline.groovy hg38.simpleRepeat_period1-6.dedup.sorted.bed sample1.bam sample2.bam

At the moment STRetch ships with hg19, so it will remap against the hg19 + STR decoys, then give you results in terms of hg19. It is definitely time for me to generate and upload a hg38 reference, but it might take me a while to create and test appropriately.

ifiddes-10x-zz commented 6 years ago

Sorry, where did you upload the hg19 BED file?

On Wed, Jan 31, 2018 at 5:35 PM Harriet Dashnow notifications@github.com wrote:

@ifiddes-10x https://github.com/ifiddes-10x The hg38 bed file is here: https://figshare.com/s/198f7a51bd95ce16b3db

Usage: bpipe run STRetch/pipelines/STRetch_wgs_bam_pipeline.groovy hg38.simpleRepeat_period1-6.dedup.sorted.bed sample1.bam sample2.bam

At the moment STRetch ships with hg19, so it will remap against the hg19 + STR decoys, then give you results in terms of hg19. It is definitely time for me to generate and upload a hg38 reference, but it might take me a while to create and test appropriately.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Oshlack/STRetch/issues/25#issuecomment-362129499, or mute the thread https://github.com/notifications/unsubscribe-auth/AhnBndFi6Ec6NBeWOxW28Wi_E2el9Ib3ks5tQRT0gaJpZM4R0zXt .

--

hdashnow commented 6 years ago

Your input data is in hg38 so you will need to use the file linked above. Once reads are extracted the pipeline will switch to hg19. You should already have those reference files from the installation script

ifiddes-10x-zz commented 6 years ago

I have some inputs that are mapped to hg38, some to hg19. Is the hg19 STR file one of the ones that the install script placed in the reference-data folder?

On Wed, Jan 31, 2018 at 5:55 PM Harriet Dashnow notifications@github.com wrote:

Your input data is in hg38 so you will need to use the file linked above. Once reads are extracted the pipeline will switch to hg19. You should already have those reference files from the installation script

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Oshlack/STRetch/issues/25#issuecomment-362132919, or mute the thread https://github.com/notifications/unsubscribe-auth/AhnBnR5gxwLywaZWJ9BVNm2THvO7pfW-ks5tQRmYgaJpZM4R0zXt .

--

hdashnow commented 6 years ago

Yep

ifiddes-10x-zz commented 6 years ago

Alright thanks!

On Wed, Jan 31, 2018 at 5:59 PM Harriet Dashnow notifications@github.com wrote:

Yep

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Oshlack/STRetch/issues/25#issuecomment-362133615, or mute the thread https://github.com/notifications/unsubscribe-auth/AhnBndlxBoEF5eN6daWEj2A-V9wzNaM1ks5tQRqPgaJpZM4R0zXt .

--

ifiddes-10x-zz commented 6 years ago

I am still getting the same error in goleft with the correct bed file. Any ideas?

panic: runtime error: index out of range
goroutine 1 [running]:
github.com/brentp/goleft/covmed.BamInsertSizes(0xc42009a370, 0x186a0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
        /home/brentp/go/src/github.com/brentp/goleft/covmed/main.go:106 +0x6c5
github.com/brentp/goleft/covmed.Main()
        /home/brentp/go/src/github.com/brentp/goleft/covmed/main.go:153 +0x5d6
main.main()
        /home/brentp/go/src/github.com/brentp/goleft/main.go:65 +0x191

============================= Stage STR_coverage (NA24385_longranger) ==============================

=========================== Stage STR_locus_counts (NA24385_longranger) ============================
Traceback (most recent call last):
  File "/mnt/home/ian/yard/STR/STRetch/scripts/identify_locus.py", line 165, in <module>
    main()
  File "/mnt/home/ian/yard/STR/STRetch/scripts/identify_locus.py", line 93, in main
    readlen, count_noCIGAR = detect_readlen(bamfile)
TypeError: 'NoneType' object is not iterable
Cleaned up file NA24385.locus_counts to .bpipe/trash/NA24385.locus_counts
ERROR: Command failed with exit status = 1 :

STRPATH=$PATH; PATH=/mnt/home/ian/yard/STR/STRetch/tools/bin:$PATH; /mnt/home/ian/yard/STR/STRetch/tools/bin/python /mnt/home/ian/yard/STR/STRetch/scripts/identify_locus.py --bam NA24385.bam --bed /mnt/home/ian/yard/STR/STRetch/reference-data/hg19.simpleRepeat_period1-6_dedup.sorted.bed --output NA24385.locus_counts ;PATH=$STRPATH
hdashnow commented 6 years ago

That's an error I haven't seen before. You can see which commands it was trying to run incommandlog.txt in the working directory. I looks like goleft is complaining about reads with zero insert sizes. And later the STR_locus_counts is complaining about a NoneType object when reading the bamfile. Any chance the intermediate bam file is empty? Check the bam file feeding into that stage has reads in it, and that they have pairs.

ifiddes-10x-zz commented 6 years ago

That appears to be the case. The run folder ends up with 128 smaller BAMs from the samtools collate step, but the output FASTQ are empty. So something is wrong with this step cat <( /mnt/home/ian/yard/STR/STRetch/tools/bin/samtools view -hu -L combined_strs.slop.bed ../NA24385_longranger.GRCh37.bam ) <( /mnt/home/ian/yard/STR/STRetch/tools/bin/samtools view -u -f 4 ../NA24385_longranger.GRCh37.bam ) | /mnt/home/ian/yard/STR/STRetch/tools/bin/samtools collate -Ou -n 128 - NA24385_L001_R1.fastq | /mnt/home/ian/yard/STR/STRetch/tools/bin/bedtools bamtofastq -i - -fq >(gzip -c > NA24385_L001_R1.fastq.gz) -fq2 >(gzip -c > NA24385_L001_R2.fastq.gz). I can check any of the 128 smaller BAMs and see that they contain valid, paired reads. I am running the above step by hand to see if I can see what happens.

On Thu, Feb 1, 2018 at 2:37 PM Harriet Dashnow notifications@github.com wrote:

That's an error I haven't seen before. You can see which commands it was trying to run incommandlog.txt in the working directory. I looks like goleft is complaining about reads with zero insert sizes. And later the STR_locus_counts is complaining about a NoneType object when reading the bamfile. Any chance the intermediate bam file is empty? Check the bam file feeding into that stage has reads in it, and that they have pairs.

— You are receiving this because you modified the open/close state.

Reply to this email directly, view it on GitHub https://github.com/Oshlack/STRetch/issues/25#issuecomment-362425706, or mute the thread https://github.com/notifications/unsubscribe-auth/AhnBnSxrJ7EAA_oARtvPcu253bc0jmnHks5tQjy_gaJpZM4R0zXt .

--

hdashnow commented 6 years ago

Hi Ian,

Okay, that makes sense. If that step is failing, no reads are getting extracted so the later stages fail. I've had the same error reported by another user, also using hg38. I'm taking a look at his data today, so hopefully I can nut it out. Those small bam files are a by-product of the samtools collate sorting algorithm, so probably aren't much help. It does tell us the stage got up to sorting.

hdashnow commented 6 years ago

Hey Ian,

I have an idea, and if this is the problem I'm going to feel a bit silly! Are you using the pipeline on a cluster where jobs are given limited walltime?

I suspect the walltime for that job is set to the default, 2 hours. So it's getting cut off mid-way through. I previously only tested the STRetch WGS pipelines on a cluster that doesn't restrict walltime.

If you wouldn't mind editing the following:

In STRetch/pipelines/pipeline_stages.groovy, add , "bwamem" to the end of the triple quotes in extract_reads_region. It should be line 175 if you have the version that's currently in master. The stage should now look like this:

extract_reads_region = {

    doc "Extract reads from bam region + unaligned"

    def fastaname = get_fname(REF)

    produce(branch.sample + '_L001_R1.fastq.gz', branch.sample + '_L001_R2.fastq.gz') {
        exec """
            cat <( $samtools view -hu -L $input.bed $input.bam )
                <( $samtools view -u -f 4 $input.bam ) |
            $samtools collate -Ou -n 128 - $output.prefix |
            $bedtools bamtofastq -i - -fq >(gzip -c > $output1.gz) -fq2 >(gzip -c > $output2.gz)
        """,  "bwamem"
    }
}

And just to be on the safe side, let's increase the walltime to 48 hours. In STRetch/pipelines/bpipe.config, change the walltime for bwa, so it looks like this:


    bwamem {
        walltime="48:00:00"
        procs=8
    }```

I'm testing this too, but obviously it will take a day or two to get a result!
ifiddes-10x-zz commented 6 years ago

I am running it on a local machine. I will try your changes. I ran the command from the command log by hand and got 'Error reading input file':

bespin1 Thu Feb 01 14:48 test $cat <(
/mnt/home/ian/yard/STR/STRetch/tools/bin/samtools view -hu -L
combined_strs.slop.bed ../NA24385_longranger.GRCh37.bam ) <(
/mnt/home/ian/yard/STR/STRetch/tools/bin/samtools view -u -f 4
../NA24385_longranger.GRCh37.bam ) |
/mnt/home/ian/yard/STR/STRetch/tools/bin/samtools collate -Ou -n 128 -
NA24385_L001_R1.fastq | /mnt/home/ian/yard/STR/STRetch/tools/bin/bedtools
bamtofastq -i - -fq >(gzip -c > NA24385_L001_R1.fastq.gz) -fq2 >(gzip -c >
NA24385_L001_R2.fastq.gz)
Error reading input file

On Thu, Feb 1, 2018 at 4:27 PM Harriet Dashnow notifications@github.com wrote:

Hey Ian,

I have an idea, and if this is the problem I'm going to feel a bit silly! Are you using the pipeline on a cluster where jobs are given limited walltime?

I suspect the walltime for that job is set to the default, 2 hours. So it's getting cut off mid-way through. I previously only tested the STRetch WGS pipelines on a cluster that doesn't restrict walltime.

If you wouldn't mind editing the following:

In STRetch/pipelines/pipeline_stages.groovy, add , "bwamem" to the end of the triple quotes in extract_reads_region. It should be line 175 if you have the version that's currently in master. The stage should now look like this:

extract_reads_region = {

doc "Extract reads from bam region + unaligned"

def fastaname = get_fname(REF)

produce(branch.sample + '_L001_R1.fastq.gz', branch.sample + '_L001_R2.fastq.gz') {
    exec """
        cat <( $samtools view -hu -L $input.bed $input.bam )
            <( $samtools view -u -f 4 $input.bam ) |
        $samtools collate -Ou -n 128 - $output.prefix |
        $bedtools bamtofastq -i - -fq >(gzip -c > $output1.gz) -fq2 >(gzip -c > $output2.gz)
    """,  "bwamem"
}

}

And just to be on the safe side, let's increase the walltime to 48 hours. In STRetch/pipelines/bpipe.config, change the walltime for bwa, so it looks like this:

bwamem {
    walltime="48:00:00"
    procs=8
}```

I'm testing this too, but obviously it will take a day or two to get a result!

— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/Oshlack/STRetch/issues/25#issuecomment-362447247, or mute the thread https://github.com/notifications/unsubscribe-auth/AhnBnbUc2P3WYgZcy0K1GmQc5QKywnVTks5tQlaCgaJpZM4R0zXt .

--

ifiddes-10x-zz commented 6 years ago

Those changes did not work. In the end, I am still seeing lots of small (~30mb) bam files, showing that samtools collate is doing its job properly and shuffling the reads, but the bedtools bamtofastq step doesn't work -- I end up with two empty .fastq.gz files. goleft then fails on these empty files.

hdashnow commented 6 years ago

Thanks, Ian. I've been able to replicate the error now. I'll let you know when I have a fix.

hdashnow commented 6 years ago

I have made some changes to the code to try to address this issue. Could you please update STRetch, re-install it (including the reference data) and the try again?

From the STRetch directory:

# Get the latest version
git pull 
# Delete old installation and reference data
rm -rf tools reference-data
# Reinstall STRetch (including reference-data)
./install.sh

  Note that pipeline_config.groovy will be overwritten by the install script. If you have made any changes to this (such as setting the control file) you will need to edit it again.   Let me know if you continue to have issues.