AfshinLab / BLR

MIT License

Add NAIBR for LSV calling #40

Closed: pontushojer closed this 3 years ago

pontushojer commented 4 years ago

This builds on the branch lsv-calling by @FrickTobias.

Relates to https://github.com/FrickTobias/BLR/issues/131

Phased reads are used here to call structural variants with NAIBR. NAIBR must be run from within its base directory, so the working directory is changed in the lsv_calling rule. The NAIBR package can be provided through the config (if you want to run a specific version); otherwise the latest version is temporarily cloned for use. As NAIBR also has specific dependencies, such as Python 2.7, a separate environment has been added, which is built and used by Snakemake for this particular rule.
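As a rough sketch, a rule of this shape could express the directory change and the dedicated environment (rule, file, and path names here are illustrative, not the PR's actual code):

```
rule lsv_calling:
    # Hypothetical sketch: run NAIBR from inside its base directory,
    # with a dedicated Python 2.7 conda environment used for this rule only.
    input:
        config="naibr.config"       # points NAIBR at the phased BAM
    output:
        "naibr_sv_calls.tsv"
    conda:
        "envs/naibr.yaml"           # separate env pinning python=2.7
    shell:
        # NAIBR must be invoked from its own base directory, hence the cd.
        "cd NAIBR && python NAIBR.py ../{input.config}"
```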

The output from NAIBR is labelled BEDPE but is not actually in BEDPE format; it is a TSV. I have therefore added a rule to translate the TSV data into BEDPE for easier visualisation in, for example, IGV.
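A minimal sketch of such a translation, assuming NAIBR emits a header row with columns like Chr1, Break1, Chr2, Break2 and a Score column (column names here are assumptions; check against the actual NAIBR output before use):

```python
# Hedged sketch: convert NAIBR's tab-separated output into BEDPE lines.
# Column names (Chr1, Break1, Chr2, Break2, Score) are assumptions and
# must be adjusted to the real NAIBR header.
import csv
import io


def naibr_tsv_to_bedpe(tsv_text):
    """Translate NAIBR TSV records into BEDPE (0-based, half-open)."""
    bedpe_lines = []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for i, row in enumerate(reader, start=1):
        pos1 = int(row["Break1"])
        pos2 = int(row["Break2"])
        bedpe_lines.append("\t".join([
            row["Chr1"], str(pos1 - 1), str(pos1),  # breakpoint 1 as 1-bp interval
            row["Chr2"], str(pos2 - 1), str(pos2),  # breakpoint 2 as 1-bp interval
            f"SV{i}",                               # generated record name
            row.get("Score", "."),                  # score column if present
        ]))
    return "\n".join(bedpe_lines)
```

The resulting file loads directly into IGV as a BEDPE track.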

pontushojer commented 4 years ago

Now the checks have passed, but the total runtime for the tests is about 8 min, which is ~2 min longer than before. I will open a new issue about reducing the test runtime, but for this PR I will leave it as is.

HSiga commented 4 years ago

I got an error while testing this branch (lsv-calling) on blr-testdata-0.5: Error in rule markduplicates, jobid: 10

mapped.mark_duplicates.log

[Thu Sep 24 15:32:38 CEST 2020] MarkDuplicates BARCODE_TAG=BX INPUT=[mapped.sorted.tag.bcmerge.bam] OUTPUT=mapped.sorted.tag.bcmerge.mkdup.bam METRICS_FILE=mapped.sorted.tag.bcmerge.mkdup_metrics.txt ASSUME_SORTED=true USE_JDK_DEFLATER=true USE_JDK_INFLATER=true    MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false ADD_PG_TAG_TO_READS=true REMOVE_DUPLICATES=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Thu Sep 24 15:32:38 CEST 2020] Executing as humam@rackham3.uppmax.uu.se on Linux 3.10.0-1127.19.1.el7.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12; Deflater: Jdk; Inflater: Jdk; Provider GCS is not available; Picard version: 2.23.4
INFO    2020-09-24 15:32:39     MarkDuplicates  Start of doWork freeMemory: 502246952; totalMemory: 514850816; maxMemory: 954728448
INFO    2020-09-24 15:32:39     MarkDuplicates  Reading input file and constructing read end information.
INFO    2020-09-24 15:32:39     MarkDuplicates  Will retain up to 2946692 data points before spilling to disk.
[Thu Sep 24 15:32:39 CEST 2020] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=514850816
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.samtools.SAMFormatException: Error parsing SAM header. Problem parsing @PG key:value pair. Line:
@PG     PN:bowtie2      ID:bowtie2      VN:     CL:"/home/humam/miniconda3/envs/blr_lsv/bin/bowtie2-align-s --wrapper basic-0 -p 4 --rg-id 1 --rg LB:blr --rg SM:20 --rg PU:unit1 --rg PL:ILLUMINA --reorder --maxins 2000 -x /home/humam/private/scilife/AK_HS/human_genome/genome.fa -1 trimmed.barcoded.1.fastq.gz -2 trimmed.barcoded.2.fastq.gz"; File /domus/h1/humam/private/scilife/blr_lsv/BLR/blr-testdata-0.5/outblr/mapped.sorted.tag.bcmerge.bam; Line number 198
        at htsjdk.samtools.SAMTextHeaderCodec.reportErrorParsingLine(SAMTextHeaderCodec.java:258)
        at htsjdk.samtools.SAMTextHeaderCodec.access$200(SAMTextHeaderCodec.java:46)
        at htsjdk.samtools.SAMTextHeaderCodec$ParsedHeaderLine.<init>(SAMTextHeaderCodec.java:307)
        at htsjdk.samtools.SAMTextHeaderCodec.decode(SAMTextHeaderCodec.java:97)
        at htsjdk.samtools.BAMFileReader.readHeader(BAMFileReader.java:704)
        at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:298)
        at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:176)
        at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:396)
        at picard.sam.markduplicates.util.AbstractMarkDuplicatesCommandLineProgram.openInputs(AbstractMarkDuplicatesCommandLineProgram.java:262)
        at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:508)
        at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:257)
        at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:301)
        at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
        at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)
pontushojer commented 4 years ago

> I got an error while testing this branch (lsv-calling) on blr-testdata-0.5:

@HSiga You are probably using the wrong bowtie2 version; Picard fails for bowtie2 > 2.4 (see https://github.com/BenLangmead/bowtie2/issues/301). I have limited the version in the current environment.yaml file, so try updating your conda env with it.
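As a hedged sketch, the version bound in a conda environment.yaml would look something like this (the exact bound and file contents in the PR may differ):

```
# Illustrative conda environment fragment, not the PR's actual file.
# Cap bowtie2 below 2.4 so its @PG header line keeps a VN field that
# Picard/htsjdk can parse under STRICT validation.
channels:
  - bioconda
  - conda-forge
dependencies:
  - bowtie2<2.4
  - picard
```

Updating an existing environment with `conda env update -f environment.yaml` should then downgrade bowtie2 in place.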

pontushojer commented 3 years ago

I have run tests locally that pass, as well as several large-scale tests on real data. This seems stable enough to merge.