PapenfussLab / gridss

GRIDSS: the Genomic Rearrangement IDentification Software Suite
Other
255 stars 71 forks source link

Got results, what now? #141

Closed yeroslaviz closed 6 years ago

yeroslaviz commented 6 years ago

Hi, I am not sure, whether this is the right place to put such a question, but i am not sure how to proceed. I have done the gridss analysis and got loads of files and results. How can I make sense out of it? Is there a way to conveniently visualize or filter the results to get only specific Indels?

As mentioned in earlier questions, I am mainly interested in possible insertions of my transgene in the mouse genome. You mentioned Socrates for this kind of analysis, but is it possible to extract the same results from the gridss analysis. The transgene size is approx. 6.2Mb. Does it make sense to look for Indels of this size in the vcf file?

thanks Assa

d-cameron commented 6 years ago

You mentioned Socrates for this kind of analysis, but is it possible to extract the same results from the gridss analysis.

Yes. The Socrates paper included a transgene analysis example. We repeated this example for GRIDSS and identified all transgene-related breakpoints. GRIDSS is preferable for this sort of analysis as it will give you longer breakpoint assemblies.

The transgene size is approx. 6.2Mb. Does it make sense to look for Indels of this size in the vcf file?

No. The only way you'll get a 6.2Mb insert called from short read sequence data is from de novo assembly and event then, you'd need to be lucky to get a contig that long.

For transgene insertion site identification you want to:

Insertion sites can then be identified by looking for inter-chromosomal variants between the transgene and the reference genome.

yeroslaviz commented 6 years ago

For transgene insertion site identification you want to:

  • add transgene sequence as an additional record in your fasta file
  • align to the combined fasta containing reference + transgene
  • call structural variants

This I have done with the gridss workflow. Now I have a lot of results files. I am sorry to say, that I can't understand the workflow in the README.md file to understand how to sieve through the results and identify what i am looking for. I have run the script

#!/bin/bash
#
# Example gridss pipeline for single sample analysis
#
INPUT=../bwa.mapped/trimmomatic.combined.sorted.bam
#BLACKLIST=wgEncodeDacMapabilityConsensusExcludable.bed
REFERENCE=/home/yeroslaviz/projects/Guru/WGS/bwa.Indexed/combined/Combined.fa
OUTPUT=${INPUT/.bam/.sv.vcf}
ASSEMBLY=${OUTPUT/.sv.vcf/.gridss.assembly.bam}
GRIDSS_JAR=~/software/gridss-1.5.1-jar-with-dependencies.jar

if [[ ! -f "$INPUT" ]] ; then
        echo "Missing $INPUT input file."
        exit 1
fi
if ! which bwa >/dev/null 2>&1 ; then
        echo "Missing bwa. Please add to PATH"
        exit 1
fi
if [[ ! -f "$REFERENCE" ]] ; then
        echo "Missing reference genome $REFERENCE. Update the REFERENCE variable in the shell script to your hg19 location"
        echo "For real data, please ensure that all BAM files are aligned to the same reference, and the reference supplied to GRIDSS matches that used for alignment."
        exit 1
fi
if [[ ! -f "$REFERENCE.bwt" ]] ; then
        echo "Missing bwa index for $REFERENCE. Could not find $REFERENCE.bwt. Create a bwa index (using \"bwa index $REFERENCE\") or symlink the index files to the expected file names."
        exit 1
fi
if [[ ! -f $GRIDSS_JAR ]] ; then
        echo "Missing $GRIDSS_JAR. Update the GRIDSS_JAR variable in the shell script to your location"
        exit 1
fi
if ! which java >/dev/null 2>&1 ; then
        echo "Missing java. Please add java 1.8 or later to PATH"
        exit 1
fi
JAVA_VERSION="$(java -version 2>&1 | grep version )"
if [[ ! "$JAVA_VERSION" =~ "\"1.8" ]] ; then
        echo "Detected $JAVA_VERSION. GRIDSS requires Java 1.8 or later."
        exit 1
fi
if ! which Rscript >/dev/null 2>&1 ; then
        echo "Missing R installation. Please add Rscript to PATH"
        exit 1
fi

java -ea -Xmx31g -Dsamjdk.create_index=true -Dsamjdk.use_async_io_read_samtools=true -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=true -Dgridss.gridss.output_to_temp_file=true -cp $GRIDSS_JAR gridss.CallVariants TMP_DIR=. R="$REFERENCE" WORKING_DIR=. INPUT="$INPUT" OUTPUT="$OUTPUT" ASSEMBLY="$ASSEMBLY" 2>&1 | tee -a gridss.$HOSTNAME.$$.log
#       BLACKLIST="$BLACKLIST" \

The first few lines just checks for existing files and the last line is the gridss script. This ran w.o any errors. I have got a lot of folders and files. The important ones are probably the vcf and bed files. Can you please help me to get a clear picture where and what to look for in this chaos?

thanks again

d-cameron commented 6 years ago

The VCF is the primary output file. Lets assume your transgene contig is call chr_transgene. If you've got decent coverage and neither your insertion site nor your transgene vector is unmappable, then the simplest way to look for the transgene insertion site is to just grep chr_transgene output.vcf | grep -v LOW_QUAL and look at the highest scoring variants in that subset. You should find high scoring variants (probably QUAL > 1000) at the start and end of the transgene that go to almost the same genomic location - that's your insertion site. Remember that it's possible to have multiple insertion sites.

By default, GRIDSS reports many many variants, the vast majority of these are low quality variants that you can ignore (hence the LOW_QUAL filter I recommended). In some cases, you might want to look at them. For example, if you found one side of the transgene insertion but not the other, you'll want to look for nearby low quality variants that complete the insertion.

I strongly recommend reading section 5.4 of https://samtools.github.io/hts-specs/VCFv4.3.pdf so you have an idea of what you're looking at.

If you want to know the exact transgene insertion sequence, that can be found in the ASSEMBLY file with the read name matching the BEID field of the transgene breakpoint call (you'll probably have more than one assembly but they should have the same sequence).

yeroslaviz commented 6 years ago

Thanks for the detailed response. Unfrotunately this is what I get when grepping for results:

$grep transgene cleaned.combined.sorted.sv.vcf
##contig=<ID=transgene,length=6252>
12      113428752       gridss56_9234o  T       ]transgene:6194]T       110.59  LOW_QUAL;SINGLE_ASSEMBLY        AS=0;ASQ=0.00;ASRP=6;ASSR=0;BA=0;BAQ=0.00;BEID=asm274-18991;BQ=0.00;BSC=0;BSCQ=0.00;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CIPOS=-57,58;CIRPOS=-57,58;CQ=110.59;EVENT=gridss56_9234;IC=0;IMPRECISE;IQ=0.00;PARID=gridss56_9234h;RAS=1;RASQ=94.16;REF=32;REFPAIR=18;RP=1;RPQ=16.43;SC=1X114N1X76M;SR=0;SRQ=0.00;SVTYPE=BND
        GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ        .:0.00:6:0:0.00:0.00:0:0.00:0:0.00:0.00:0:0.00:110.59:94.16:32:18:1:16.43:0:0.00
7       36210719        gridss210_220o  A       A]transgene:6194]       125.73  LOW_QUAL;SINGLE_ASSEMBLY        AS=1;ASQ=48.00;ASRP=3;ASSR=0;BA=0;BAQ=0.00;BEID=asm210-15597;BQ=199.80;BSC=10;BSCQ=199.80;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CIPOS=-191,192;CIRPOS=-57,58;CQ=125.73;EVENT=gridss210_220;IC=0;IMPRECISE;IQ=0.00;PARID=gridss210_220h;RAS=0;RASQ=0.00;REF=7;REFPAIR=7;RP=5;RPQ=77.73;SC=318M1X382N1X;SR=0;SRQ=0.00;SVTYPE=BND  GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ        .:48.00:3:0:0.00:199.80:10:199.80:0:0.00:0.00:0:0.00:125.73:0.00:7:7:5:77.73:0:0.00
transgene       6194    gridss210_220h  A       A]7:36210719]   125.73  LOW_QUAL;SINGLE_ASSEMBLY        AS=0;ASQ=0.00;ASRP=3;ASSR=0;BA=0;BAQ=0.00;BEID=asm210-15597;BQ=0.00;BSC=0;BSCQ=0.00;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CIPOS=-57,58;CIRPOS=-191,192;CQ=125.73;EVENT=gridss210_220;IC=0;IMPRECISE;IQ=0.00;PARID=gridss210_220o;RAS=1;RASQ=48.00;REF=0;REFPAIR=0;RP=5;RPQ=77.73;SC=273M121D197M1X114N1X;SR=0;SRQ=0.00;SVTYPE=BND       GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ        .:0.00:3:0:0.00:0.00:0:0.00:0:0.00:0.00:0:0.00:125.73:48.00:0:0:5:77.73:0:0.00
transgene       6194    gridss56_9234h  A       A[12:113428752[ 110.59  LOW_QUAL;SINGLE_ASSEMBLY        AS=1;ASQ=94.16;ASRP=6;ASSR=0;BA=0;BAQ=0.00;BEID=asm274-18991;BQ=0.00;BSC=0;BSCQ=0.00;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CIPOS=-57,58;CIRPOS=-57,58;CQ=110.59;EVENT=gridss56_9234;IC=0;IMPRECISE;IQ=0.00;PARID=gridss56_9234o;RAS=0;RASQ=0.00;REF=0;REFPAIR=0;RP=1;RPQ=16.43;SC=151M39D1X114N1X;SR=0;SRQ=0.00;SVTYPE=BND    GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ  .:94.16:6:0:0.00:0.00:0:0.00:0:0.00:0.00:0:0.00:110.59:0.00:0:0:1:16.43:0:0.00

Does this means, that no insertion site was identified with good enough quality to be sure about the event? Is there a way to re-run the analysis with less strict parameters to try and find more hits?

thanks again for all the help Assa

d-cameron commented 6 years ago

The transgene size is approx. 6.2Mb.

Your transgene is 6.2Kb. Did you mean Kb, or Mb? I'm going to assume it's a 6kb transgene.

What depth of coverage do you have? Those two insertion sites aren't actually that terrible if you don't have high coverage.

Is there a way to re-run the analysis with less strict parameters to try and find more hits?

I recommend adding CONFIGURATION_FILE=gridss.properties to your GRIDSS command line and creating a gridss.properties file containing the following:

variantcalling.callBreakends = true

Delete all the intermediates/outputs from the variant calling stage (rm -r *vcf*), rerun GRIDSS (you don't need to delete the input file .gridss.working nor any of the assemblies so this rerun should be fast), then grep transgene cleaned.combined.sorted.sv.vcf. Your output should now include single breakends (see Section 5.4.9 of https://samtools.github.io/hts-specs/VCFv4.3.pdf) which you can use to confirm/BLAST the insertion sites.

d-cameron commented 6 years ago

Another option is that your transgene (TG) has a strong sequence homology with a gene (G) in the reference genome. If this is the case, then all the reads from both TG and G will have very poor mapping quality. Check in IGV what the MAPQ score for the reads mapping to transgene are. If they're mostly low (<=10) that's a problem.

The simplest solution to this problem is to replace the homologous sequence G in the reference genome with a string of N bases. Aligning to the ref + TG with G masked out will prevent the MAPQ problem but all the reads from G will be placed in TG by the aligner. This isn't actually a big problem as you know the location of G in the reference genome so you can just ignore the breakpoints that say TG is connected to the flanking sequence of G.

yeroslaviz commented 6 years ago

Your transgene is 6.2Kb. Did you mean Kb, or Mb? I'm going to assume it's a 6kb transgene.

Yes sorry, 6252bp for precision.

What depth of coverage do you have? Those two insertion sites aren't actually that terrible if you don't have high coverage.

I think this is not such a straight forward answer. How do I check the depth of the coverage? Do you mean reads/bp? so basically : "how many bp [#read * length(reads)] were mapped" / "how many bp I have in the genome + transgene".

$samtools flagstat -@12 cleaned.combined.sorted.bam

727276955 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
4632165 + 0 supplementary
0 + 0 duplicates
725864438 + 0 mapped (99.81% : N/A)
722644790 + 0 paired in sequencing
361322395 + 0 read1
361322395 + 0 read2
707230990 + 0 properly paired (97.87% : N/A)
720039774 + 0 with itself and mate mapped
1192499 + 0 singletons (0.17% : N/A)
10813600 + 0 with mate mapped to a different chr
6387215 + 0 with mate mapped to a different chr (mapQ>=5)

'only' ~10M reads mapped to different chromosomes (~3%). Should this be enough?

Do you think it makes more sense to first map my paired-end sample to only the transgene sequence. Than use the unmapped reads (either mapping with bowtie, which creates an unmapped files or using samtools view -f 4 bam > unmapped.sam to extract unmapped reads from a bwa-mapped bam file) and map only them to the mouse genome (only).

I can than merge the two bam files and use the merged.bam as the input file for gridss. This will give me better mapping for the transgene.

Will gridss be able to identify inter-chromosomal events between the transgene and genomic chromosomes from this kind of mapping?

d-cameron commented 6 years ago

How do I check the depth of the coverage?

Just a simple #reads * length(reads) / length(genome).

If you've use 2x100bp sequencing then you have ~25x coverage.

I'm attempting to work out what your data issue is.

If you look at your transgene in IGV, what's your coverage? Do you have about 25 reads aligned to each position? 50? 10? What's the mapq of the reads aligned to your transgene?

Do you think it makes more sense to first map my paired-end sample to only the transgene sequence. Then...

Not really. GRIDSS requires the reference genome supplied to GRIDSS matches the reference used to align the input file and it'd be fiddly and error-prone to hack it together.

This will give me better mapping for the transgene.

It'll map all the reads that actually came from homology sequence in the mouse genome to the transgene. You can achieve the same outcome by masking (converting to Ns) the homology regions in your existing combined genome. If you BLAST your 6kp transgenic sequence against the mouse genome, what hits do you get? If you get good hits then that further supports the sequence homology hypothesis.

Aligning to just the transgene is problematic as you'll get a whole lot of false positive soft-clipped reads aligning to the transgene. Any low complexity sequence (e.g polyA) in the transgene will have all the reads from the mouse genome with similar low complexity sequence all mapping to the transgene. Ideally, you want to map to both.

yeroslaviz commented 6 years ago

If you BLAST your 6kb transgenic sequence against the mouse genome, what hits do you get? If you get good hits then that further supports the sequence homology hypothesis.

Here is the BLAST results of my 6.2Kb transgene seq against the mouse genome.

image

The first hit on chr 17 is ~2000 bases at the beginning of the transgene sequence (100% identities), the second hit from chr 12 ~1600 bases long at the end of the transgene sequence99% identities). If I understand it correctly, these are the important two regions to identify the insertion site. Do you think it has such an impact on the mapping results to the transgene?

Another option is that your transgene (TG) has a strong sequence homology with a gene (G) in the reference genome. If this is the case, then all the reads from both TG and G will have very poor mapping quality. Check in IGV what the MAPQ score for the reads mapping to transgene are. If they're mostly low (<=10) that's a problem.

Here are the mapping results for the transgene. The highest peaks is at 37 reads, but at both ends of the sequence it is a much smaller coverage. image

region 1-300 image

region 5900-6252 image

Do you think it would make sense to convert the two hits from the BLAST in to N's and rerun the mapping and the following workflow? It should give me more hits to the ends of the transgene.

d-cameron commented 6 years ago

Do you think it has such an impact on the mapping results to the transgene?

It definitely does. You can see in the IGV screenshot that every single read in the first 2kb of the transgene is multi-mapping (coloured white instead of grey). GRIDSS won't give you any hits at all on that side of the transgene since it ignores multi-mapping reads. A lot of the reads that do align there actually come from the homologous mouse sequence. You can clearly see this by the presence of the het SNV around position 1600. If you mask out the homologous sequences then you'll get all the reads mapping to the transgene, as well as apparent translocations between the transgene and the homology which you can just ignore.

Are the 7 SNV and 3 indels in the transgenic construct expected? Any difference could make the transgenic construct non-functional and it's somewhat concerning that the reads indicate that the sequenced transgene is so different from the transgene fasta.

region 5900-6252 looks better but the presence of mapq 0 reads is somewhat concerning. Is there low complexity sequence at this end of the transgene? What does this region look like in trimmomatic.combined.sorted.gridss.assembly.bam.gridss.working/trimmomatic.combined.sorted.gridss.assembly.bam.sv.bam in IGV if you enable viewing of soft-clipped bases? What you want to see is assembly contigs that have long right-pointing soft clips that hang off the end of the transgene. Are any of these present?

yeroslaviz commented 6 years ago

What does this region look like in trimmomatic.combined.sorted.gridss.assembly.bam.gridss.working/trimmomatic.combined.sorted.gridss.assembly.bam.sv.bam in IGV if you enable viewing of soft-clipped bases?

Here are the images of both ends with soft-clipping of the file you mentioned.

screenshot 2018-06-08 13 20 54 screenshot 2018-06-08 13 19 03

To be honest i am not really sure what i see here.

What you want to see is assembly contigs that have long right-pointing soft clips that hang off the end of the transgene. Are any of these present?

What are the soft-clipping I see in the images? Is each "read" stands here for a contig of reads containing multiple mapped reads. Why are there "only" three reads here?

yeroslaviz commented 6 years ago

So I will try to convert the positions identified in BLAST to N's and re-run the complete analysis. We'll see how better it can get :-)

d-cameron commented 6 years ago

What are the soft-clipping I see in the images? Is each "read" stands here for a contig of reads containing multiple mapped reads.

Each 'read' is a contig assembled from soft clipped reads, split reads, discordant read pairs, indel-containing reads, and read pairs with only one read mapping (i.e. all the SV-supporting reads), that mutually support a putative SV. These are the 'primary' read alignment. These are then realigned and supplementary records are created that align the soft-clipped bases to the other side of the breakpoint.

Why are there "only" three reads here?

In an ideal scenario, you would have two 'reads' (aka contigs) for each breakpoint. One from reads assembled at the transgene location, and one from reads assembled at the genomic insertion site.

To be honest i am not really sure what i see here.

What you see is the chimeric alignment of the assembled contigs.

asm274-18991 was assembled from reads that mapped past the end of the transgene contigs and appears to be a compound breakpoint involving both chr12 and chr7 (the "SupplementaryAlignments" section of the mouse-over).

asm209-77534 looks like it was assembled from reads that mapped to a different location of chr7 and mapped (very badly - mapq=0) to the transgene. Is chr7:28,985,940-28,986,315 flanking one of the regions of homology?

TL:DR it looks like you have chr7:28,986,315 connected to the transgene connected to chr7:26310529-36210210 connected to chr12:113428810. That doesn't seem very plausible to me. A clean transgene insertion would be chr:pos <-> transgene <-> chr:pos+-30ish base pairs (DNA repair can result in a handful of duplicated or removed bases at the insertion site depending on the repair mechanism used).

yeroslaviz commented 6 years ago

asm209-77534 looks like it was assembled from reads that mapped to a different location of chr7 and mapped (very badly - mapq=0) to the transgene. Is chr7:28,985,940-28,986,315 flanking one of the regions of homology?

If I understood the biologist, the first 2Kb of the transgene comes from endogenous mouse genomic sequence from chr 17, the last 1.6Kb from chr12. This is probably why it shows such bad mapq values, but it is not from chr7.

TL:DR it looks like you have chr7:28,986,315 connected to the transgene connected to chr7:26310529-36210210 connected to chr12:113428810. That doesn't seem very plausible to me. A clean transgene insertion would be chr:pos <-> transgene <-> chr:pos+-30ish base pairs (DNA repair can result in a handful of duplicated or removed bases at the insertion site depending on the repair mechanism used).

I'm not really sure how you inffered this from the two breakends. is it possible to have here two independent insertion points?

yeroslaviz commented 6 years ago

I have masked the mentioned regions (chr12 and chr17) from the BLAST results and re-run the analysis. This is how the mapping looks line now in comaprision

screenshot 2018-06-18 11 10 06

The top bam file was done against the masked genome, while the bottom one is the original with the lower qualities.

Instead of now just one contig on each side, I seems to have four/five different contigs. See images and details of each of them (from top to bottom) below:

5 end 3 end

The read information for each of the reads (=contigs) is under this link (if this is not working I can also paste it here).

I can see it looks better now than before, but I am not really sure why I see four contigs now on each side. Are these multiple possible insertion sites?

yeroslaviz commented 6 years ago

When grepping for transgene in the vcf results file I get the following hits:

grep transgene secondRun.maskedGenome/cleaned.combined.masked.sorted.sv.vcf
##contig=<ID=transgene,length=6252>
7       28985940        gridss102_10775o        T       [transgene:5[T  481.85  LOW_QUAL        AS=1;ASQ=189.18;ASRP=10;ASSR=8;BA=0;BAQ=0.00;BEID=asm102-77226,asm273-2053;BQ=59.13;BSC=2;BSCQ=31.68;BUM=1;BUMQ=27.46;CAS=0;CASQ=0.00;CIPOS=0,2;CIRPOS=-2,0;CQ=481.85;EVENT=gridss102_10775;HOMLEN=2;HOMSEQ=TC;IC=0;IHOMPOS=0,2;IQ=0.00;PARID=gridss102_10775h;RAS=1;RASQ=135.16;REF=24;REFPAIR=16;RP=6;RPQ=99.61;SC=1X1N1X373M;SR=3;SRQ=57.90;SVTYPE=BND GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ        .:189.18:10:8:0.00:59.13:2:31.68:1:27.46:0.00:0:0.00:481.85:135.16:24:16:6:99.61:3:57.90
7       36210528        gridss103_684o  G       GAGGAATTCGGGAGCTTGAAGT]transgene:6232]  1317.41 .       AS=2;ASQ=598.95;ASRP=27;ASSR=16;BA=0;BAQ=0.00;BEID=asm103-15595,asm103-15624,asm273-1214;BQ=46.51;BSC=3;BSCQ=46.51;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CQ=1294.64;EVENT=gridss103_684;IC=0;IHOMPOS=21,-21;IQ=0.00;PARID=gridss103_684h;RAS=1;RASQ=166.01;REF=32;REFPAIR=11;RP=17;RPQ=282.22;SC=605M1X;SR=13;SRQ=270.22;SVTYPE=BND     GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ        .:598.95:27:16:0.00:46.51:3:46.51:0:0.00:0.00:0:0.00:1317.41:166.01:32:11:17:282.22:13:270.22
12      113427284       gridss176_3143o G       G[transgene:4643[       3661.30 .       AS=1;ASQ=1280.81;ASRP=75;ASSR=69;BA=0;BAQ=0.00;BEID=asm176-20960,asm273-2055;BQ=117.43;BSC=8;BSCQ=117.43;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CIPOS=-1,0;CIRPOS=-1,0;CQ=3641.37;EVENT=gridss176_3143;HOMLEN=1;HOMSEQ=G;IC=0;IHOMPOS=-1,0;IQ=0.00;PARID=gridss176_3143h;RAS=1;RASQ=1217.11;REF=0;REFPAIR=0;RP=38;RPQ=630.86;SC=470M2X;SR=28;SRQ=532.52;SVTYPE=BND       GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ        .:1280.81:75:69:0.00:117.43:8:117.43:0:0.00:0.00:0:0.00:3661.30:1217.11:0:0:38:630.86:28:532.52
12      113428877       gridss176_6753o A       ]transgene:6232]A       3474.18 .       AS=1;ASQ=1063.75;ASRP=77;ASSR=69;BA=0;BAQ=0.00;BEID=asm176-67713,asm273-1195;BQ=104.22;BSC=7;BSCQ=104.22;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CQ=3420.67;EVENT=gridss176_6753;IC=0;IHOMPOS=0,0;IQ=0.00;PARID=gridss176_6753h;RAS=1;RASQ=1450.90;REF=2;REFPAIR=0;RP=36;RPQ=597.65;SC=1X396M99D140M;SR=20;SRQ=361.88;SVTYPE=BND        GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ  .:1063.75:77:69:0.00:104.22:7:104.22:0:0.00:0.00:0:0.00:3474.18:1450.90:2:0:36:597.65:20:361.88
17      34000277        gridss225_169o  C       C]transgene:2033]       1247.20 .       AS=2;ASQ=669.03;ASRP=27;ASSR=23;BA=0;BAQ=0.00;BEID=asm225-18260,asm225-18261,asm273-1190;BQ=54.91;BSC=0;BSCQ=0.00;BUM=2;BUMQ=54.91;CAS=0;CASQ=0.00;CQ=1230.60;EVENT=gridss225_169;IC=0;IHOMPOS=0,48;IQ=0.00;PARID=gridss225_169h;RAS=1;RASQ=141.71;REF=20;REFPAIR=0;RP=21;RPQ=348.63;SC=533M1X;SR=5;SRQ=87.82;SVTYPE=BND        GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ  .:669.03:27:23:0.00:54.91:0:0.00:2:54.91:0.00:0:0.00:1247.20:141.71:20:0:21:348.63:5:87.82
17      34000324        gridss225_170o  C       C]transgene:1986]       531.47  SINGLE_ASSEMBLY AS=0;ASQ=0.00;ASRP=26;ASSR=1;BA=0;BAQ=0.00;BEID=asm273-1189;BQ=132.98;BSC=9;BSCQ=132.98;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CQ=1145.19;EVENT=gridss225_170;IC=0;IHOMPOS=-48,0;IQ=0.00;PARID=gridss225_170h;RAS=1;RASQ=448.47;REF=0;REFPAIR=0;RP=5;RPQ=83.01;SC=483M1X;SR=0;SRQ=0.00;SVTYPE=BND      GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ  .:0.00:26:1:0.00:132.98:9:132.98:0:0.00:0.00:0:0.00:531.47:448.47:0:0:5:83.01:0:0.00
17      34002304        gridss225_5474o T       [transgene:7[T  1410.46 .       AS=1;ASQ=563.58;ASRP=48;ASSR=15;BA=0;BAQ=0.00;BEID=asm225-67082,asm273-2052;BQ=129.97;BSC=7;BSCQ=102.52;BUM=1;BUMQ=27.46;CAS=0;CASQ=0.00;CQ=1410.46;EVENT=gridss225_5474;IC=0;IHOMPOS=0,0;IQ=0.00;PARID=gridss225_5474h;RAS=1;RASQ=496.29;REF=0;REFPAIR=0;RP=19;RPQ=315.43;SC=1X442M;SR=2;SRQ=35.17;SVTYPE=BND  GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ  .:563.58:48:15:0.00:129.97:7:102.52:1:27.46:0.00:0:0.00:1410.46:496.29:0:0:19:315.43:2:35.17
transgene       5       gridss102_10775h        G       [7:28985940[G   481.85  LOW_QUAL        AS=1;ASQ=135.16;ASRP=10;ASSR=8;BA=0;BAQ=0.00;BEID=asm102-77226,asm273-2053;BQ=0.00;BSC=0;BSCQ=0.00;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CIPOS=-2,0;CIRPOS=0,2;CQ=481.85;EVENT=gridss102_10775;HOMLEN=2;HOMSEQ=GA;IC=0;IHOMPOS=-2,0;IQ=0.00;PARID=gridss102_10775o;RAS=1;RASQ=189.18;REF=0;REFPAIR=0;RP=6;RPQ=99.61;SC=1X1N1X399M;SR=3;SRQ=57.90;SVTYPE=BND     GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ        .:135.16:10:8:0.00:0.00:0:0.00:0:0.00:0.00:0:0.00:481.85:189.18:0:0:6:99.61:3:57.90
transgene       7       gridss225_5474h A       [17:34002304[A  1410.46 .       AS=1;ASQ=496.29;ASRP=48;ASSR=15;BA=0;BAQ=0.00;BEID=asm225-67082,asm273-2052;BQ=194.86;BSC=4;BSCQ=57.58;BUM=5;BUMQ=137.28;CAS=0;CASQ=0.00;CQ=1410.46;EVENT=gridss225_5474;IC=0;IHOMPOS=0,0;IQ=0.00;PARID=gridss225_5474o;RAS=1;RASQ=563.58;REF=3;REFPAIR=0;RP=19;RPQ=315.43;SC=1X555M;SR=2;SRQ=35.17;SVTYPE=BND  GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ  .:496.29:48:15:0.00:194.86:4:57.58:5:137.28:0.00:0:0.00:1410.46:563.58:3:0:19:315.43:2:35.17
transgene       1986    gridss225_170h  G       G]17:34000324]  531.47  SINGLE_ASSEMBLY AS=1;ASQ=448.47;ASRP=26;ASSR=1;BA=0;BAQ=0.00;BEID=asm273-1189;BQ=0.00;BSC=0;BSCQ=0.00;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CQ=1145.19;EVENT=gridss225_170;IC=0;IHOMPOS=0,48;IQ=0.00;PARID=gridss225_170o;RAS=0;RASQ=0.00;REF=38;REFPAIR=7;RP=5;RPQ=83.01;SC=288M1X;SR=0;SRQ=0.00;SVTYPE=BND  GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ  .:448.47:26:1:0.00:0.00:0:0.00:0:0.00:0.00:0:0.00:531.47:0.00:38:7:5:83.01:0:0.00
transgene       2033    gridss225_169h  G       G]17:34000277]  1247.20 .       AS=1;ASQ=141.71;ASRP=27;ASSR=23;BA=0;BAQ=0.00;BEID=asm225-18260,asm225-18261,asm273-1190;BQ=70.72;BSC=5;BSCQ=70.72;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CQ=1230.60;EVENT=gridss225_169;IC=0;IHOMPOS=-48,0;IQ=0.00;PARID=gridss225_169o;RAS=2;RASQ=669.03;REF=22;REFPAIR=7;RP=21;RPQ=348.63;SC=472M1X;SR=5;SRQ=87.82;SVTYPE=BND       GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ  .:141.71:27:23:0.00:70.72:5:70.72:0:0.00:0.00:0:0.00:1247.20:669.03:22:7:21:348.63:5:87.82
transgene       4643    gridss176_3143h A       ]12:113427284]A 3661.30 .       AS=1;ASQ=1217.11;ASRP=75;ASSR=69;BA=0;BAQ=0.00;BEID=asm176-20960,asm273-2055;BQ=70.33;BSC=5;BSCQ=70.33;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CIPOS=-1,0;CIRPOS=-1,0;CQ=3641.37;EVENT=gridss176_3143;HOMLEN=1;HOMSEQ=G;IC=0;IHOMPOS=-1,0;IQ=0.00;PARID=gridss176_3143o;RAS=1;RASQ=1280.81;REF=23;REFPAIR=16;RP=38;RPQ=630.86;SC=2X566M;SR=28;SRQ=532.52;SVTYPE=BND       GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ        .:1217.11:75:69:0.00:70.33:5:70.33:0:0.00:0.00:0:0.00:3661.30:1280.81:23:16:38:630.86:28:532.52
transgene       6232    gridss103_684h  G       GACTTCAAGCTCCCGAATTCCT]7:36210528]      1317.41 .       AS=1;ASQ=166.01;ASRP=27;ASSR=16;BA=0;BAQ=0.00;BEID=asm103-15595,asm103-15624,asm273-1214;BQ=0.00;BSC=0;BSCQ=0.00;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CQ=1294.64;EVENT=gridss103_684;IC=0;IHOMPOS=21,-21;IQ=0.00;PARID=gridss103_684o;RAS=2;RASQ=598.95;REF=1;REFPAIR=0;RP=17;RPQ=282.22;SC=686M1X;SR=13;SRQ=270.22;SVTYPE=BNDGT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ .:166.01:27:16:0.00:0.00:0:0.00:0:0.00:0.00:0:0.00:1317.41:598.95:1:0:17:282.22:13:270.22
transgene       6232    gridss176_6753h G       G[12:113428877[ 3474.18 .       AS=1;ASQ=1450.90;ASRP=77;ASSR=69;BA=0;BAQ=0.00;BEID=asm176-67713,asm273-1195;BQ=220.50;BSC=9;BSCQ=138.13;BUM=3;BUMQ=82.37;CAS=0;CASQ=0.00;CQ=3420.67;EVENT=gridss176_6753;IC=0;IHOMPOS=0,0;IQ=0.00;PARID=gridss176_6753o;RAS=1;RASQ=1063.75;REF=1;REFPAIR=0;RP=36;RPQ=597.65;SC=388M1D151M1X;SR=20;SRQ=361.88;SVTYPE=BND        GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ  .:1450.90:77:69:0.00:220.50:9:138.13:3:82.37:0.00:0:0.00:3474.18:1063.75:1:0:36:597.65:20:361.88

Just to mention in advance, the run was done with the config file and the added row in it variantcalling.callBreakends = true, so it also includes singleend breaks.

Again the same question - do I have here multiple possible insertion points?

thanks a lot for the help

yeroslaviz commented 6 years ago

Re-running the data and setting variantcalling.callBreakends = false gave me these results, which looks a bit better and more to what I was expecting. E.g. I was hoping for two insertion points in chr7 :-)

grep transgene trimmomatic.combined.masked.sorted.sv.vcf | grep -v LOW_QUAL
##contig=<ID=transgene,length=6252>
7       28985942        gridss102_6855o C       [transgene:5[C  537.16  .       AS=2;ASQ=212.64;ASRP=11;ASSR=6;BA=0;BAQ=0.00;BEID=asm102-21303,asm102-21312,asm273-801;BQ=49.49;BSC=2;BSCQ=49.49;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CIPOS=-2,0;CIRPOS=0,2;CQ=537.16;EVENT=gridss102_6855;HOMLEN=2;HOMSEQ=TC;IC=0;IHOMPOS=-1,0;IQ=0.00;PARID=gridss102_6855h;RAS=1;RASQ=161.37;REF=21;REFPAIR=18;RP=6;RPQ=111.67;SC=1X1N1X372M;SR=2;SRQ=51.47;SVTYPE=BND GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ        .:212.64:11:6:0.00:49.49:2:49.49:0:0.00:0.00:0:0.00:537.16:161.37:21:18:6:111.67:2:51.47
7       36210528        gridss103_460o  G       GAGGAATTCGGGAGCTTGAAGT]transgene:6232]  1491.70 .       AS=1;ASQ=680.21;ASRP=22;ASSR=14;BA=0;BAQ=0.00;BEID=asm103-4518,asm273-429;BQ=73.46;BSC=3;BSCQ=73.46;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CQ=1491.70;EVENT=gridss103_460;IC=0;IHOMPOS=21,-21;IQ=0.00;PARID=gridss103_460h;RAS=1;RASQ=167.51;REF=32;REFPAIR=11;RP=15;RPQ=279.19;SC=151M98D356M1X;SR=11;SRQ=364.79;SVTYPE=BND         GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ        .:680.21:22:14:0.00:73.46:3:73.46:0:0.00:0.00:0:0.00:1491.70:167.51:32:11:15:279.19:11:364.79
12      113427284       gridss176_2066o G       G[transgene:4643[       4587.64 .       AS=1;ASQ=1633.33;ASRP=72;ASSR=61;BA=0;BAQ=0.00;BEID=asm176-5709,asm273-803;BQ=190.16;BSC=8;BSCQ=190.16;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CIPOS=-1,0;CIRPOS=-1,0;CQ=4587.64;EVENT=gridss176_2066;HOMLEN=1;HOMSEQ=G;IC=0;IHOMPOS=-1,0;IQ=0.00;PARID=gridss176_2066h;RAS=1;RASQ=1511.15;REF=0;REFPAIR=0;RP=36;RPQ=670.05;SC=341M2X;SR=25;SRQ=773.12;SVTYPE=BND    GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ        .:1633.33:72:61:0.00:190.16:8:190.16:0:0.00:0.00:0:0.00:4587.64:1511.15:0:0:36:670.05:25:773.12
12      113428877       gridss176_4432o A       ]transgene:6232]A       4062.86 .       AS=1;ASQ=1253.83;ASRP=70;ASSR=57;BA=0;BAQ=0.00;BEID=asm176-19980,asm273-425;BQ=144.21;BSC=6;BSCQ=144.21;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CQ=3996.24;EVENT=gridss176_4432;IC=0;IHOMPOS=0,0;IQ=0.00;PARID=gridss176_4432h;RAS=1;RASQ=1699.40;REF=2;REFPAIR=0;RP=33;RPQ=614.21;SC=1X381M174D80M;SR=17;SRQ=495.42;SVTYPE=BND GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ     .:1253.83:70:57:0.00:144.21:6:144.21:0:0.00:0.00:0:0.00:4062.86:1699.40:2:0:33:614.21:17:495.42
17      34000277        gridss225_105o  C       C]transgene:2033]       1342.59 .       AS=2;ASQ=744.76;ASRP=24;ASSR=19;BA=0;BAQ=0.00;BEID=asm225-5027,asm225-5028,asm273-422;BQ=0.00;BSC=0;BSCQ=0.00;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CQ=1342.59;EVENT=gridss225_105;IC=0;IHOMPOS=0,48;IQ=0.00;PARID=gridss225_105h;RAS=1;RASQ=172.58;REF=18;REFPAIR=0;RP=17;RPQ=309.80;SC=436M1X;SR=4;SRQ=115.45;SVTYPE=BND    GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ     .:744.76:24:19:0.00:0.00:0:0.00:0:0.00:0.00:0:0.00:1342.59:172.58:18:0:17:309.80:4:115.45
17      34000324        gridss225_106o  C       C]transgene:1986]       579.44  SINGLE_ASSEMBLY AS=0;ASQ=0.00;ASRP=24;ASSR=1;BA=0;BAQ=0.00;BEID=asm273-421;BQ=182.61;BSC=8;BSCQ=182.61;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CQ=1324.86;EVENT=gridss225_106;IC=0;IHOMPOS=-48,0;IQ=0.00;PARID=gridss225_106h;RAS=1;RASQ=467.77;REF=0;REFPAIR=0;RP=6;RPQ=111.67;SC=483M1X;SR=0;SRQ=0.00;SVTYPE=BND      GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ     .:0.00:24:1:0.00:182.61:8:182.61:0:0.00:0.00:0:0.00:579.44:467.77:0:0:6:111.67:0:0.00
17      34002304        gridss225_3419o T       [transgene:7[T  1634.38 .       AS=1;ASQ=668.18;ASRP=47;ASSR=14;BA=0;BAQ=0.00;BEID=asm225-19363,asm273-800;BQ=174.83;BSC=6;BSCQ=139.38;BUM=1;BUMQ=35.45;CAS=0;CASQ=0.00;CQ=1634.38;EVENT=gridss225_3419;IC=0;IHOMPOS=0,0;IQ=0.00;PARID=gridss225_3419h;RAS=1;RASQ=562.31;REF=0;REFPAIR=0;RP=19;RPQ=350.02;SC=1X416M;SR=2;SRQ=53.87;SVTYPE=BND   GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ     .:668.18:47:14:0.00:174.83:6:139.38:1:35.45:0.00:0:0.00:1634.38:562.31:0:0:19:350.02:2:53.87
transgene       5       gridss102_6855h G       [7:28985942[G   537.16  .       AS=1;ASQ=161.37;ASRP=11;ASSR=6;BA=0;BAQ=0.00;BEID=asm102-21303,asm102-21312,asm273-801;BQ=0.00;BSC=0;BSCQ=0.00;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CIPOS=0,2;CIRPOS=-2,0;CQ=537.16;EVENT=gridss102_6855;HOMLEN=2;HOMSEQ=GA;IC=0;IHOMPOS=0,1;IQ=0.00;PARID=gridss102_6855o;RAS=2;RASQ=212.64;REF=2;REFPAIR=0;RP=6;RPQ=111.67;SC=1X1N1X243M3D171M;SR=2;SRQ=51.47;SVTYPE=BND        GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ        .:161.37:11:6:0.00:0.00:0:0.00:0:0.00:0.00:0:0.00:537.16:212.64:2:0:6:111.67:2:51.47
transgene       7       gridss225_3419h A       [17:34002304[A  1634.38 .       AS=1;ASQ=562.31;ASRP=47;ASSR=14;BA=0;BAQ=0.00;BEID=asm225-19363,asm273-800;BQ=169.86;BSC=3;BSCQ=63.52;BUM=3;BUMQ=106.34;CAS=0;CASQ=0.00;CQ=1634.38;EVENT=gridss225_3419;IC=0;IHOMPOS=0,0;IQ=0.00;PARID=gridss225_3419o;RAS=1;RASQ=668.18;REF=2;REFPAIR=0;RP=19;RPQ=350.02;SC=1X515M;SR=2;SRQ=53.87;SVTYPE=BND   GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ     .:562.31:47:14:0.00:169.86:3:63.52:3:106.34:0.00:0:0.00:1634.38:668.18:2:0:19:350.02:2:53.87
transgene       1986    gridss225_106h  G       G]17:34000324]  579.44  SINGLE_ASSEMBLY AS=1;ASQ=467.77;ASRP=24;ASSR=1;BA=0;BAQ=0.00;BEID=asm273-421;BQ=0.00;BSC=0;BSCQ=0.00;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CQ=1324.86;EVENT=gridss225_106;IC=0;IHOMPOS=0,48;IQ=0.00;PARID=gridss225_106o;RAS=0;RASQ=0.00;REF=29;REFPAIR=7;RP=6;RPQ=111.67;SC=288M1X;SR=0;SRQ=0.00;SVTYPE=BND  GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ     .:467.77:24:1:0.00:0.00:0:0.00:0:0.00:0.00:0:0.00:579.44:0.00:29:7:6:111.67:0:0.00
transgene       2033    gridss225_105h  G       G]17:34000277]  1342.59 .       AS=1;ASQ=172.58;ASRP=24;ASSR=19;BA=0;BAQ=0.00;BEID=asm225-5027,asm225-5028,asm273-422;BQ=84.82;BSC=4;BSCQ=84.82;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CQ=1342.59;EVENT=gridss225_105;IC=0;IHOMPOS=-48,0;IQ=0.00;PARID=gridss225_105o;RAS=2;RASQ=744.76;REF=13;REFPAIR=11;RP=17;RPQ=309.80;SC=411M1X;SR=4;SRQ=115.45;SVTYPE=BND        GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ     .:172.58:24:19:0.00:84.82:4:84.82:0:0.00:0.00:0:0.00:1342.59:744.76:13:11:17:309.80:4:115.45
transgene       4643    gridss176_2066h A       ]12:113427284]A 4587.64 .       AS=1;ASQ=1511.15;ASRP=72;ASSR=61;BA=0;BAQ=0.00;BEID=asm176-5709,asm273-803;BQ=67.98;BSC=3;BSCQ=67.98;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CIPOS=-1,0;CIRPOS=-1,0;CQ=4587.64;EVENT=gridss176_2066;HOMLEN=1;HOMSEQ=G;IC=0;IHOMPOS=-1,0;IQ=0.00;PARID=gridss176_2066o;RAS=1;RASQ=1633.33;REF=20;REFPAIR=16;RP=36;RPQ=670.05;SC=2X566M;SR=25;SRQ=773.12;SVTYPE=BND    GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ        .:1511.15:72:61:0.00:67.98:3:67.98:0:0.00:0.00:0:0.00:4587.64:1633.33:20:16:36:670.05:25:773.12
transgene       6232    gridss103_460h  G       GACTTCAAGCTCCCGAATTCCT]7:36210528]      1491.70 .       AS=1;ASQ=167.51;ASRP=22;ASSR=14;BA=0;BAQ=0.00;BEID=asm103-4518,asm273-429;BQ=0.00;BSC=0;BSCQ=0.00;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CQ=1491.70;EVENT=gridss103_460;IC=0;IHOMPOS=21,-21;IQ=0.00;PARID=gridss103_460o;RAS=1;RASQ=680.21;REF=1;REFPAIR=0;RP=15;RPQ=279.19;SC=151M8D447M1X;SR=11;SRQ=364.79;SVTYPE=BND    GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ .:167.51:22:14:0.00:0.00:0:0.00:0:0.00:0.00:0:0.00:1491.70:680.21:1:0:15:279.19:11:364.79
transgene       6232    gridss176_4432h G       G[12:113428877[ 4062.86 .       AS=1;ASQ=1699.40;ASRP=70;ASSR=57;BA=0;BAQ=0.00;BEID=asm176-19980,asm273-425;BQ=221.43;BSC=6;BSCQ=150.53;BUM=2;BUMQ=70.90;CAS=0;CASQ=0.00;CQ=3996.24;EVENT=gridss176_4432;IC=0;IHOMPOS=0,0;IQ=0.00;PARID=gridss176_4432o;RAS=1;RASQ=1253.83;REF=1;REFPAIR=0;RP=33;RPQ=614.21;SC=443M1D151M1X;SR=17;SRQ=495.42;SVTYPE=BND GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ     .:1699.40:70:57:0.00:221.43:6:150.53:2:70.90:0.00:0:0.00:4062.86:1253.83:1:0:33:614.21:17:495.42
d-cameron commented 6 years ago

I can see it looks better now than before, but I am not really sure why I see four contigs now on each side. Are these multiple possible insertion sites?

Since GRIDSS does breakend assembly, not breakpoint assembly, each clean breakpoint between A and B will have an assembly at A that also maps across to B and an assembly at B that also maps back to A. When viewing the assembly.sv.bam, the contigs will have a chimeric alignment and have split read alignment records at both A and B. Since there are two of them, you'll see 2 assemblies at A and 2 assemblies at B.

So, you've got insertions into chr12, and chr17. Presumably these are the masked regions so they won't be real transgene insertions, these will just be the locations of the mouse homolog.

Again the same question - do I have here multiple possible insertion points?

Doesn't look like it. You have two different insertion locations on chr7, but they're not very close to each other so it doesn't look like a simple clean insertion. You have:

7:28985942 -> transgene -> (untemplated inserted sequence of ACTTCAAGCTCCCGAATTCCT) ->7:36210528

BLASTing the assemblies in question gives nice clean alignments to those location so we quite good confidence in that part of the transgenic insertion. The question is: what happened around that? Have a look at the copy number on chr7. Did you actually lose 7Mb of chr7 where the transgene got inserted (that'd be obvious from even a .tdf coverage file generated using igvtools)? Are there any copy number transitions near the transgene chr7 breakpoint sites? Maybe there's a small extra fragment that got inserted with the transgene. Eg. maybe a few hundred/thousand bases of 7:28985942 sequence got inserted with the transgene at the 7:36210528 site (or maybe the opposite). Are there any breakpoints that go from somewhere near 7:28985942 to somewhere near 7:36210528?

yeroslaviz commented 6 years ago

Would this mean, that these four rows are showing one breakend assembly with an extra sequence inserted?

7       28985942        gridss102_6855o C       [transgene:5[C  537.16  .       AS=2;ASQ=212.64;ASRP=11;ASSR=6;BA=0;BAQ=0.00;BEID=asm102-21303,asm102-21312,asm273-801;BQ=49.49;BSC=2;BSCQ=49.49;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CIPOS=-2,0;CIRPOS=0,2;CQ=537.16;EVENT=gridss102_6855;HOMLEN=2;HOMSEQ=TC;IC=0;IHOMPOS=-1,0;IQ=0.00;PARID=gridss102_6855h;RAS=1;RASQ=161.37;REF=21;REFPAIR=18;RP=6;RPQ=111.67;SC=1X1N1X372M;SR=2;SRQ=51.47;SVTYPE=BND GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ        .:212.64:11:6:0.00:49.49:2:49.49:0:0.00:0.00:0:0.00:537.16:161.37:21:18:6:111.67:2:51.47
7       36210528        gridss103_460o  G       GAGGAATTCGGGAGCTTGAAGT]transgene:6232]  1491.70 .       AS=1;ASQ=680.21;ASRP=22;ASSR=14;BA=0;BAQ=0.00;BEID=asm103-4518,asm273-429;BQ=73.46;BSC=3;BSCQ=73.46;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CQ=1491.70;EVENT=gridss103_460;IC=0;IHOMPOS=21,-21;IQ=0.00;PARID=gridss103_460h;RAS=1;RASQ=167.51;REF=32;REFPAIR=11;RP=15;RPQ=279.19;SC=151M98D356M1X;SR=11;SRQ=364.79;SVTYPE=BND         GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ        .:680.21:22:14:0.00:73.46:3:73.46:0:0.00:0.00:0:0.00:1491.70:167.51:32:11:15:279.19:11:364.79
transgene       5       gridss102_6855h G       [7:28985942[G   537.16  .       AS=1;ASQ=161.37;ASRP=11;ASSR=6;BA=0;BAQ=0.00;BEID=asm102-21303,asm102-21312,asm273-801;BQ=0.00;BSC=0;BSCQ=0.00;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CIPOS=0,2;CIRPOS=-2,0;CQ=537.16;EVENT=gridss102_6855;HOMLEN=2;HOMSEQ=GA;IC=0;IHOMPOS=0,1;IQ=0.00;PARID=gridss102_6855o;RAS=2;RASQ=212.64;REF=2;REFPAIR=0;RP=6;RPQ=111.67;SC=1X1N1X243M3D171M;SR=2;SRQ=51.47;SVTYPE=BND        GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ        .:161.37:11:6:0.00:0.00:0:0.00:0:0.00:0.00:0:0.00:537.16:212.64:2:0:6:111.67:2:51.47
transgene       6232    gridss103_460h  G       GACTTCAAGCTCCCGAATTCCT]7:36210528]      1491.70 .       AS=1;ASQ=167.51;ASRP=22;ASSR=14;BA=0;BAQ=0.00;BEID=asm103-4518,asm273-429;BQ=0.00;BSC=0;BSCQ=0.00;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CQ=1491.70;EVENT=gridss103_460;IC=0;IHOMPOS=21,-21;IQ=0.00;PARID=gridss103_460o;RAS=1;RASQ=680.21;REF=1;REFPAIR=0;RP=15;RPQ=279.19;SC=151M8D447M1X;SR=11;SRQ=364.79;SVTYPE=BND    GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ .:167.51:22:14:0.00:0.00:0:0.00:0:0.00:0.00:0:0.00:1491.70:680.21:1:0:15:279.19:11:364.79

BLASTing the assemblies in question gives nice clean alignments to those location so we quite good confidence in that part of the transgenic insertion.

Are we talking here about chr12 and chr17? Or which assemblies?

To be honest, I'm not really sure what you mean here :-( It is all still quite confusing.

The question is: what happened around that? Have a look at the copy number on chr7. Did you actually lose 7Mb of chr7 where the transgene got inserted (that'd be obvious from even a .tdf coverage file generated using igvtools)?

To try and see this I'm converting the bam file to wig. When this is done I will use igvtools to create the tdf file and post the results here too.

Are there any copy number transitions near the transgene chr7 breakpoint sites? Maybe there's a small extra fragment that got inserted with the transgene. Eg. maybe a few hundred/thousand bases of 7:28985942 sequence got inserted with the transgene at the 7:36210528 site (or maybe the opposite).

Would this be visible in the gridss vcf result file? and if so, where can I see it? If it is in the same vcf file, than I don't think there are any. I pasted a snapshot of the two regions mentioned below. As far I can tell, the next event identified next to the possible transgene insertion position is another BND, but quite far off. I was inclined to say no, but than I took a look at all the hits identified in chr7. Here is a list of all hits between positions 7:28985942-36210528.

7   28985942    gridss102_6855o C   [transgene:5[C  537.16  .   AS=2;ASQ=212.64;ASRP=11;ASSR=6;BA=0;BAQ=0.00;BEID=asm102-21303,asm102-21312,asm273-801;BQ=49.49;BSC=2;BSCQ=49.49;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CIPOS=-2,0;CIRPOS=0,2;CQ=537.16;EVENT=gridss102_6855;HOMLEN=2;HOMSEQ=TC;IC=0;IHOMPOS=-1,0;IQ=0.00;PARID=gridss102_6855h;RAS=1;RASQ=161.37;REF=21;REFPAIR=18;RP=6;RPQ=111.67;SC=1X1N1X372M;SR=2;SRQ=51.47;SVTYPE=BND    GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ    .:212.64:11:6:0.00:49.49:2:49.49:0:0.00:0.00:0:0.00:537.16:161.37:21:18:6:111.67:2:51.47    
7   29002710    gridss102_3550o T   T[7:36181051[   2309.2  .   AS=1;ASQ=786.99;ASRP=34;ASSR=31;BA=0;BAQ=0.00;BEID=asm102-8451,asm103-18360;BQ=69.27;BSC=3;BSCQ=69.27;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CIPOS=-2,0;CIRPOS=-2,0;CQ=2309.20;EVENT=gridss102_3550;HOMLEN=2;HOMSEQ=AT;IC=0;IHOMPOS=-2,0;IQ=0.00;PARID=gridss102_3550h;RAS=1;RASQ=761.91;REF=31;REFPAIR=18;RP=18;RPQ=335.02;SC=117M46D340M1X1N1X;SR=14;SRQ=425.28;SVTYPE=BND   GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ    .:786.99:34:31:0.00:69.27:3:69.27:0:0.00:0.00:0:0.00:2309.20:761.91:31:18:18:335.02:14:425.28   
7   31382711    gridss102_5924o G   ]7:31386712]G   1356.06 .   AS=1;ASQ=351.86;ASRP=49;ASSR=0;BA=0;BAQ=0.00;BEID=asm102-11832,asm102-24636;BQ=35.45;BSC=0;BSCQ=0.00;BUM=1;BUMQ=35.45;CAS=0;CASQ=0.00;CIPOS=-90,0;CIRPOS=-90,0;CQ=1356.06;EVENT=gridss102_5924;HOMLEN=90;HOMSEQ=TCTGTTGAGGGGCATCTAGGTTCTTTCCAGCTTCTGGCTATTATAAATAAGGCTGCTATGAACATAGTGGAGCATGTGTCCTTCTTACCA;IC=0;IHOMPOS=-284,158;IQ=0.00;PARID=gridss102_5924h;RAS=1;RASQ=799.46;REF=28;REFPAIR=11;RP=11;RPQ=204.74;SC=1X89N1X357M;SR=0;SRQ=0.00;SVTYPE=BND GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ    .:351.86:49:0:0.00:35.45:0:0.00:1:35.45:0.00:0:0.00:1356.06:799.46:28:11:11:204.74:0:0.00   
7   31386712    gridss102_5924h A   A[7:31382711[   1356.06 .   AS=1;ASQ=799.46;ASRP=49;ASSR=0;BA=0;BAQ=0.00;BEID=asm102-11832,asm102-24636;BQ=496.28;BSC=0;BSCQ=0.00;BUM=14;BUMQ=496.28;CAS=0;CASQ=0.00;CIPOS=-90,0;CIRPOS=-90,0;CQ=1356.06;EVENT=gridss102_5924;HOMLEN=90;HOMSEQ=TCTGTTGAGGGGCATCTAGGTTCTTTCCAGCTTCTGGCTATTATAAATAAGGCTGCTATGAACATAGTGGAGCATGTGTCCTTCTTACCA;IC=0;IHOMPOS=-284,158;IQ=0.00;PARID=gridss102_5924o;RAS=1;RASQ=351.86;REF=30;REFPAIR=15;RP=11;RPQ=204.74;SC=460M1X89N1X;SR=0;SRQ=0.00;SVTYPE=BND  GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ    .:799.46:49:0:0.00:496.28:0:0.00:14:496.28:0.00:0:0.00:1356.06:351.86:30:15:11:204.74:0:0.00    
7   35635809    gridss103_5728o T   ]7:35635921]T   511.86  .   AS=1;ASQ=331.14;ASRP=0;ASSR=20;BA=0;BAQ=0.00;BEID=asm103-17508,asm103-3635;BQ=152.34;BSC=7;BSCQ=152.34;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CIPOS=0,11;CIRPOS=0,11;CQ=511.86;EVENT=gridss103_5728;HOMLEN=11;HOMSEQ=TGTGTGTGTGT;IC=0;IHOMPOS=-28,65;IQ=0.00;PARID=gridss103_5728h;RAS=1;RASQ=152.00;REF=30;REFPAIR=11;RP=0;RPQ=0.00;SC=1X10N1X88M;SR=1;SRQ=28.73;SVTYPE=BND   GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ    .:331.14:0:20:0.00:152.34:7:152.34:0:0.00:0.00:0:0.00:511.86:152.00:30:11:0:0.00:1:28.73    
7   35635921    gridss103_5728h G   G[7:35635809[   511.86  .   AS=1;ASQ=152.00;ASRP=0;ASSR=20;BA=0;BAQ=0.00;BEID=asm103-17508,asm103-3635;BQ=91.26;BSC=4;BSCQ=91.26;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CIPOS=0,11;CIRPOS=0,11;CQ=511.86;EVENT=gridss103_5728;HOMLEN=11;HOMSEQ=TGTGTGTGTGT;IC=0;IHOMPOS=-28,65;IQ=0.00;PARID=gridss103_5728o;RAS=1;RASQ=331.14;REF=7;REFPAIR=14;RP=0;RPQ=0.00;SC=88M1X10N1X;SR=1;SRQ=28.73;SVTYPE=BND  GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ    .:152.00:0:20:0.00:91.26:4:91.26:0:0.00:0.00:0:0.00:511.86:331.14:7:14:0:0.00:1:28.73   
7   36181051    gridss102_3550h G   ]7:29002710]G   2309.2  .   AS=1;ASQ=761.91;ASRP=34;ASSR=31;BA=0;BAQ=0.00;BEID=asm102-8451,asm103-18360;BQ=20.22;BSC=1;BSCQ=20.22;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CIPOS=-2,0;CIRPOS=-2,0;CQ=2309.20;EVENT=gridss102_3550;HOMLEN=2;HOMSEQ=AT;IC=0;IHOMPOS=-2,0;IQ=0.00;PARID=gridss102_3550o;RAS=1;RASQ=786.99;REF=46;REFPAIR=13;RP=18;RPQ=335.02;SC=1X1N1X547M;SR=14;SRQ=425.28;SVTYPE=BND  GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ    .:761.91:34:31:0.00:20.22:1:20.22:0:0.00:0.00:0:0.00:2309.20:786.99:46:13:18:335.02:14:425.28   
7   36210528    gridss103_460o  G   GAGGAATTCGGGAGCTTGAAGT]transgene:6232]  1491.7  .   AS=1;ASQ=680.21;ASRP=22;ASSR=14;BA=0;BAQ=0.00;BEID=asm103-4518,asm273-429;BQ=73.46;BSC=3;BSCQ=73.46;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CQ=1491.70;EVENT=gridss103_460;IC=0;IHOMPOS=21,-21;IQ=0.00;PARID=gridss103_460h;RAS=1;RASQ=167.51;REF=32;REFPAIR=11;RP=15;RPQ=279.19;SC=151M98D356M1X;SR=11;SRQ=364.79;SVTYPE=BND   GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ    .:680.21:22:14:0.00:73.46:3:73.46:0:0.00:0.00:0:0.00:1491.70:167.51:32:11:15:279.19:11:364.79   

The first and last rows are the breakends with the transgene (the complete list can be found here). If I can interpret the results correctly, the event gridss102_3550 span almost across all the region between the two insertion positions.

Are there any breakpoints that go from somewhere near 7:28985942 to somewhere near 7:36210528?

7:28985942

7 28985942

7:36210528

7 36210528

I hope you can make more sense in it than I can. I am not even sure where to start for now. thanks a lot for all the help.

d-cameron commented 6 years ago

Are we talking here about chr12 and chr17? Or which assemblies? To be honest, I'm not really sure what you mean here :-( It is all still quite confusing.

The chr7 ones. I got the breakpoint contig sequences from the read information in the file you attached, BLASTed them to the mouse genome and confirmed that there was not any ambiguity in the insertion sites.

Would this be visible in the gridss vcf result file? and if so, where can I see it?

No, GRIDSS doesn't to CN calling (yet). You'll need to use a CNV calling tool for that.

You don't have a clean transgenic insertion, you're now at the point that you need work out what else got rearranged when the transgene got inserted.

7:28985942

That's a pretty clear copy number change between those breakpoints there.You have the DNA fragment 7:28985942-7:29002710 inserted with your transgene (connected to the start of the transgene).

Following gridss102_3550 from 7:29002710 leads to 7:36181051 and continuing on to the right until we get to 7:36210528

So: consider the following DNA fragments:

A: 7:start of chr to 28985941 B: 7:28985942-29002710 C: 7:29002711-36181050 D: 7:36181051-36210528 E: 7:36210529 to end of chr T: AGGAATTCGGGAGCTTGAAGT followed by transgene:6232-1 (note that transgene transcript is on the negative strand)

There are multiple possible transgene insertion configurations. You have either:

a) A circular double minute BDT

b) ABDTBCDE

c) ABCDTBDE

There's no way to tell which one it is from this data since we can't phase the three breakpoints involved in the insertion (gridss102_6855, gridss102_3550, gridss103_460).

All scenarios results in a copy number increase for DNA segments B and D. You should confirm this is the case with the .tdf you generate (you can generate one directly from the BAM file) or a CNV tool.

I hope you can make more sense in it than I can. I am not even sure where to start for now.

I recommend drawing out the DNA segments and breakpoint connections.

thanks a lot for all the help.

Not a problem. Always happy to be credited with helping with the analysis ;)