Closed yeroslaviz closed 6 years ago
You mentioned Socrates for this kind of analysis, but is it possible to extract the same results from the gridss analysis.
Yes. The Socrates paper included a transgene analysis example. We repeated this example for GRIDSS and identified all transgene-related breakpoints. GRIDSS is preferable for this sort of analysis as it will give you longer breakpoint assemblies.
The transgene size is approx. 6.2Mb. Does it make sense to look for Indels of this size in the vcf file?
No. The only way you'll get a 6.2Mb insert called from short read sequence data is from de novo assembly and event then, you'd need to be lucky to get a contig that long.
For transgene insertion site identification you want to:
Insertion sites can then be identified by looking for inter-chromosomal variants between the transgene and the reference genome.
For transgene insertion site identification you want to:
- add transgene sequence as an additional record in your fasta file
- align to the combined fasta containing reference + transgene
- call structural variants
This I have done with the gridss workflow. Now I have a lot of results files. I am sorry to say, that I can't understand the workflow in the README.md file to understand how to sieve through the results and identify what i am looking for. I have run the script
#!/bin/bash
#
# Example gridss pipeline for single sample analysis
#
INPUT=../bwa.mapped/trimmomatic.combined.sorted.bam
#BLACKLIST=wgEncodeDacMapabilityConsensusExcludable.bed
REFERENCE=/home/yeroslaviz/projects/Guru/WGS/bwa.Indexed/combined/Combined.fa
OUTPUT=${INPUT/.bam/.sv.vcf}
ASSEMBLY=${OUTPUT/.sv.vcf/.gridss.assembly.bam}
GRIDSS_JAR=~/software/gridss-1.5.1-jar-with-dependencies.jar
if [[ ! -f "$INPUT" ]] ; then
echo "Missing $INPUT input file."
exit 1
fi
if ! which bwa >/dev/null 2>&1 ; then
echo "Missing bwa. Please add to PATH"
exit 1
fi
if [[ ! -f "$REFERENCE" ]] ; then
echo "Missing reference genome $REFERENCE. Update the REFERENCE variable in the shell script to your hg19 location"
echo "For real data, please ensure that all BAM files are aligned to the same reference, and the reference supplied to GRIDSS matches that used for alignment."
exit 1
fi
if [[ ! -f "$REFERENCE.bwt" ]] ; then
echo "Missing bwa index for $REFERENCE. Could not find $REFERENCE.bwt. Create a bwa index (using \"bwa index $REFERENCE\") or symlink the index files to the expected file names."
exit 1
fi
if [[ ! -f $GRIDSS_JAR ]] ; then
echo "Missing $GRIDSS_JAR. Update the GRIDSS_JAR variable in the shell script to your location"
exit 1
fi
if ! which java >/dev/null 2>&1 ; then
echo "Missing java. Please add java 1.8 or later to PATH"
exit 1
fi
JAVA_VERSION="$(java -version 2>&1 | grep version )"
if [[ ! "$JAVA_VERSION" =~ "\"1.8" ]] ; then
echo "Detected $JAVA_VERSION. GRIDSS requires Java 1.8 or later."
exit 1
fi
if ! which Rscript >/dev/null 2>&1 ; then
echo "Missing R installation. Please add Rscript to PATH"
exit 1
fi
java -ea -Xmx31g -Dsamjdk.create_index=true -Dsamjdk.use_async_io_read_samtools=true -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=true -Dgridss.gridss.output_to_temp_file=true -cp $GRIDSS_JAR gridss.CallVariants TMP_DIR=. R="$REFERENCE" WORKING_DIR=. INPUT="$INPUT" OUTPUT="$OUTPUT" ASSEMBLY="$ASSEMBLY" 2>&1 | tee -a gridss.$HOSTNAME.$$.log
# BLACKLIST="$BLACKLIST" \
The first few lines just checks for existing files and the last line is the gridss script. This ran w.o any errors. I have got a lot of folders and files. The important ones are probably the vcf and bed files. Can you please help me to get a clear picture where and what to look for in this chaos?
thanks again
The VCF is the primary output file. Lets assume your transgene contig is call chr_transgene. If you've got decent coverage and neither your insertion site nor your transgene vector is unmappable, then the simplest way to look for the transgene insertion site is to just grep chr_transgene output.vcf | grep -v LOW_QUAL
and look at the highest scoring variants in that subset. You should find high scoring variants (probably QUAL > 1000) at the start and end of the transgene that go to almost the same genomic location - that's your insertion site. Remember that it's possible to have multiple insertion sites.
By default, GRIDSS reports many many variants, the vast majority of these are low quality variants that you can ignore (hence the LOW_QUAL filter I recommended). In some cases, you might want to look at them. For example, if you found one side of the transgene insertion but not the other, you'll want to look for nearby low quality variants that complete the insertion.
I strongly recommend reading section 5.4 of https://samtools.github.io/hts-specs/VCFv4.3.pdf so you have an idea of what you're looking at.
If you want to know the exact transgene insertion sequence, that can be found in the ASSEMBLY file with the read name matching the BEID field of the transgene breakpoint call (you'll probably have more than one assembly but they should have the same sequence).
Thanks for the detailed response. Unfrotunately this is what I get when grepping for results:
$grep transgene cleaned.combined.sorted.sv.vcf
##contig=<ID=transgene,length=6252>
12 113428752 gridss56_9234o T ]transgene:6194]T 110.59 LOW_QUAL;SINGLE_ASSEMBLY AS=0;ASQ=0.00;ASRP=6;ASSR=0;BA=0;BAQ=0.00;BEID=asm274-18991;BQ=0.00;BSC=0;BSCQ=0.00;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CIPOS=-57,58;CIRPOS=-57,58;CQ=110.59;EVENT=gridss56_9234;IC=0;IMPRECISE;IQ=0.00;PARID=gridss56_9234h;RAS=1;RASQ=94.16;REF=32;REFPAIR=18;RP=1;RPQ=16.43;SC=1X114N1X76M;SR=0;SRQ=0.00;SVTYPE=BND
GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ .:0.00:6:0:0.00:0.00:0:0.00:0:0.00:0.00:0:0.00:110.59:94.16:32:18:1:16.43:0:0.00
7 36210719 gridss210_220o A A]transgene:6194] 125.73 LOW_QUAL;SINGLE_ASSEMBLY AS=1;ASQ=48.00;ASRP=3;ASSR=0;BA=0;BAQ=0.00;BEID=asm210-15597;BQ=199.80;BSC=10;BSCQ=199.80;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CIPOS=-191,192;CIRPOS=-57,58;CQ=125.73;EVENT=gridss210_220;IC=0;IMPRECISE;IQ=0.00;PARID=gridss210_220h;RAS=0;RASQ=0.00;REF=7;REFPAIR=7;RP=5;RPQ=77.73;SC=318M1X382N1X;SR=0;SRQ=0.00;SVTYPE=BND GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ .:48.00:3:0:0.00:199.80:10:199.80:0:0.00:0.00:0:0.00:125.73:0.00:7:7:5:77.73:0:0.00
transgene 6194 gridss210_220h A A]7:36210719] 125.73 LOW_QUAL;SINGLE_ASSEMBLY AS=0;ASQ=0.00;ASRP=3;ASSR=0;BA=0;BAQ=0.00;BEID=asm210-15597;BQ=0.00;BSC=0;BSCQ=0.00;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CIPOS=-57,58;CIRPOS=-191,192;CQ=125.73;EVENT=gridss210_220;IC=0;IMPRECISE;IQ=0.00;PARID=gridss210_220o;RAS=1;RASQ=48.00;REF=0;REFPAIR=0;RP=5;RPQ=77.73;SC=273M121D197M1X114N1X;SR=0;SRQ=0.00;SVTYPE=BND GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ .:0.00:3:0:0.00:0.00:0:0.00:0:0.00:0.00:0:0.00:125.73:48.00:0:0:5:77.73:0:0.00
transgene 6194 gridss56_9234h A A[12:113428752[ 110.59 LOW_QUAL;SINGLE_ASSEMBLY AS=1;ASQ=94.16;ASRP=6;ASSR=0;BA=0;BAQ=0.00;BEID=asm274-18991;BQ=0.00;BSC=0;BSCQ=0.00;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CIPOS=-57,58;CIRPOS=-57,58;CQ=110.59;EVENT=gridss56_9234;IC=0;IMPRECISE;IQ=0.00;PARID=gridss56_9234o;RAS=0;RASQ=0.00;REF=0;REFPAIR=0;RP=1;RPQ=16.43;SC=151M39D1X114N1X;SR=0;SRQ=0.00;SVTYPE=BND GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ .:94.16:6:0:0.00:0.00:0:0.00:0:0.00:0.00:0:0.00:110.59:0.00:0:0:1:16.43:0:0.00
Does this means, that no insertion site was identified with good enough quality to be sure about the event? Is there a way to re-run the analysis with less strict parameters to try and find more hits?
thanks again for all the help Assa
The transgene size is approx. 6.2Mb.
Your transgene is 6.2Kb. Did you mean Kb, or Mb? I'm going to assume it's a 6kb transgene.
What depth of coverage do you have? Those two insertion sites aren't actually that terrible if you don't have high coverage.
Is there a way to re-run the analysis with less strict parameters to try and find more hits?
I recommend adding CONFIGURATION_FILE=gridss.properties to your GRIDSS command line and creating a gridss.properties file containing the following:
variantcalling.callBreakends = true
Delete all the intermediates/outputs from the variant calling stage (rm -r *vcf*
), rerun GRIDSS (you don't need to delete the input file .gridss.working nor any of the assemblies so this rerun should be fast), then grep transgene cleaned.combined.sorted.sv.vcf
. Your output should now include single breakends (see Section 5.4.9 of https://samtools.github.io/hts-specs/VCFv4.3.pdf) which you can use to confirm/BLAST the insertion sites.
Another option is that your transgene (TG) has a strong sequence homology with a gene (G) in the reference genome. If this is the case, then all the reads from both TG and G will have very poor mapping quality. Check in IGV what the MAPQ score for the reads mapping to transgene are. If they're mostly low (<=10) that's a problem.
The simplest solution to this problem is to replace the homologous sequence G in the reference genome with a string of N bases. Aligning to the ref + TG with G masked out will prevent the MAPQ problem but all the reads from G will be placed in TG by the aligner. This isn't actually a big problem as you know the location of G in the reference genome so you can just ignore the breakpoints that say TG is connected to the flanking sequence of G.
Your transgene is 6.2Kb. Did you mean Kb, or Mb? I'm going to assume it's a 6kb transgene.
Yes sorry, 6252bp for precision.
What depth of coverage do you have? Those two insertion sites aren't actually that terrible if you don't have high coverage.
I think this is not such a straight forward answer. How do I check the depth of the coverage? Do you mean reads/bp?
so basically : "how many bp [#read * length(reads)] were mapped" / "how many bp I have in the genome + transgene"
.
$samtools flagstat -@12 cleaned.combined.sorted.bam
727276955 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
4632165 + 0 supplementary
0 + 0 duplicates
725864438 + 0 mapped (99.81% : N/A)
722644790 + 0 paired in sequencing
361322395 + 0 read1
361322395 + 0 read2
707230990 + 0 properly paired (97.87% : N/A)
720039774 + 0 with itself and mate mapped
1192499 + 0 singletons (0.17% : N/A)
10813600 + 0 with mate mapped to a different chr
6387215 + 0 with mate mapped to a different chr (mapQ>=5)
'only' ~10M reads mapped to different chromosomes (~3%). Should this be enough?
Do you think it makes more sense to first map my paired-end sample to only the transgene sequence. Than use the unmapped reads (either mapping with bowtie, which creates an unmapped files or using samtools view -f 4 bam > unmapped.sam
to extract unmapped reads from a bwa-mapped bam file) and map only them to the mouse genome (only).
I can than merge the two bam files and use the merged.bam as the input file for gridss. This will give me better mapping for the transgene.
Will gridss be able to identify inter-chromosomal events between the transgene and genomic chromosomes from this kind of mapping?
How do I check the depth of the coverage?
Just a simple #reads * length(reads) / length(genome)
.
If you've use 2x100bp sequencing then you have ~25x coverage.
I'm attempting to work out what your data issue is.
If you look at your transgene in IGV, what's your coverage? Do you have about 25 reads aligned to each position? 50? 10? What's the mapq of the reads aligned to your transgene?
Do you think it makes more sense to first map my paired-end sample to only the transgene sequence. Then...
Not really. GRIDSS requires the reference genome supplied to GRIDSS matches the reference used to align the input file and it'd be fiddly and error-prone to hack it together.
This will give me better mapping for the transgene.
It'll map all the reads that actually came from homology sequence in the mouse genome to the transgene. You can achieve the same outcome by masking (converting to Ns) the homology regions in your existing combined genome. If you BLAST your 6kp transgenic sequence against the mouse genome, what hits do you get? If you get good hits then that further supports the sequence homology hypothesis.
Aligning to just the transgene is problematic as you'll get a whole lot of false positive soft-clipped reads aligning to the transgene. Any low complexity sequence (e.g polyA) in the transgene will have all the reads from the mouse genome with similar low complexity sequence all mapping to the transgene. Ideally, you want to map to both.
If you BLAST your 6kb transgenic sequence against the mouse genome, what hits do you get? If you get good hits then that further supports the sequence homology hypothesis.
Here is the BLAST results of my 6.2Kb transgene seq against the mouse genome.
The first hit on chr 17 is ~2000 bases at the beginning of the transgene sequence (100% identities), the second hit from chr 12 ~1600 bases long at the end of the transgene sequence99% identities). If I understand it correctly, these are the important two regions to identify the insertion site. Do you think it has such an impact on the mapping results to the transgene?
Another option is that your transgene (TG) has a strong sequence homology with a gene (G) in the reference genome. If this is the case, then all the reads from both TG and G will have very poor mapping quality. Check in IGV what the MAPQ score for the reads mapping to transgene are. If they're mostly low (<=10) that's a problem.
Here are the mapping results for the transgene. The highest peaks is at 37 reads, but at both ends of the sequence it is a much smaller coverage.
region 1-300
region 5900-6252
Do you think it would make sense to convert the two hits from the BLAST in to N's and rerun the mapping and the following workflow? It should give me more hits to the ends of the transgene.
Do you think it has such an impact on the mapping results to the transgene?
It definitely does. You can see in the IGV screenshot that every single read in the first 2kb of the transgene is multi-mapping (coloured white instead of grey). GRIDSS won't give you any hits at all on that side of the transgene since it ignores multi-mapping reads. A lot of the reads that do align there actually come from the homologous mouse sequence. You can clearly see this by the presence of the het SNV around position 1600. If you mask out the homologous sequences then you'll get all the reads mapping to the transgene, as well as apparent translocations between the transgene and the homology which you can just ignore.
Are the 7 SNV and 3 indels in the transgenic construct expected? Any difference could make the transgenic construct non-functional and it's somewhat concerning that the reads indicate that the sequenced transgene is so different from the transgene fasta.
region 5900-6252 looks better but the presence of mapq 0 reads is somewhat concerning. Is there low complexity sequence at this end of the transgene? What does this region look like in trimmomatic.combined.sorted.gridss.assembly.bam.gridss.working/trimmomatic.combined.sorted.gridss.assembly.bam.sv.bam in IGV if you enable viewing of soft-clipped bases? What you want to see is assembly contigs that have long right-pointing soft clips that hang off the end of the transgene. Are any of these present?
What does this region look like in trimmomatic.combined.sorted.gridss.assembly.bam.gridss.working/trimmomatic.combined.sorted.gridss.assembly.bam.sv.bam in IGV if you enable viewing of soft-clipped bases?
Here are the images of both ends with soft-clipping of the file you mentioned.
To be honest i am not really sure what i see here.
What you want to see is assembly contigs that have long right-pointing soft clips that hang off the end of the transgene. Are any of these present?
What are the soft-clipping I see in the images? Is each "read" stands here for a contig of reads containing multiple mapped reads. Why are there "only" three reads here?
So I will try to convert the positions identified in BLAST to N's and re-run the complete analysis. We'll see how better it can get :-)
What are the soft-clipping I see in the images? Is each "read" stands here for a contig of reads containing multiple mapped reads.
Each 'read' is a contig assembled from soft clipped reads, split reads, discordant read pairs, indel-containing reads, and read pairs with only one read mapping (i.e. all the SV-supporting reads), that mutually support a putative SV. These are the 'primary' read alignment. These are then realigned and supplementary records are created that align the soft-clipped bases to the other side of the breakpoint.
Why are there "only" three reads here?
In an ideal scenario, you would have two 'reads' (aka contigs) for each breakpoint. One from reads assembled at the transgene location, and one from reads assembled at the genomic insertion site.
To be honest i am not really sure what i see here.
What you see is the chimeric alignment of the assembled contigs.
asm274-18991 was assembled from reads that mapped past the end of the transgene contigs and appears to be a compound breakpoint involving both chr12 and chr7 (the "SupplementaryAlignments" section of the mouse-over).
asm209-77534 looks like it was assembled from reads that mapped to a different location of chr7 and mapped (very badly - mapq=0) to the transgene. Is chr7:28,985,940-28,986,315 flanking one of the regions of homology?
TL:DR it looks like you have chr7:28,986,315 connected to the transgene connected to chr7:26310529-36210210 connected to chr12:113428810. That doesn't seem very plausible to me. A clean transgene insertion would be chr:pos <-> transgene <-> chr:pos+-30ish base pairs (DNA repair can result in a handful of duplicated or removed bases at the insertion site depending on the repair mechanism used).
asm209-77534 looks like it was assembled from reads that mapped to a different location of chr7 and mapped (very badly - mapq=0) to the transgene. Is chr7:28,985,940-28,986,315 flanking one of the regions of homology?
If I understood the biologist, the first 2Kb of the transgene comes from endogenous mouse genomic sequence from chr 17, the last 1.6Kb from chr12. This is probably why it shows such bad mapq values, but it is not from chr7.
TL:DR it looks like you have chr7:28,986,315 connected to the transgene connected to chr7:26310529-36210210 connected to chr12:113428810. That doesn't seem very plausible to me. A clean transgene insertion would be chr:pos <-> transgene <-> chr:pos+-30ish base pairs (DNA repair can result in a handful of duplicated or removed bases at the insertion site depending on the repair mechanism used).
I'm not really sure how you inffered this from the two breakends. is it possible to have here two independent insertion points?
I have masked the mentioned regions (chr12 and chr17) from the BLAST results and re-run the analysis. This is how the mapping looks line now in comaprision
The top bam file was done against the masked genome, while the bottom one is the original with the lower qualities.
Instead of now just one contig on each side, I seems to have four/five different contigs. See images and details of each of them (from top to bottom) below:
The read information for each of the reads (=contigs) is under this link (if this is not working I can also paste it here).
I can see it looks better now than before, but I am not really sure why I see four contigs now on each side. Are these multiple possible insertion sites?
When grepping for transgene in the vcf results file I get the following hits:
grep transgene secondRun.maskedGenome/cleaned.combined.masked.sorted.sv.vcf
##contig=<ID=transgene,length=6252>
7 28985940 gridss102_10775o T [transgene:5[T 481.85 LOW_QUAL AS=1;ASQ=189.18;ASRP=10;ASSR=8;BA=0;BAQ=0.00;BEID=asm102-77226,asm273-2053;BQ=59.13;BSC=2;BSCQ=31.68;BUM=1;BUMQ=27.46;CAS=0;CASQ=0.00;CIPOS=0,2;CIRPOS=-2,0;CQ=481.85;EVENT=gridss102_10775;HOMLEN=2;HOMSEQ=TC;IC=0;IHOMPOS=0,2;IQ=0.00;PARID=gridss102_10775h;RAS=1;RASQ=135.16;REF=24;REFPAIR=16;RP=6;RPQ=99.61;SC=1X1N1X373M;SR=3;SRQ=57.90;SVTYPE=BND GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ .:189.18:10:8:0.00:59.13:2:31.68:1:27.46:0.00:0:0.00:481.85:135.16:24:16:6:99.61:3:57.90
7 36210528 gridss103_684o G GAGGAATTCGGGAGCTTGAAGT]transgene:6232] 1317.41 . AS=2;ASQ=598.95;ASRP=27;ASSR=16;BA=0;BAQ=0.00;BEID=asm103-15595,asm103-15624,asm273-1214;BQ=46.51;BSC=3;BSCQ=46.51;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CQ=1294.64;EVENT=gridss103_684;IC=0;IHOMPOS=21,-21;IQ=0.00;PARID=gridss103_684h;RAS=1;RASQ=166.01;REF=32;REFPAIR=11;RP=17;RPQ=282.22;SC=605M1X;SR=13;SRQ=270.22;SVTYPE=BND GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ .:598.95:27:16:0.00:46.51:3:46.51:0:0.00:0.00:0:0.00:1317.41:166.01:32:11:17:282.22:13:270.22
12 113427284 gridss176_3143o G G[transgene:4643[ 3661.30 . AS=1;ASQ=1280.81;ASRP=75;ASSR=69;BA=0;BAQ=0.00;BEID=asm176-20960,asm273-2055;BQ=117.43;BSC=8;BSCQ=117.43;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CIPOS=-1,0;CIRPOS=-1,0;CQ=3641.37;EVENT=gridss176_3143;HOMLEN=1;HOMSEQ=G;IC=0;IHOMPOS=-1,0;IQ=0.00;PARID=gridss176_3143h;RAS=1;RASQ=1217.11;REF=0;REFPAIR=0;RP=38;RPQ=630.86;SC=470M2X;SR=28;SRQ=532.52;SVTYPE=BND GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ .:1280.81:75:69:0.00:117.43:8:117.43:0:0.00:0.00:0:0.00:3661.30:1217.11:0:0:38:630.86:28:532.52
12 113428877 gridss176_6753o A ]transgene:6232]A 3474.18 . AS=1;ASQ=1063.75;ASRP=77;ASSR=69;BA=0;BAQ=0.00;BEID=asm176-67713,asm273-1195;BQ=104.22;BSC=7;BSCQ=104.22;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CQ=3420.67;EVENT=gridss176_6753;IC=0;IHOMPOS=0,0;IQ=0.00;PARID=gridss176_6753h;RAS=1;RASQ=1450.90;REF=2;REFPAIR=0;RP=36;RPQ=597.65;SC=1X396M99D140M;SR=20;SRQ=361.88;SVTYPE=BND GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ .:1063.75:77:69:0.00:104.22:7:104.22:0:0.00:0.00:0:0.00:3474.18:1450.90:2:0:36:597.65:20:361.88
17 34000277 gridss225_169o C C]transgene:2033] 1247.20 . AS=2;ASQ=669.03;ASRP=27;ASSR=23;BA=0;BAQ=0.00;BEID=asm225-18260,asm225-18261,asm273-1190;BQ=54.91;BSC=0;BSCQ=0.00;BUM=2;BUMQ=54.91;CAS=0;CASQ=0.00;CQ=1230.60;EVENT=gridss225_169;IC=0;IHOMPOS=0,48;IQ=0.00;PARID=gridss225_169h;RAS=1;RASQ=141.71;REF=20;REFPAIR=0;RP=21;RPQ=348.63;SC=533M1X;SR=5;SRQ=87.82;SVTYPE=BND GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ .:669.03:27:23:0.00:54.91:0:0.00:2:54.91:0.00:0:0.00:1247.20:141.71:20:0:21:348.63:5:87.82
17 34000324 gridss225_170o C C]transgene:1986] 531.47 SINGLE_ASSEMBLY AS=0;ASQ=0.00;ASRP=26;ASSR=1;BA=0;BAQ=0.00;BEID=asm273-1189;BQ=132.98;BSC=9;BSCQ=132.98;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CQ=1145.19;EVENT=gridss225_170;IC=0;IHOMPOS=-48,0;IQ=0.00;PARID=gridss225_170h;RAS=1;RASQ=448.47;REF=0;REFPAIR=0;RP=5;RPQ=83.01;SC=483M1X;SR=0;SRQ=0.00;SVTYPE=BND GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ .:0.00:26:1:0.00:132.98:9:132.98:0:0.00:0.00:0:0.00:531.47:448.47:0:0:5:83.01:0:0.00
17 34002304 gridss225_5474o T [transgene:7[T 1410.46 . AS=1;ASQ=563.58;ASRP=48;ASSR=15;BA=0;BAQ=0.00;BEID=asm225-67082,asm273-2052;BQ=129.97;BSC=7;BSCQ=102.52;BUM=1;BUMQ=27.46;CAS=0;CASQ=0.00;CQ=1410.46;EVENT=gridss225_5474;IC=0;IHOMPOS=0,0;IQ=0.00;PARID=gridss225_5474h;RAS=1;RASQ=496.29;REF=0;REFPAIR=0;RP=19;RPQ=315.43;SC=1X442M;SR=2;SRQ=35.17;SVTYPE=BND GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ .:563.58:48:15:0.00:129.97:7:102.52:1:27.46:0.00:0:0.00:1410.46:496.29:0:0:19:315.43:2:35.17
transgene 5 gridss102_10775h G [7:28985940[G 481.85 LOW_QUAL AS=1;ASQ=135.16;ASRP=10;ASSR=8;BA=0;BAQ=0.00;BEID=asm102-77226,asm273-2053;BQ=0.00;BSC=0;BSCQ=0.00;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CIPOS=-2,0;CIRPOS=0,2;CQ=481.85;EVENT=gridss102_10775;HOMLEN=2;HOMSEQ=GA;IC=0;IHOMPOS=-2,0;IQ=0.00;PARID=gridss102_10775o;RAS=1;RASQ=189.18;REF=0;REFPAIR=0;RP=6;RPQ=99.61;SC=1X1N1X399M;SR=3;SRQ=57.90;SVTYPE=BND GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ .:135.16:10:8:0.00:0.00:0:0.00:0:0.00:0.00:0:0.00:481.85:189.18:0:0:6:99.61:3:57.90
transgene 7 gridss225_5474h A [17:34002304[A 1410.46 . AS=1;ASQ=496.29;ASRP=48;ASSR=15;BA=0;BAQ=0.00;BEID=asm225-67082,asm273-2052;BQ=194.86;BSC=4;BSCQ=57.58;BUM=5;BUMQ=137.28;CAS=0;CASQ=0.00;CQ=1410.46;EVENT=gridss225_5474;IC=0;IHOMPOS=0,0;IQ=0.00;PARID=gridss225_5474o;RAS=1;RASQ=563.58;REF=3;REFPAIR=0;RP=19;RPQ=315.43;SC=1X555M;SR=2;SRQ=35.17;SVTYPE=BND GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ .:496.29:48:15:0.00:194.86:4:57.58:5:137.28:0.00:0:0.00:1410.46:563.58:3:0:19:315.43:2:35.17
transgene 1986 gridss225_170h G G]17:34000324] 531.47 SINGLE_ASSEMBLY AS=1;ASQ=448.47;ASRP=26;ASSR=1;BA=0;BAQ=0.00;BEID=asm273-1189;BQ=0.00;BSC=0;BSCQ=0.00;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CQ=1145.19;EVENT=gridss225_170;IC=0;IHOMPOS=0,48;IQ=0.00;PARID=gridss225_170o;RAS=0;RASQ=0.00;REF=38;REFPAIR=7;RP=5;RPQ=83.01;SC=288M1X;SR=0;SRQ=0.00;SVTYPE=BND GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ .:448.47:26:1:0.00:0.00:0:0.00:0:0.00:0.00:0:0.00:531.47:0.00:38:7:5:83.01:0:0.00
transgene 2033 gridss225_169h G G]17:34000277] 1247.20 . AS=1;ASQ=141.71;ASRP=27;ASSR=23;BA=0;BAQ=0.00;BEID=asm225-18260,asm225-18261,asm273-1190;BQ=70.72;BSC=5;BSCQ=70.72;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CQ=1230.60;EVENT=gridss225_169;IC=0;IHOMPOS=-48,0;IQ=0.00;PARID=gridss225_169o;RAS=2;RASQ=669.03;REF=22;REFPAIR=7;RP=21;RPQ=348.63;SC=472M1X;SR=5;SRQ=87.82;SVTYPE=BND GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ .:141.71:27:23:0.00:70.72:5:70.72:0:0.00:0.00:0:0.00:1247.20:669.03:22:7:21:348.63:5:87.82
transgene 4643 gridss176_3143h A ]12:113427284]A 3661.30 . AS=1;ASQ=1217.11;ASRP=75;ASSR=69;BA=0;BAQ=0.00;BEID=asm176-20960,asm273-2055;BQ=70.33;BSC=5;BSCQ=70.33;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CIPOS=-1,0;CIRPOS=-1,0;CQ=3641.37;EVENT=gridss176_3143;HOMLEN=1;HOMSEQ=G;IC=0;IHOMPOS=-1,0;IQ=0.00;PARID=gridss176_3143o;RAS=1;RASQ=1280.81;REF=23;REFPAIR=16;RP=38;RPQ=630.86;SC=2X566M;SR=28;SRQ=532.52;SVTYPE=BND GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ .:1217.11:75:69:0.00:70.33:5:70.33:0:0.00:0.00:0:0.00:3661.30:1280.81:23:16:38:630.86:28:532.52
transgene 6232 gridss103_684h G GACTTCAAGCTCCCGAATTCCT]7:36210528] 1317.41 . AS=1;ASQ=166.01;ASRP=27;ASSR=16;BA=0;BAQ=0.00;BEID=asm103-15595,asm103-15624,asm273-1214;BQ=0.00;BSC=0;BSCQ=0.00;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CQ=1294.64;EVENT=gridss103_684;IC=0;IHOMPOS=21,-21;IQ=0.00;PARID=gridss103_684o;RAS=2;RASQ=598.95;REF=1;REFPAIR=0;RP=17;RPQ=282.22;SC=686M1X;SR=13;SRQ=270.22;SVTYPE=BNDGT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ .:166.01:27:16:0.00:0.00:0:0.00:0:0.00:0.00:0:0.00:1317.41:598.95:1:0:17:282.22:13:270.22
transgene 6232 gridss176_6753h G G[12:113428877[ 3474.18 . AS=1;ASQ=1450.90;ASRP=77;ASSR=69;BA=0;BAQ=0.00;BEID=asm176-67713,asm273-1195;BQ=220.50;BSC=9;BSCQ=138.13;BUM=3;BUMQ=82.37;CAS=0;CASQ=0.00;CQ=3420.67;EVENT=gridss176_6753;IC=0;IHOMPOS=0,0;IQ=0.00;PARID=gridss176_6753o;RAS=1;RASQ=1063.75;REF=1;REFPAIR=0;RP=36;RPQ=597.65;SC=388M1D151M1X;SR=20;SRQ=361.88;SVTYPE=BND GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ .:1450.90:77:69:0.00:220.50:9:138.13:3:82.37:0.00:0:0.00:3474.18:1063.75:1:0:36:597.65:20:361.88
Just to mention in advance, the run was done with the config file and the added row in it variantcalling.callBreakends = true
, so it also includes singleend breaks.
Again the same question - do I have here multiple possible insertion points?
thanks a lot for the help
Re-running the data and setting variantcalling.callBreakends = false
gave me these results, which looks a bit better and more to what I was expecting.
E.g. I was hoping for two insertion points in chr7 :-)
grep transgene trimmomatic.combined.masked.sorted.sv.vcf | grep -v LOW_QUAL
##contig=<ID=transgene,length=6252>
7 28985942 gridss102_6855o C [transgene:5[C 537.16 . AS=2;ASQ=212.64;ASRP=11;ASSR=6;BA=0;BAQ=0.00;BEID=asm102-21303,asm102-21312,asm273-801;BQ=49.49;BSC=2;BSCQ=49.49;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CIPOS=-2,0;CIRPOS=0,2;CQ=537.16;EVENT=gridss102_6855;HOMLEN=2;HOMSEQ=TC;IC=0;IHOMPOS=-1,0;IQ=0.00;PARID=gridss102_6855h;RAS=1;RASQ=161.37;REF=21;REFPAIR=18;RP=6;RPQ=111.67;SC=1X1N1X372M;SR=2;SRQ=51.47;SVTYPE=BND GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ .:212.64:11:6:0.00:49.49:2:49.49:0:0.00:0.00:0:0.00:537.16:161.37:21:18:6:111.67:2:51.47
7 36210528 gridss103_460o G GAGGAATTCGGGAGCTTGAAGT]transgene:6232] 1491.70 . AS=1;ASQ=680.21;ASRP=22;ASSR=14;BA=0;BAQ=0.00;BEID=asm103-4518,asm273-429;BQ=73.46;BSC=3;BSCQ=73.46;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CQ=1491.70;EVENT=gridss103_460;IC=0;IHOMPOS=21,-21;IQ=0.00;PARID=gridss103_460h;RAS=1;RASQ=167.51;REF=32;REFPAIR=11;RP=15;RPQ=279.19;SC=151M98D356M1X;SR=11;SRQ=364.79;SVTYPE=BND GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ .:680.21:22:14:0.00:73.46:3:73.46:0:0.00:0.00:0:0.00:1491.70:167.51:32:11:15:279.19:11:364.79
12 113427284 gridss176_2066o G G[transgene:4643[ 4587.64 . AS=1;ASQ=1633.33;ASRP=72;ASSR=61;BA=0;BAQ=0.00;BEID=asm176-5709,asm273-803;BQ=190.16;BSC=8;BSCQ=190.16;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CIPOS=-1,0;CIRPOS=-1,0;CQ=4587.64;EVENT=gridss176_2066;HOMLEN=1;HOMSEQ=G;IC=0;IHOMPOS=-1,0;IQ=0.00;PARID=gridss176_2066h;RAS=1;RASQ=1511.15;REF=0;REFPAIR=0;RP=36;RPQ=670.05;SC=341M2X;SR=25;SRQ=773.12;SVTYPE=BND GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ .:1633.33:72:61:0.00:190.16:8:190.16:0:0.00:0.00:0:0.00:4587.64:1511.15:0:0:36:670.05:25:773.12
12 113428877 gridss176_4432o A ]transgene:6232]A 4062.86 . AS=1;ASQ=1253.83;ASRP=70;ASSR=57;BA=0;BAQ=0.00;BEID=asm176-19980,asm273-425;BQ=144.21;BSC=6;BSCQ=144.21;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CQ=3996.24;EVENT=gridss176_4432;IC=0;IHOMPOS=0,0;IQ=0.00;PARID=gridss176_4432h;RAS=1;RASQ=1699.40;REF=2;REFPAIR=0;RP=33;RPQ=614.21;SC=1X381M174D80M;SR=17;SRQ=495.42;SVTYPE=BND GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ .:1253.83:70:57:0.00:144.21:6:144.21:0:0.00:0.00:0:0.00:4062.86:1699.40:2:0:33:614.21:17:495.42
17 34000277 gridss225_105o C C]transgene:2033] 1342.59 . AS=2;ASQ=744.76;ASRP=24;ASSR=19;BA=0;BAQ=0.00;BEID=asm225-5027,asm225-5028,asm273-422;BQ=0.00;BSC=0;BSCQ=0.00;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CQ=1342.59;EVENT=gridss225_105;IC=0;IHOMPOS=0,48;IQ=0.00;PARID=gridss225_105h;RAS=1;RASQ=172.58;REF=18;REFPAIR=0;RP=17;RPQ=309.80;SC=436M1X;SR=4;SRQ=115.45;SVTYPE=BND GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ .:744.76:24:19:0.00:0.00:0:0.00:0:0.00:0.00:0:0.00:1342.59:172.58:18:0:17:309.80:4:115.45
17 34000324 gridss225_106o C C]transgene:1986] 579.44 SINGLE_ASSEMBLY AS=0;ASQ=0.00;ASRP=24;ASSR=1;BA=0;BAQ=0.00;BEID=asm273-421;BQ=182.61;BSC=8;BSCQ=182.61;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CQ=1324.86;EVENT=gridss225_106;IC=0;IHOMPOS=-48,0;IQ=0.00;PARID=gridss225_106h;RAS=1;RASQ=467.77;REF=0;REFPAIR=0;RP=6;RPQ=111.67;SC=483M1X;SR=0;SRQ=0.00;SVTYPE=BND GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ .:0.00:24:1:0.00:182.61:8:182.61:0:0.00:0.00:0:0.00:579.44:467.77:0:0:6:111.67:0:0.00
17 34002304 gridss225_3419o T [transgene:7[T 1634.38 . AS=1;ASQ=668.18;ASRP=47;ASSR=14;BA=0;BAQ=0.00;BEID=asm225-19363,asm273-800;BQ=174.83;BSC=6;BSCQ=139.38;BUM=1;BUMQ=35.45;CAS=0;CASQ=0.00;CQ=1634.38;EVENT=gridss225_3419;IC=0;IHOMPOS=0,0;IQ=0.00;PARID=gridss225_3419h;RAS=1;RASQ=562.31;REF=0;REFPAIR=0;RP=19;RPQ=350.02;SC=1X416M;SR=2;SRQ=53.87;SVTYPE=BND GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ .:668.18:47:14:0.00:174.83:6:139.38:1:35.45:0.00:0:0.00:1634.38:562.31:0:0:19:350.02:2:53.87
transgene 5 gridss102_6855h G [7:28985942[G 537.16 . AS=1;ASQ=161.37;ASRP=11;ASSR=6;BA=0;BAQ=0.00;BEID=asm102-21303,asm102-21312,asm273-801;BQ=0.00;BSC=0;BSCQ=0.00;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CIPOS=0,2;CIRPOS=-2,0;CQ=537.16;EVENT=gridss102_6855;HOMLEN=2;HOMSEQ=GA;IC=0;IHOMPOS=0,1;IQ=0.00;PARID=gridss102_6855o;RAS=2;RASQ=212.64;REF=2;REFPAIR=0;RP=6;RPQ=111.67;SC=1X1N1X243M3D171M;SR=2;SRQ=51.47;SVTYPE=BND GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ .:161.37:11:6:0.00:0.00:0:0.00:0:0.00:0.00:0:0.00:537.16:212.64:2:0:6:111.67:2:51.47
transgene 7 gridss225_3419h A [17:34002304[A 1634.38 . AS=1;ASQ=562.31;ASRP=47;ASSR=14;BA=0;BAQ=0.00;BEID=asm225-19363,asm273-800;BQ=169.86;BSC=3;BSCQ=63.52;BUM=3;BUMQ=106.34;CAS=0;CASQ=0.00;CQ=1634.38;EVENT=gridss225_3419;IC=0;IHOMPOS=0,0;IQ=0.00;PARID=gridss225_3419o;RAS=1;RASQ=668.18;REF=2;REFPAIR=0;RP=19;RPQ=350.02;SC=1X515M;SR=2;SRQ=53.87;SVTYPE=BND GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ .:562.31:47:14:0.00:169.86:3:63.52:3:106.34:0.00:0:0.00:1634.38:668.18:2:0:19:350.02:2:53.87
transgene 1986 gridss225_106h G G]17:34000324] 579.44 SINGLE_ASSEMBLY AS=1;ASQ=467.77;ASRP=24;ASSR=1;BA=0;BAQ=0.00;BEID=asm273-421;BQ=0.00;BSC=0;BSCQ=0.00;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CQ=1324.86;EVENT=gridss225_106;IC=0;IHOMPOS=0,48;IQ=0.00;PARID=gridss225_106o;RAS=0;RASQ=0.00;REF=29;REFPAIR=7;RP=6;RPQ=111.67;SC=288M1X;SR=0;SRQ=0.00;SVTYPE=BND GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ .:467.77:24:1:0.00:0.00:0:0.00:0:0.00:0.00:0:0.00:579.44:0.00:29:7:6:111.67:0:0.00
transgene 2033 gridss225_105h G G]17:34000277] 1342.59 . AS=1;ASQ=172.58;ASRP=24;ASSR=19;BA=0;BAQ=0.00;BEID=asm225-5027,asm225-5028,asm273-422;BQ=84.82;BSC=4;BSCQ=84.82;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CQ=1342.59;EVENT=gridss225_105;IC=0;IHOMPOS=-48,0;IQ=0.00;PARID=gridss225_105o;RAS=2;RASQ=744.76;REF=13;REFPAIR=11;RP=17;RPQ=309.80;SC=411M1X;SR=4;SRQ=115.45;SVTYPE=BND GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ .:172.58:24:19:0.00:84.82:4:84.82:0:0.00:0.00:0:0.00:1342.59:744.76:13:11:17:309.80:4:115.45
transgene 4643 gridss176_2066h A ]12:113427284]A 4587.64 . AS=1;ASQ=1511.15;ASRP=72;ASSR=61;BA=0;BAQ=0.00;BEID=asm176-5709,asm273-803;BQ=67.98;BSC=3;BSCQ=67.98;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CIPOS=-1,0;CIRPOS=-1,0;CQ=4587.64;EVENT=gridss176_2066;HOMLEN=1;HOMSEQ=G;IC=0;IHOMPOS=-1,0;IQ=0.00;PARID=gridss176_2066o;RAS=1;RASQ=1633.33;REF=20;REFPAIR=16;RP=36;RPQ=670.05;SC=2X566M;SR=25;SRQ=773.12;SVTYPE=BND GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ .:1511.15:72:61:0.00:67.98:3:67.98:0:0.00:0.00:0:0.00:4587.64:1633.33:20:16:36:670.05:25:773.12
transgene 6232 gridss103_460h G GACTTCAAGCTCCCGAATTCCT]7:36210528] 1491.70 . AS=1;ASQ=167.51;ASRP=22;ASSR=14;BA=0;BAQ=0.00;BEID=asm103-4518,asm273-429;BQ=0.00;BSC=0;BSCQ=0.00;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CQ=1491.70;EVENT=gridss103_460;IC=0;IHOMPOS=21,-21;IQ=0.00;PARID=gridss103_460o;RAS=1;RASQ=680.21;REF=1;REFPAIR=0;RP=15;RPQ=279.19;SC=151M8D447M1X;SR=11;SRQ=364.79;SVTYPE=BND GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ .:167.51:22:14:0.00:0.00:0:0.00:0:0.00:0.00:0:0.00:1491.70:680.21:1:0:15:279.19:11:364.79
transgene 6232 gridss176_4432h G G[12:113428877[ 4062.86 . AS=1;ASQ=1699.40;ASRP=70;ASSR=57;BA=0;BAQ=0.00;BEID=asm176-19980,asm273-425;BQ=221.43;BSC=6;BSCQ=150.53;BUM=2;BUMQ=70.90;CAS=0;CASQ=0.00;CQ=3996.24;EVENT=gridss176_4432;IC=0;IHOMPOS=0,0;IQ=0.00;PARID=gridss176_4432o;RAS=1;RASQ=1253.83;REF=1;REFPAIR=0;RP=33;RPQ=614.21;SC=443M1D151M1X;SR=17;SRQ=495.42;SVTYPE=BND GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ .:1699.40:70:57:0.00:221.43:6:150.53:2:70.90:0.00:0:0.00:4062.86:1253.83:1:0:33:614.21:17:495.42
I can see it looks better now than before, but I am not really sure why I see four contigs now on each side. Are these multiple possible insertion sites?
Since GRIDSS does breakend assembly, not breakpoint assembly, each clean breakpoint between A and B will have an assembly at A that also maps across to B and an assembly at B that also maps back to A. When viewing the assembly.sv.bam, the contigs will have a chimeric alignment and have split read alignment records at both A and B. Since there are two of them, you'll see 2 assemblies at A and 2 assemblies at B.
So, you've got insertions into chr12, and chr17. Presumably these are the masked regions so they won't be real transgene insertions, these will just be the locations of the mouse homolog.
Again the same question - do I have here multiple possible insertion points?
Doesn't look like it. You have two different insertion locations on chr7, but they're not very close to each other so it doesn't look like a simple clean insertion. You have:
7:28985942 -> transgene -> (untemplated inserted sequence of ACTTCAAGCTCCCGAATTCCT) ->7:36210528
BLASTing the assemblies in question gives nice clean alignments to those location so we quite good confidence in that part of the transgenic insertion. The question is: what happened around that? Have a look at the copy number on chr7. Did you actually lose 7Mb of chr7 where the transgene got inserted (that'd be obvious from even a .tdf coverage file generated using igvtools)? Are there any copy number transitions near the transgene chr7 breakpoint sites? Maybe there's a small extra fragment that got inserted with the transgene. Eg. maybe a few hundred/thousand bases of 7:28985942 sequence got inserted with the transgene at the 7:36210528 site (or maybe the opposite). Are there any breakpoints that go from somewhere near 7:28985942 to somewhere near 7:36210528?
Would this mean, that these four rows are showing one breakend assembly with an extra sequence inserted?
7 28985942 gridss102_6855o C [transgene:5[C 537.16 . AS=2;ASQ=212.64;ASRP=11;ASSR=6;BA=0;BAQ=0.00;BEID=asm102-21303,asm102-21312,asm273-801;BQ=49.49;BSC=2;BSCQ=49.49;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CIPOS=-2,0;CIRPOS=0,2;CQ=537.16;EVENT=gridss102_6855;HOMLEN=2;HOMSEQ=TC;IC=0;IHOMPOS=-1,0;IQ=0.00;PARID=gridss102_6855h;RAS=1;RASQ=161.37;REF=21;REFPAIR=18;RP=6;RPQ=111.67;SC=1X1N1X372M;SR=2;SRQ=51.47;SVTYPE=BND GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ .:212.64:11:6:0.00:49.49:2:49.49:0:0.00:0.00:0:0.00:537.16:161.37:21:18:6:111.67:2:51.47
7 36210528 gridss103_460o G GAGGAATTCGGGAGCTTGAAGT]transgene:6232] 1491.70 . AS=1;ASQ=680.21;ASRP=22;ASSR=14;BA=0;BAQ=0.00;BEID=asm103-4518,asm273-429;BQ=73.46;BSC=3;BSCQ=73.46;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CQ=1491.70;EVENT=gridss103_460;IC=0;IHOMPOS=21,-21;IQ=0.00;PARID=gridss103_460h;RAS=1;RASQ=167.51;REF=32;REFPAIR=11;RP=15;RPQ=279.19;SC=151M98D356M1X;SR=11;SRQ=364.79;SVTYPE=BND GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ .:680.21:22:14:0.00:73.46:3:73.46:0:0.00:0.00:0:0.00:1491.70:167.51:32:11:15:279.19:11:364.79
transgene 5 gridss102_6855h G [7:28985942[G 537.16 . AS=1;ASQ=161.37;ASRP=11;ASSR=6;BA=0;BAQ=0.00;BEID=asm102-21303,asm102-21312,asm273-801;BQ=0.00;BSC=0;BSCQ=0.00;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CIPOS=0,2;CIRPOS=-2,0;CQ=537.16;EVENT=gridss102_6855;HOMLEN=2;HOMSEQ=GA;IC=0;IHOMPOS=0,1;IQ=0.00;PARID=gridss102_6855o;RAS=2;RASQ=212.64;REF=2;REFPAIR=0;RP=6;RPQ=111.67;SC=1X1N1X243M3D171M;SR=2;SRQ=51.47;SVTYPE=BND GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ .:161.37:11:6:0.00:0.00:0:0.00:0:0.00:0.00:0:0.00:537.16:212.64:2:0:6:111.67:2:51.47
transgene 6232 gridss103_460h G GACTTCAAGCTCCCGAATTCCT]7:36210528] 1491.70 . AS=1;ASQ=167.51;ASRP=22;ASSR=14;BA=0;BAQ=0.00;BEID=asm103-4518,asm273-429;BQ=0.00;BSC=0;BSCQ=0.00;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CQ=1491.70;EVENT=gridss103_460;IC=0;IHOMPOS=21,-21;IQ=0.00;PARID=gridss103_460o;RAS=1;RASQ=680.21;REF=1;REFPAIR=0;RP=15;RPQ=279.19;SC=151M8D447M1X;SR=11;SRQ=364.79;SVTYPE=BND GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ .:167.51:22:14:0.00:0.00:0:0.00:0:0.00:0.00:0:0.00:1491.70:680.21:1:0:15:279.19:11:364.79
BLASTing the assemblies in question gives nice clean alignments to those location so we quite good confidence in that part of the transgenic insertion.
Are we talking here about chr12 and chr17? Or which assemblies?
To be honest, I'm not really sure what you mean here :-( It is all still quite confusing.
The question is: what happened around that? Have a look at the copy number on chr7. Did you actually lose 7Mb of chr7 where the transgene got inserted (that'd be obvious from even a .tdf coverage file generated using igvtools)?
To try and see this I'm converting the bam file to wig. When this is done I will use igvtools to create the tdf file and post the results here too.
Are there any copy number transitions near the transgene chr7 breakpoint sites? Maybe there's a small extra fragment that got inserted with the transgene. Eg. maybe a few hundred/thousand bases of 7:28985942 sequence got inserted with the transgene at the 7:36210528 site (or maybe the opposite).
Would this be visible in the gridss vcf result file? and if so, where can I see it? If it is in the same vcf file, than I don't think there are any. I pasted a snapshot of the two regions mentioned below. As far I can tell, the next event identified next to the possible transgene insertion position is another BND, but quite far off. I was inclined to say no, but than I took a look at all the hits identified in chr7. Here is a list of all hits between positions 7:28985942-36210528.
7 28985942 gridss102_6855o C [transgene:5[C 537.16 . AS=2;ASQ=212.64;ASRP=11;ASSR=6;BA=0;BAQ=0.00;BEID=asm102-21303,asm102-21312,asm273-801;BQ=49.49;BSC=2;BSCQ=49.49;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CIPOS=-2,0;CIRPOS=0,2;CQ=537.16;EVENT=gridss102_6855;HOMLEN=2;HOMSEQ=TC;IC=0;IHOMPOS=-1,0;IQ=0.00;PARID=gridss102_6855h;RAS=1;RASQ=161.37;REF=21;REFPAIR=18;RP=6;RPQ=111.67;SC=1X1N1X372M;SR=2;SRQ=51.47;SVTYPE=BND GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ .:212.64:11:6:0.00:49.49:2:49.49:0:0.00:0.00:0:0.00:537.16:161.37:21:18:6:111.67:2:51.47
7 29002710 gridss102_3550o T T[7:36181051[ 2309.2 . AS=1;ASQ=786.99;ASRP=34;ASSR=31;BA=0;BAQ=0.00;BEID=asm102-8451,asm103-18360;BQ=69.27;BSC=3;BSCQ=69.27;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CIPOS=-2,0;CIRPOS=-2,0;CQ=2309.20;EVENT=gridss102_3550;HOMLEN=2;HOMSEQ=AT;IC=0;IHOMPOS=-2,0;IQ=0.00;PARID=gridss102_3550h;RAS=1;RASQ=761.91;REF=31;REFPAIR=18;RP=18;RPQ=335.02;SC=117M46D340M1X1N1X;SR=14;SRQ=425.28;SVTYPE=BND GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ .:786.99:34:31:0.00:69.27:3:69.27:0:0.00:0.00:0:0.00:2309.20:761.91:31:18:18:335.02:14:425.28
7 31382711 gridss102_5924o G ]7:31386712]G 1356.06 . AS=1;ASQ=351.86;ASRP=49;ASSR=0;BA=0;BAQ=0.00;BEID=asm102-11832,asm102-24636;BQ=35.45;BSC=0;BSCQ=0.00;BUM=1;BUMQ=35.45;CAS=0;CASQ=0.00;CIPOS=-90,0;CIRPOS=-90,0;CQ=1356.06;EVENT=gridss102_5924;HOMLEN=90;HOMSEQ=TCTGTTGAGGGGCATCTAGGTTCTTTCCAGCTTCTGGCTATTATAAATAAGGCTGCTATGAACATAGTGGAGCATGTGTCCTTCTTACCA;IC=0;IHOMPOS=-284,158;IQ=0.00;PARID=gridss102_5924h;RAS=1;RASQ=799.46;REF=28;REFPAIR=11;RP=11;RPQ=204.74;SC=1X89N1X357M;SR=0;SRQ=0.00;SVTYPE=BND GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ .:351.86:49:0:0.00:35.45:0:0.00:1:35.45:0.00:0:0.00:1356.06:799.46:28:11:11:204.74:0:0.00
7 31386712 gridss102_5924h A A[7:31382711[ 1356.06 . AS=1;ASQ=799.46;ASRP=49;ASSR=0;BA=0;BAQ=0.00;BEID=asm102-11832,asm102-24636;BQ=496.28;BSC=0;BSCQ=0.00;BUM=14;BUMQ=496.28;CAS=0;CASQ=0.00;CIPOS=-90,0;CIRPOS=-90,0;CQ=1356.06;EVENT=gridss102_5924;HOMLEN=90;HOMSEQ=TCTGTTGAGGGGCATCTAGGTTCTTTCCAGCTTCTGGCTATTATAAATAAGGCTGCTATGAACATAGTGGAGCATGTGTCCTTCTTACCA;IC=0;IHOMPOS=-284,158;IQ=0.00;PARID=gridss102_5924o;RAS=1;RASQ=351.86;REF=30;REFPAIR=15;RP=11;RPQ=204.74;SC=460M1X89N1X;SR=0;SRQ=0.00;SVTYPE=BND GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ .:799.46:49:0:0.00:496.28:0:0.00:14:496.28:0.00:0:0.00:1356.06:351.86:30:15:11:204.74:0:0.00
7 35635809 gridss103_5728o T ]7:35635921]T 511.86 . AS=1;ASQ=331.14;ASRP=0;ASSR=20;BA=0;BAQ=0.00;BEID=asm103-17508,asm103-3635;BQ=152.34;BSC=7;BSCQ=152.34;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CIPOS=0,11;CIRPOS=0,11;CQ=511.86;EVENT=gridss103_5728;HOMLEN=11;HOMSEQ=TGTGTGTGTGT;IC=0;IHOMPOS=-28,65;IQ=0.00;PARID=gridss103_5728h;RAS=1;RASQ=152.00;REF=30;REFPAIR=11;RP=0;RPQ=0.00;SC=1X10N1X88M;SR=1;SRQ=28.73;SVTYPE=BND GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ .:331.14:0:20:0.00:152.34:7:152.34:0:0.00:0.00:0:0.00:511.86:152.00:30:11:0:0.00:1:28.73
7 35635921 gridss103_5728h G G[7:35635809[ 511.86 . AS=1;ASQ=152.00;ASRP=0;ASSR=20;BA=0;BAQ=0.00;BEID=asm103-17508,asm103-3635;BQ=91.26;BSC=4;BSCQ=91.26;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CIPOS=0,11;CIRPOS=0,11;CQ=511.86;EVENT=gridss103_5728;HOMLEN=11;HOMSEQ=TGTGTGTGTGT;IC=0;IHOMPOS=-28,65;IQ=0.00;PARID=gridss103_5728o;RAS=1;RASQ=331.14;REF=7;REFPAIR=14;RP=0;RPQ=0.00;SC=88M1X10N1X;SR=1;SRQ=28.73;SVTYPE=BND GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ .:152.00:0:20:0.00:91.26:4:91.26:0:0.00:0.00:0:0.00:511.86:331.14:7:14:0:0.00:1:28.73
7 36181051 gridss102_3550h G ]7:29002710]G 2309.2 . AS=1;ASQ=761.91;ASRP=34;ASSR=31;BA=0;BAQ=0.00;BEID=asm102-8451,asm103-18360;BQ=20.22;BSC=1;BSCQ=20.22;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CIPOS=-2,0;CIRPOS=-2,0;CQ=2309.20;EVENT=gridss102_3550;HOMLEN=2;HOMSEQ=AT;IC=0;IHOMPOS=-2,0;IQ=0.00;PARID=gridss102_3550o;RAS=1;RASQ=786.99;REF=46;REFPAIR=13;RP=18;RPQ=335.02;SC=1X1N1X547M;SR=14;SRQ=425.28;SVTYPE=BND GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ .:761.91:34:31:0.00:20.22:1:20.22:0:0.00:0.00:0:0.00:2309.20:786.99:46:13:18:335.02:14:425.28
7 36210528 gridss103_460o G GAGGAATTCGGGAGCTTGAAGT]transgene:6232] 1491.7 . AS=1;ASQ=680.21;ASRP=22;ASSR=14;BA=0;BAQ=0.00;BEID=asm103-4518,asm273-429;BQ=73.46;BSC=3;BSCQ=73.46;BUM=0;BUMQ=0.00;CAS=0;CASQ=0.00;CQ=1491.70;EVENT=gridss103_460;IC=0;IHOMPOS=21,-21;IQ=0.00;PARID=gridss103_460h;RAS=1;RASQ=167.51;REF=32;REFPAIR=11;RP=15;RPQ=279.19;SC=151M98D356M1X;SR=11;SRQ=364.79;SVTYPE=BND GT:ASQ:ASRP:ASSR:BAQ:BQ:BSC:BSCQ:BUM:BUMQ:CASQ:IC:IQ:QUAL:RASQ:REF:REFPAIR:RP:RPQ:SR:SRQ .:680.21:22:14:0.00:73.46:3:73.46:0:0.00:0.00:0:0.00:1491.70:167.51:32:11:15:279.19:11:364.79
The first and last rows are the breakends with the transgene (the complete list can be found here). If I can interpret the results correctly, the event gridss102_3550
span almost across all the region between the two insertion positions.
Are there any breakpoints that go from somewhere near 7:28985942 to somewhere near 7:36210528?
7:28985942
7:36210528
I hope you can make more sense in it than I can. I am not even sure where to start for now. thanks a lot for all the help.
Are we talking here about chr12 and chr17? Or which assemblies? To be honest, I'm not really sure what you mean here :-( It is all still quite confusing.
The chr7 ones. I got the breakpoint contig sequences from the read information in the file you attached, BLASTed them to the mouse genome and confirmed that there was not any ambiguity in the insertion sites.
Would this be visible in the gridss vcf result file? and if so, where can I see it?
No, GRIDSS doesn't to CN calling (yet). You'll need to use a CNV calling tool for that.
You don't have a clean transgenic insertion, you're now at the point that you need work out what else got rearranged when the transgene got inserted.
7:28985942
That's a pretty clear copy number change between those breakpoints there.You have the DNA fragment 7:28985942-7:29002710 inserted with your transgene (connected to the start of the transgene).
Following gridss102_3550 from 7:29002710 leads to 7:36181051 and continuing on to the right until we get to 7:36210528
So: consider the following DNA fragments:
A: 7:start of chr to 28985941 B: 7:28985942-29002710 C: 7:29002711-36181050 D: 7:36181051-36210528 E: 7:36210529 to end of chr T: AGGAATTCGGGAGCTTGAAGT followed by transgene:6232-1 (note that transgene transcript is on the negative strand)
There are multiple possible transgene insertion configurations. You have either:
a) A circular double minute BDT
b) ABDTBCDE
c) ABCDTBDE
There's no way to tell which one it is from this data since we can't phase the three breakpoints involved in the insertion (gridss102_6855, gridss102_3550, gridss103_460).
All scenarios results in a copy number increase for DNA segments B and D. You should confirm this is the case with the .tdf you generate (you can generate one directly from the BAM file) or a CNV tool.
I hope you can make more sense in it than I can. I am not even sure where to start for now.
I recommend drawing out the DNA segments and breakpoint connections.
thanks a lot for all the help.
Not a problem. Always happy to be credited with helping with the analysis ;)
Hi, I am not sure, whether this is the right place to put such a question, but i am not sure how to proceed. I have done the gridss analysis and got loads of files and results. How can I make sense out of it? Is there a way to conveniently visualize or filter the results to get only specific Indels?
As mentioned in earlier questions, I am mainly interested in possible insertions of my transgene in the mouse genome. You mentioned Socrates for this kind of analysis, but is it possible to extract the same results from the gridss analysis. The transgene size is approx. 6.2Mb. Does it make sense to look for Indels of this size in the vcf file?
thanks Assa