Closed agolicz closed 2 years ago
Hi @agolicz,
yes, in standard BED start coordinates are 0-based (end coordinates are not). For your example (assuming this is a "test.bed" file) from Assemblytics, something like the line below should work:
awk 'OFS=FS="\t"''{if($4 == "Insertion") print $1, $3-1, $3, $4; else print $0}' test.bed
Let me know if I can help further,
Davide
Beautiful! Many thanks. One more question do you have a recommended tool to convert HACk bed to vcf?
Hi @agolicz,
not a tool, but this script here can convert a HACk BED with standard SVs (DEL,DUP,INV,INS,BND) into a VCF that can be in turn used with downstream softwares like truvari for benchmarking. Usage is something like:
python converter.py <input.reference.fasta> <input.hack.bed> <output.hack.vcf>
Hope this helps,
Davide
Perfect, exactly what I was after :).
Agnieszka
Hi again, I am trying to convert the SHORtS output to fastq (we want to use different mappers), but something a bit strange seems to be happening.
First I tried samtools fastq, but most of the reads were discarded as singletons. The I tried bedtools bamtofastq but it complains (for some reads) that reads are marked as paired but there is no mate in the bam file.
Commands used: VISOR SHORtS -g Express617_v1.chrs.fa -s haplo -b shorts.laser.bed -o shorts.10x.out --threads 8 --coverage 10 --length 100 > shorts.10x.stdout 2>shorts.10x.err Finishes without isses. err file is empty tail -n 2 shorts.10x.stdout [05/10/2021 10:39:53][Message] Merging simulated BAM [05/10/2021 10:44:15][Message] Done
Attempted extraction with samtools: samtools view shorts.10x.bam | wc -l 75897493 samtools fastq -1 10x.shorts.R1.fq -2 10x.shorts.R2.fq -0 /dev/null -s /dev/null -n shorts.10x.bam [M::bam2fq_mainloop] discarded 75846243 singletons [M::bam2fq_mainloop] processed 75849457 reads
Attempted extraction with bedtools: samtools sort -n -o shorts.10x.qsort.bam shorts.10x.bam bamToFastq -i shorts.10x.qsort.bam -fq shorts.10x.R1.fq -fq2 shorts.10x.R2.fq *****WARNING: Query c1h1fh_chrA01_1_29974771|2319701|2320211|3:0:0|5:0:0|7a57b is marked as paired, but its mate does not occur next to it in your BAM file. Skipping.
Indeed when I just grep for the read name it occurs in the bam file only once. One would expect it twice for paired reads. view shorts.10x.bam | grep "c1h1fh_chrA01_1_29974771|2319701|2320211|3:0:0|5:0:0|7a57b" c1h1fh_chrA01_1_29974771|2319701|2320211|3:0:0|5:0:0|7a57b 73 chrA01 2332309 2 100M = 2332309 0 GATGGGTTACCAGATAAAGTAAAAAGTTTATTTGTAATTTCCAGATATGTTTAGATGGAGAGAGAGAGAGTGTGTGTATCTCTGTCAAGGGCGTAATTGG 2222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222 NM:i:3 ms:i:170 AS:i:170 nn:i:0 tp:A:P cm:i:2 s1:i:27 s2:i:0 de:f:0.03 MD:Z:21T32T25G19 rl:i:0 RG:Z:illumina
I also noticed that the reads which have no mate in bam have the same mapping location for both R1 and R2. I assume this is expected behavior, but now sure what it signifies.
Agnieszka
Hi @agolicz,
I think this is an issue related to the names used by VISOR in read headers, which are modified to facilitate downstream benchmarking (and debugging, I must say). For the alignment step I think this is not a big issue, because of pairs being read in parallel from 2 (temporary) FASTQ files, but for re-building the initial FASTQ files I guess this can be problematic. If you can share your data, I can try to replicate the issue and dig a bit better into this one as soon as I have time.
Thanks,
Davide
Thanks, I'll share the files later today. Do you think it would possible to have an option dump the simulated reads directly to fastq and have fastq along the bam. That would save the hassle of rebuilding...
If you are too busy now, I could modify my local copy and test if you could point me to the part of the code which should be changed.
Yes, that' the idea, basically adding a --fastq flag that stores the fastq in the output folder. I think I can have a look into this in the next few days but feel free to test your ideas and get back with suggestions.
Thanks,
Davide
Hi @agolicz,
attached a new version of VISOR.py and SHORtS.py. Can you give these a try? You need to un-tar the file attached and replace VISOR.py and SHORtS.py in your VISOR folder with the copies attached before re-running setup.py. SHORtS (and LASeR, same stuff, but will be added in the official release) now accepts a --fastq flag that dumps fastq files (r1.fq and r2.fq) to the output folder without changing the previous simulation model. Hopefully, these can be further re-used for testing other aligners. It works on a small test, but maybe needs some refining.
Thanks,
Davide newvers.tar.gz
Thanks heaps. Testing now. I will let you know how it goes.
In the meantime I modified the BulkSim function to accept fastq file locations as two additional arguments and then dumped the reads there, but this is much nicer :)
bedtools bamtofastq seemed to have worked on LASeR bams without issues, so hopefully those should be fine.
Seems to have worked fine!
I have one last question regarding coverages (genome size: 764629779 bp). For SHORtS coverage is exactly as I would expect it: For --coverage 10 wc -l r1.fq 152696444 r1.fq
float(152696444)/42100/764629779 9.984991965634652
Foe LASeR in the --covarege 5 simulation I got cat laser.5x.fq | paste - - - - | cut -f 2 | tr -d '\n' | wc -c
float(10376507526)/764629779 13.570629618389477
Is the coverage calculation different for long reads (not total bp/genome size)? I only noticed because LASeR fastq files are much bigger than SHORts.
Commands to get laser.5x.fq VISOR LASeR -g Express617_v1.chrs.fa -s haplo -b shorts.laser.bed -o laser.5x.out --threads 8 --coverage 5 > laser.5x.stdout 2>laser.5x.err bamToFastq -i laser.5x.bam -fq laser.5x.fq > btools.laser.5x.out 2>btools.laser.5x.err
Agnieszka
Is(are) the contig(s) used for simulation same size of the reference genome? I've never seen such an issue with LASeR (more fluctuations than those observed with short reads though) and I'm honestly using this quite frequently.
Thanks,
Davide
Yep. The reference genome is Express617_v1.chrs.fa. Same that was used for HACk
VISOR HACk -g Express617_v1.chrs.fa -b Assemblytics.INS.DEL.visor.uniq.no.chrs.bed -o haplo > HACk.log cut -f1,2 haplo/h1.fa.fai > haplochroms.dim.tsv awk 'OFS=FS="\t"''{print $1, "1", $2, "100.0", "100.0"}' haplochroms.dim.tsv > shorts.laser.bed
Total length Express617_v1.chrs.fa: 764629779
The simulated and reference haplotypes as seem similar.
t<-read.csv("shorts.laser.bed", sep="\t", header=F) sum(t$V3) [1] 763897743 t<-read.csv("Express617_v1.chrs.fa.fai", sep="\t", header=F) sum(t$V2) [1] 764629779
I have single haplotype because it's a plant and highly inbred. Is it possible that LASeR simulates reads as if there were two haplotypes and I would need to cut coverage in half?
ll haplo total 758429 drwxr-sr-x 2 agolicz agcpgl 2 Oct 5 10:00 ./ drwxr-sr-x 12 agolicz agcpgl 87 Oct 6 14:50 ../ -rw-r--r-- 1 agolicz agcpgl 776629534 Oct 5 10:00 h1.fa -rw-r--r-- 1 agolicz agcpgl 598 Oct 5 10:00 h1.fa.fai
Just re-checked with a human chromosome and the simulated coverage is what I expected.
mkdir coveragetest && cd coveragetest
curl -LO ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa
samtools faidx GRCh38_full_analysis_set_plus_decoy_hla.fa chr22 > chr22.fa
samtools faidx chr22.fa
echo -e "chr22\t15000000\t16000000\tdeletion\tNone\t0\nchr22\t20000000\t21000000\tinversion\tNone\t0\nchr22\t30000000\t31000000\ttandem duplication\t2\t0" > HACk.h1.bed
VISOR HACk -g chr22.fa -b HACk.h1.bed -o hack.1.out
cut -f1,2 hack.*.out/*.fai chr22.fa.fai > haplochroms.dim.tsv
cat haplochroms.dim.tsv | sort | awk '$2 > maxvals[$1] {lines[$1]=$0; maxvals[$1]=$2} END { for (tag in lines) print lines[tag] }' > maxdims.tsv
awk 'OFS=FS="\t"''{print $1, "1", $2, "100.0", "100.0"}' maxdims.tsv > shorts.laser.simple.bed
VISOR LASeR -g chr22.fa -s hack.1.out -b shorts.laser.simple.bed -o laser.1.out --threads 7 --coverage 1 --tag --fastq
cat laser.1.out/r.fq | paste - - - - | cut -f 2 | tr -d '\n' | wc -c #50862344
#contig size is 50818468
#from ipython
50818468/50862344 = 0.99
Can't really say what's the issue here. Can you share a similar code snippet so that I can try to reproduce this weird behaviour?
I must be doing something strange...
VISOR HACk -g Express617_v1.chrs.fa -b Assemblytics.INS.DEL.visor.uniq.no.chrs.bed -o haplo > HACk.log cut -f1,2 haplo/h1.fa.fai > haplochroms.dim.tsv awk 'OFS=FS="\t"''{print $1, "1", $2, "100.0", "100.0"}' haplochroms.dim.tsv > shorts.laser.bed VISOR LASeR -g Express617_v1.chrs.fa -s haplo -b shorts.laser.bed -o laser.5x.out --threads 8 --coverage 5 > laser.5x.stdout 2>laser.5x.err mv laser.5x.out/sim.srt.bam laser.5x.bam bamToFastq -i laser.5x.bam -fq laser.5x.fq > btools.laser.5x.out 2>btools.laser.5x.err cat laser.5x.fq | paste - - - - | cut -f 2 | tr -d '\n' | wc -c
The two files: http://sendanywhe.re/HH3597LW http://sendanywhe.re/XACV97J6
Hi @agolicz, sorry for being late on this.
I honestly can't say what is happening on your side, because I double checked and I actually ended up having the proper coverage.
VISOR HACk -g Express617_v1.chrs.fa -b Assemblytics.INS.DEL.visor.uniq.no.chrs.bed -o haplo
cut -f1,2 haplo/*.fai Express617_v1.chrs.fa.fai > haplochroms.dim.tsv
cat haplochroms.dim.tsv | sort | awk '$2 > maxvals[$1] {lines[$1]=$0; maxvals[$1]=$2} END { for (tag in lines) print lines[tag] }' > maxdims.tsv
awk 'OFS=FS="\t"''{print $1, "1", $2, "100.0", "100.0"}' maxdims.tsv > shorts.laser.simple.bed
totalbases=$(cut -f1,2 haplo/h1.fa.fai | cut -f2 | awk '{sum+=$1;} END{print sum;}')
VISOR LASeR -g Express617_v1.chrs.fa -s haplo -o laser.out --threads 5 --coverage 5 --fastq -b shorts.laser.simple.bed
#count coverage
#1 from fastq
simbases=$(cat laser.out/r.fq | paste - - - - | cut -f 2 | tr -d '\n' | wc -c)
echo $((simbases/ totalbases)) #returns 4.99
#2 from bam
mosdepth depth -n -x -b 500 laser.out/sim.srt.bam && column -t laser.out/depth.mosdepth.summary.txt #returns min/max/mean coverage per contig
I am pretty sure that, especially for very low coverages ( < 1), you may have higher coverage fluctuations but this is not happening on this example (at least, not on my side).
Thanks,
Davide
BTW, the LASeR version which accepts --fastq as input is included in the new tar attached.
I will push these changes to the master branch as soon as I hear back from you.
Thanks,
Davide
Ok, great thanks! I will test out this one. It could be the bam to fastq conversion. So strange! Will re-run everything and get back to you.
Agnieszka
Hi @agolicz,
I can confirm that bamToFastq is causing the issue (well, that's actually not an issue). I guess that bamToFastq do not account for multi-mappings (that is, reads that have a primary alignment as well as supplementary alignments) so that the same read is in the end extracted multiple times. A simple check on the output FASTQ from LASeR and bamToFastq unveil the mistery:
#true # of reads
grep "^@" laser.fq | wc -l #271879, in my case
# of reads from bamToFastq
grep "^@" bamtofastq.fq | wc -l #367849, in my case
#de-duplicating
grep "^@" bamtofastq.fq | sort | uniq | wc -l #271879, matching ground truth
Ok, mystery solved! I suppose I did not expect that many reads to have secondary alignments... But it is a bit of a weird genome... How many secondary alignments do you normally see in the human genome? I will test the LASeR --fastq anyways and get back to you.
If you are pushing changes can I suggest one more thing? If there is overlapping regions in bed file HAcK gives the following message: 'xxx variants will be skipped for current chromosome as they overlap others'. It would be great if along the message the bed entries for the skipped variants were also printed. It helps with debugging issues with bed file and possibly producing a final vcf which reflects variants used for haplotype building.
Thanks for all the help. VISOR is a great tool and will help us immensely.
Agnieszka
How many secondary alignments do you normally see in the human genome?
Quite a lot, especially in synthetic datasets containing a lot of SVs. This is also reported here, for instance, but it depends a lot on the parameters used for mapping I have to say.
It would be great if along the message the bed entries for the skipped variants were also printed. It helps with debugging issues with bed file and possibly producing a final vcf which reflects variants used for haplotype building.
I know but this is not as trivial as it seems, because of how variants are handled internally by HACk. For instance, a standard cut&paste translocation is translated internally into a deletion and an insertion that do not have a corresponding key (entry) in the original BED. A possible solution is to build a second dictionary where each key is the entry in the VCF and the corresponding value a tuple of variants derived from that entry but I'm honestly looking for something more elegant. I'll experiment with possible strategies as soon as I have time.
Thanks for all the help. VISOR is a great tool and will help us immensely.
Thanks.
Davide
Just wanted to confirm that all seems to have worked well :)
Thanks! Agnieszka
Hi @agolicz,
Thanks, I've just pushed the changes to the master branch.
Best,
Davide
Hi @agolicz,
not a tool, but this script here can convert a HACk BED with standard SVs (DEL,DUP,INV,INS,BND) into a VCF that can be in turn used with downstream softwares like truvari for benchmarking. Usage is something like:
python converter.py <input.reference.fasta> <input.hack.bed> <output.hack.vcf>
Hope this helps,
Davide
Hi, I have a quick question about converter.py
I converted mv HACk bed to vcf, but I noticed that for while start position for insertions seems to get updated with +1, this is not the case for deletions. Is there a reason for that? Sample code snippet
python converter.py Express617_v1.chrs.fa Assemblytics.INS.DEL.visor.uniq.no.chrs.bed Assemblytics.INS.DEL.visor.uniq.no.chrs.vcf
#Insertion - different start
grep 139536 Assemblytics.INS.DEL.visor.uniq.no.chrs.bed
chrA01 139535 139536 insertion TTTTTTAAAAAATTTTAAATTTATTCAAATAAAAAAAAAACTTTTGTACAGAGTTTTATCTGCTACCATATTTAAATACAATGAAATGCAAACCAAAAAAAAATATGCTTCTATATCTCTCAAGCTATACTAGCCTCTGTTAAACCAATCCTCCATCATCTTCTCATATTTACCTCTTACCCTCCTTCTCAAAGAAGATATTCTGTTTCTAATGAGCTTATCAAGCCTTGCAATCAGACAGGTAGCTGGCTGCGATGCCTCACCCACTCTTCTCACATTCCTCTCATGCCATAACGCATAAGCCACGGCCTGAAAGCAATAACGCAGTAACACAGTCGAACTCCACTTATGTAAACCCTTCTCAACAATCCGAAGCACTCTATCCCATTTAC 0
grep 139536 Assemblytics.INS.DEL.visor.uniq.no.chrs.vcf
chrA01 139536 var1 N <INS> . PASS SVTYPE=INS;END=139536;SVLEN=392 GT 1/1
#Deletion - same start
grep 161804 Assemblytics.INS.DEL.visor.uniq.no.chrs.bed
chrA01 161557 161804 deletion None 0
grep 161804 Assemblytics.INS.DEL.visor.uniq.no.chrs.vcf
chrA01 161557 var2 N <DEL> . PASS SVTYPE=DEL;END=161804;SVLEN=-247 GT 1/1
I seem to have gone down coordinate rabbit hole a bit and noticed that different tools seem to handle bed SV coordinates a bit differently... Just to confirm, for the two attached SVs I would expect the coordinates to be: Insertion: chr1S 754740078 754740079 Deletion: chr1S 754768762 754768819
Is that correct?
Hi @agolicz,
Because in VISOR INS events are added immediately after the END coordinate specified in your BED, in the corresponding VCF event you should see the END coordinate as POS (which is exactly what I see in your example here). For DEL events, POS and END are the first and last deleted bases.
Best,
Davide
Yes thanks, I think you are correct. The INS position got me confused. Could you double check if my INS and DEL coordinates based on IGV are correct? Just to make sure I get it now...
Agnieszka
Yes,
they look good to me. Let me know if I can help further.
Thanks,
Davide
Brilliant thanks. I think that should do it!
Agnieszka
Also as a note if anyone ever encounters this thread for the variants pictured above:
Assemblytics output (Assemblytics_structural_variants.bed):
chr1S 754740080 754740080 Assemblytics_w_183750 25377 + Insertion 0 25377 ptg000015l:1377388-1402765:+ within_alignment
chr1S 754768763 754768820 Assemblytics_w_183752 57 + Deletion 57 0 ptg000015l:1419021-1419021:+ within_alignment
Corresponding vcf:
chr1S 754740079 svim_asm.INS.33071 C N . PASS SVTYPE=INS;END=754740079;SVLEN=25377 GT 1/1
chr1S 754768762 svim_asm.DEL.34093 N A . PASS SVTYPE=DEL;END=754768819;SVLEN=-57 GT 1/1
Corresponding HACk bed coordinates:
chr1S 754740078 754740079
chr1S 754768762 754768819
Hi, I have a .bed file produced by Assemblytics.
The coordinates are as follows: reference ref_start ref_stop type chrC01 16417864 16417924 Deletion chrC01 883429 883429 Insertion
As it is a .bed file I assume coordinates are 0-based. I am wondering how to convert Insertion coordinates to HACk BED, were for insertions column 2 is supposed to be: breakpoint-1 and column 3: breakpoint.
Should a corresponding HACk BED be?:
chrC01 16417864 16417924 deletion chrC01 883428 883429 insertion
Thanks!
Agnieszka
Here is a further example of Assemblytics bed: http://assemblytics.com/analysis.php?code=arabidopsis