davidebolo1993 / VISOR

VarIant SimulatOR for short, long and linked reads
GNU Lesser General Public License v3.0
41 stars 11 forks source link

Assemblytics bed to HACk bed #19

Closed agolicz closed 2 years ago

agolicz commented 2 years ago

Hi, I have a .bed file produced by Assemblytics.

The coordinates are as follows: reference ref_start ref_stop type chrC01 16417864 16417924 Deletion chrC01 883429 883429 Insertion

As it is a .bed file I assume coordinates are 0-based. I am wondering how to convert Insertion coordinates to HACk BED, were for insertions column 2 is supposed to be: breakpoint-1 and column 3: breakpoint.

Should a corresponding HACk BED be?:

chrC01 16417864 16417924 deletion chrC01 883428 883429 insertion

Thanks!

Agnieszka

Here is a further example of Assemblytics bed: http://assemblytics.com/analysis.php?code=arabidopsis

davidebolo1993 commented 2 years ago

Hi @agolicz,

yes, in standard BED start coordinates are 0-based (end coordinates are not). For your example (assuming this is a "test.bed" file) from Assemblytics, something like the line below should work:

awk 'OFS=FS="\t"''{if($4 == "Insertion") print $1, $3-1, $3, $4; else print $0}' test.bed

Let me know if I can help further,

Davide

agolicz commented 2 years ago

Beautiful! Many thanks. One more question do you have a recommended tool to convert HACk bed to vcf?

davidebolo1993 commented 2 years ago

Hi @agolicz,

not a tool, but this script here can convert a HACk BED with standard SVs (DEL,DUP,INV,INS,BND) into a VCF that can be in turn used with downstream softwares like truvari for benchmarking. Usage is something like:

python converter.py <input.reference.fasta> <input.hack.bed> <output.hack.vcf>

Hope this helps,

Davide

agolicz commented 2 years ago

Perfect, exactly what I was after :).

Agnieszka

agolicz commented 2 years ago

Hi again, I am trying to convert the SHORtS output to fastq (we want to use different mappers), but something a bit strange seems to be happening.

First I tried samtools fastq, but most of the reads were discarded as singletons. The I tried bedtools bamtofastq but it complains (for some reads) that reads are marked as paired but there is no mate in the bam file.

Commands used: VISOR SHORtS -g Express617_v1.chrs.fa -s haplo -b shorts.laser.bed -o shorts.10x.out --threads 8 --coverage 10 --length 100 > shorts.10x.stdout 2>shorts.10x.err Finishes without isses. err file is empty tail -n 2 shorts.10x.stdout [05/10/2021 10:39:53][Message] Merging simulated BAM [05/10/2021 10:44:15][Message] Done

Attempted extraction with samtools: samtools view shorts.10x.bam | wc -l 75897493 samtools fastq -1 10x.shorts.R1.fq -2 10x.shorts.R2.fq -0 /dev/null -s /dev/null -n shorts.10x.bam [M::bam2fq_mainloop] discarded 75846243 singletons [M::bam2fq_mainloop] processed 75849457 reads

Attempted extraction with bedtools: samtools sort -n -o shorts.10x.qsort.bam shorts.10x.bam bamToFastq -i shorts.10x.qsort.bam -fq shorts.10x.R1.fq -fq2 shorts.10x.R2.fq *****WARNING: Query c1h1fh_chrA01_1_29974771|2319701|2320211|3:0:0|5:0:0|7a57b is marked as paired, but its mate does not occur next to it in your BAM file. Skipping.

Indeed when I just grep for the read name it occurs in the bam file only once. One would expect it twice for paired reads. view shorts.10x.bam | grep "c1h1fh_chrA01_1_29974771|2319701|2320211|3:0:0|5:0:0|7a57b" c1h1fh_chrA01_1_29974771|2319701|2320211|3:0:0|5:0:0|7a57b 73 chrA01 2332309 2 100M = 2332309 0 GATGGGTTACCAGATAAAGTAAAAAGTTTATTTGTAATTTCCAGATATGTTTAGATGGAGAGAGAGAGAGTGTGTGTATCTCTGTCAAGGGCGTAATTGG 2222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222 NM:i:3 ms:i:170 AS:i:170 nn:i:0 tp:A:P cm:i:2 s1:i:27 s2:i:0 de:f:0.03 MD:Z:21T32T25G19 rl:i:0 RG:Z:illumina

I also noticed that the reads which have no mate in bam have the same mapping location for both R1 and R2. I assume this is expected behavior, but now sure what it signifies.

Agnieszka

davidebolo1993 commented 2 years ago

Hi @agolicz,

I think this is an issue related to the names used by VISOR in read headers, which are modified to facilitate downstream benchmarking (and debugging, I must say). For the alignment step I think this is not a big issue, because of pairs being read in parallel from 2 (temporary) FASTQ files, but for re-building the initial FASTQ files I guess this can be problematic. If you can share your data, I can try to replicate the issue and dig a bit better into this one as soon as I have time.

Thanks,

Davide

agolicz commented 2 years ago

Thanks, I'll share the files later today. Do you think it would possible to have an option dump the simulated reads directly to fastq and have fastq along the bam. That would save the hassle of rebuilding...

If you are too busy now, I could modify my local copy and test if you could point me to the part of the code which should be changed.

davidebolo1993 commented 2 years ago

Yes, that' the idea, basically adding a --fastq flag that stores the fastq in the output folder. I think I can have a look into this in the next few days but feel free to test your ideas and get back with suggestions.

Thanks,

Davide

davidebolo1993 commented 2 years ago

Hi @agolicz,

attached a new version of VISOR.py and SHORtS.py. Can you give these a try? You need to un-tar the file attached and replace VISOR.py and SHORtS.py in your VISOR folder with the copies attached before re-running setup.py. SHORtS (and LASeR, same stuff, but will be added in the official release) now accepts a --fastq flag that dumps fastq files (r1.fq and r2.fq) to the output folder without changing the previous simulation model. Hopefully, these can be further re-used for testing other aligners. It works on a small test, but maybe needs some refining.

Thanks,

Davide newvers.tar.gz

agolicz commented 2 years ago

Thanks heaps. Testing now. I will let you know how it goes.

In the meantime I modified the BulkSim function to accept fastq file locations as two additional arguments and then dumped the reads there, but this is much nicer :)

bedtools bamtofastq seemed to have worked on LASeR bams without issues, so hopefully those should be fine.

agolicz commented 2 years ago

Seems to have worked fine!

I have one last question regarding coverages (genome size: 764629779 bp). For SHORtS coverage is exactly as I would expect it: For --coverage 10 wc -l r1.fq 152696444 r1.fq

float(152696444)/42100/764629779 9.984991965634652

Foe LASeR in the --covarege 5 simulation I got cat laser.5x.fq | paste - - - - | cut -f 2 | tr -d '\n' | wc -c

float(10376507526)/764629779 13.570629618389477

Is the coverage calculation different for long reads (not total bp/genome size)? I only noticed because LASeR fastq files are much bigger than SHORts.

Commands to get laser.5x.fq VISOR LASeR -g Express617_v1.chrs.fa -s haplo -b shorts.laser.bed -o laser.5x.out --threads 8 --coverage 5 > laser.5x.stdout 2>laser.5x.err bamToFastq -i laser.5x.bam -fq laser.5x.fq > btools.laser.5x.out 2>btools.laser.5x.err

Agnieszka

davidebolo1993 commented 2 years ago

Is(are) the contig(s) used for simulation same size of the reference genome? I've never seen such an issue with LASeR (more fluctuations than those observed with short reads though) and I'm honestly using this quite frequently.

Thanks,

Davide

agolicz commented 2 years ago

Yep. The reference genome is Express617_v1.chrs.fa. Same that was used for HACk

VISOR HACk -g Express617_v1.chrs.fa -b Assemblytics.INS.DEL.visor.uniq.no.chrs.bed -o haplo > HACk.log cut -f1,2 haplo/h1.fa.fai > haplochroms.dim.tsv awk 'OFS=FS="\t"''{print $1, "1", $2, "100.0", "100.0"}' haplochroms.dim.tsv > shorts.laser.bed

Total length Express617_v1.chrs.fa: 764629779

The simulated and reference haplotypes as seem similar.

t<-read.csv("shorts.laser.bed", sep="\t", header=F) sum(t$V3) [1] 763897743 t<-read.csv("Express617_v1.chrs.fa.fai", sep="\t", header=F) sum(t$V2) [1] 764629779

I have single haplotype because it's a plant and highly inbred. Is it possible that LASeR simulates reads as if there were two haplotypes and I would need to cut coverage in half?

ll haplo total 758429 drwxr-sr-x 2 agolicz agcpgl 2 Oct 5 10:00 ./ drwxr-sr-x 12 agolicz agcpgl 87 Oct 6 14:50 ../ -rw-r--r-- 1 agolicz agcpgl 776629534 Oct 5 10:00 h1.fa -rw-r--r-- 1 agolicz agcpgl 598 Oct 5 10:00 h1.fa.fai

davidebolo1993 commented 2 years ago

Just re-checked with a human chromosome and the simulated coverage is what I expected.

mkdir coveragetest && cd coveragetest
curl -LO ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa
samtools faidx GRCh38_full_analysis_set_plus_decoy_hla.fa chr22 > chr22.fa
samtools faidx chr22.fa
echo -e "chr22\t15000000\t16000000\tdeletion\tNone\t0\nchr22\t20000000\t21000000\tinversion\tNone\t0\nchr22\t30000000\t31000000\ttandem duplication\t2\t0" > HACk.h1.bed
VISOR HACk -g chr22.fa -b HACk.h1.bed -o hack.1.out
cut -f1,2 hack.*.out/*.fai chr22.fa.fai > haplochroms.dim.tsv
cat haplochroms.dim.tsv | sort  | awk '$2 > maxvals[$1] {lines[$1]=$0; maxvals[$1]=$2} END { for (tag in lines) print lines[tag] }' > maxdims.tsv
awk 'OFS=FS="\t"''{print $1, "1", $2, "100.0", "100.0"}' maxdims.tsv > shorts.laser.simple.bed
VISOR LASeR -g chr22.fa -s hack.1.out -b shorts.laser.simple.bed -o laser.1.out --threads 7 --coverage 1 --tag --fastq
cat laser.1.out/r.fq | paste - - - - | cut -f 2 | tr -d '\n' | wc -c #50862344
#contig size is 50818468
#from ipython
50818468/50862344 = 0.99

Can't really say what's the issue here. Can you share a similar code snippet so that I can try to reproduce this weird behaviour?

agolicz commented 2 years ago

I must be doing something strange...

VISOR HACk -g Express617_v1.chrs.fa -b Assemblytics.INS.DEL.visor.uniq.no.chrs.bed -o haplo > HACk.log cut -f1,2 haplo/h1.fa.fai > haplochroms.dim.tsv awk 'OFS=FS="\t"''{print $1, "1", $2, "100.0", "100.0"}' haplochroms.dim.tsv > shorts.laser.bed VISOR LASeR -g Express617_v1.chrs.fa -s haplo -b shorts.laser.bed -o laser.5x.out --threads 8 --coverage 5 > laser.5x.stdout 2>laser.5x.err mv laser.5x.out/sim.srt.bam laser.5x.bam bamToFastq -i laser.5x.bam -fq laser.5x.fq > btools.laser.5x.out 2>btools.laser.5x.err cat laser.5x.fq | paste - - - - | cut -f 2 | tr -d '\n' | wc -c

The two files: http://sendanywhe.re/HH3597LW http://sendanywhe.re/XACV97J6

davidebolo1993 commented 2 years ago

Hi @agolicz, sorry for being late on this.

I honestly can't say what is happening on your side, because I double checked and I actually ended up having the proper coverage.

VISOR HACk -g Express617_v1.chrs.fa -b Assemblytics.INS.DEL.visor.uniq.no.chrs.bed -o haplo
cut -f1,2 haplo/*.fai Express617_v1.chrs.fa.fai > haplochroms.dim.tsv
cat haplochroms.dim.tsv | sort  | awk '$2 > maxvals[$1] {lines[$1]=$0; maxvals[$1]=$2} END { for (tag in lines) print lines[tag] }' > maxdims.tsv
awk 'OFS=FS="\t"''{print $1, "1", $2, "100.0", "100.0"}' maxdims.tsv > shorts.laser.simple.bed
totalbases=$(cut -f1,2 haplo/h1.fa.fai | cut -f2 | awk '{sum+=$1;} END{print sum;}')
VISOR LASeR -g Express617_v1.chrs.fa -s haplo -o laser.out --threads 5 --coverage 5 --fastq -b shorts.laser.simple.bed
#count coverage
#1 from fastq
simbases=$(cat laser.out/r.fq | paste - - - - | cut -f 2 | tr -d '\n' | wc -c)
echo $((simbases/ totalbases)) #returns 4.99
#2 from bam
mosdepth depth -n -x -b 500 laser.out/sim.srt.bam && column -t laser.out/depth.mosdepth.summary.txt #returns min/max/mean coverage per contig 

I am pretty sure that, especially for very low coverages ( < 1), you may have higher coverage fluctuations but this is not happening on this example (at least, not on my side).

Thanks,

Davide

davidebolo1993 commented 2 years ago

BTW, the LASeR version which accepts --fastq as input is included in the new tar attached.

newvers.tar.gz

I will push these changes to the master branch as soon as I hear back from you.

Thanks,

Davide

agolicz commented 2 years ago

Ok, great thanks! I will test out this one. It could be the bam to fastq conversion. So strange! Will re-run everything and get back to you.

Agnieszka

davidebolo1993 commented 2 years ago

Hi @agolicz,

I can confirm that bamToFastq is causing the issue (well, that's actually not an issue). I guess that bamToFastq do not account for multi-mappings (that is, reads that have a primary alignment as well as supplementary alignments) so that the same read is in the end extracted multiple times. A simple check on the output FASTQ from LASeR and bamToFastq unveil the mistery:

#true # of reads
grep "^@" laser.fq | wc -l #271879, in my case
# of reads from bamToFastq
grep "^@" bamtofastq.fq | wc -l #367849, in my case
#de-duplicating
grep "^@" bamtofastq.fq | sort | uniq | wc -l #271879, matching ground truth
agolicz commented 2 years ago

Ok, mystery solved! I suppose I did not expect that many reads to have secondary alignments... But it is a bit of a weird genome... How many secondary alignments do you normally see in the human genome? I will test the LASeR --fastq anyways and get back to you.

If you are pushing changes can I suggest one more thing? If there is overlapping regions in bed file HAcK gives the following message: 'xxx variants will be skipped for current chromosome as they overlap others'. It would be great if along the message the bed entries for the skipped variants were also printed. It helps with debugging issues with bed file and possibly producing a final vcf which reflects variants used for haplotype building.

Thanks for all the help. VISOR is a great tool and will help us immensely.

Agnieszka

davidebolo1993 commented 2 years ago

How many secondary alignments do you normally see in the human genome?

Quite a lot, especially in synthetic datasets containing a lot of SVs. This is also reported here, for instance, but it depends a lot on the parameters used for mapping I have to say.

It would be great if along the message the bed entries for the skipped variants were also printed. It helps with debugging issues with bed file and possibly producing a final vcf which reflects variants used for haplotype building.

I know but this is not as trivial as it seems, because of how variants are handled internally by HACk. For instance, a standard cut&paste translocation is translated internally into a deletion and an insertion that do not have a corresponding key (entry) in the original BED. A possible solution is to build a second dictionary where each key is the entry in the VCF and the corresponding value a tuple of variants derived from that entry but I'm honestly looking for something more elegant. I'll experiment with possible strategies as soon as I have time.

Thanks for all the help. VISOR is a great tool and will help us immensely.

Thanks.

Davide

agolicz commented 2 years ago

Just wanted to confirm that all seems to have worked well :)

Thanks! Agnieszka

davidebolo1993 commented 2 years ago

Hi @agolicz,

Thanks, I've just pushed the changes to the master branch.

Best,

Davide

agolicz commented 2 years ago

Hi @agolicz,

not a tool, but this script here can convert a HACk BED with standard SVs (DEL,DUP,INV,INS,BND) into a VCF that can be in turn used with downstream softwares like truvari for benchmarking. Usage is something like:

python converter.py <input.reference.fasta> <input.hack.bed> <output.hack.vcf>

Hope this helps,

Davide

Hi, I have a quick question about converter.py

I converted mv HACk bed to vcf, but I noticed that for while start position for insertions seems to get updated with +1, this is not the case for deletions. Is there a reason for that? Sample code snippet

python converter.py Express617_v1.chrs.fa Assemblytics.INS.DEL.visor.uniq.no.chrs.bed Assemblytics.INS.DEL.visor.uniq.no.chrs.vcf

#Insertion - different start
grep 139536 Assemblytics.INS.DEL.visor.uniq.no.chrs.bed
chrA01  139535  139536  insertion       TTTTTTAAAAAATTTTAAATTTATTCAAATAAAAAAAAAACTTTTGTACAGAGTTTTATCTGCTACCATATTTAAATACAATGAAATGCAAACCAAAAAAAAATATGCTTCTATATCTCTCAAGCTATACTAGCCTCTGTTAAACCAATCCTCCATCATCTTCTCATATTTACCTCTTACCCTCCTTCTCAAAGAAGATATTCTGTTTCTAATGAGCTTATCAAGCCTTGCAATCAGACAGGTAGCTGGCTGCGATGCCTCACCCACTCTTCTCACATTCCTCTCATGCCATAACGCATAAGCCACGGCCTGAAAGCAATAACGCAGTAACACAGTCGAACTCCACTTATGTAAACCCTTCTCAACAATCCGAAGCACTCTATCCCATTTAC  0
grep 139536 Assemblytics.INS.DEL.visor.uniq.no.chrs.vcf
chrA01  139536  var1    N       <INS>   .       PASS    SVTYPE=INS;END=139536;SVLEN=392 GT      1/1

#Deletion - same start
grep 161804 Assemblytics.INS.DEL.visor.uniq.no.chrs.bed
chrA01  161557  161804  deletion        None    0
grep 161804 Assemblytics.INS.DEL.visor.uniq.no.chrs.vcf
chrA01  161557  var2    N       <DEL>   .       PASS    SVTYPE=DEL;END=161804;SVLEN=-247        GT      1/1
agolicz commented 2 years ago

I seem to have gone down coordinate rabbit hole a bit and noticed that different tools seem to handle bed SV coordinates a bit differently... Just to confirm, for the two attached SVs I would expect the coordinates to be: Insertion: chr1S 754740078 754740079 Deletion: chr1S 754768762 754768819

Is that correct?

Sample_insertion Sample_deletion
davidebolo1993 commented 2 years ago

Hi @agolicz,

Because in VISOR INS events are added immediately after the END coordinate specified in your BED, in the corresponding VCF event you should see the END coordinate as POS (which is exactly what I see in your example here). For DEL events, POS and END are the first and last deleted bases.

Best,

Davide

agolicz commented 2 years ago

Yes thanks, I think you are correct. The INS position got me confused. Could you double check if my INS and DEL coordinates based on IGV are correct? Just to make sure I get it now...

Agnieszka

davidebolo1993 commented 2 years ago

Yes,

they look good to me. Let me know if I can help further.

Thanks,

Davide

agolicz commented 2 years ago

Brilliant thanks. I think that should do it!

Agnieszka

agolicz commented 2 years ago

Also as a note if anyone ever encounters this thread for the variants pictured above:

Assemblytics output (Assemblytics_structural_variants.bed):

chr1S   754740080   754740080   Assemblytics_w_183750   25377   +   Insertion   0   25377   ptg000015l:1377388-1402765:+    within_alignment
chr1S   754768763   754768820   Assemblytics_w_183752   57  +   Deletion    57  0   ptg000015l:1419021-1419021:+    within_alignment

Corresponding vcf:

chr1S   754740079   svim_asm.INS.33071  C   N   .   PASS    SVTYPE=INS;END=754740079;SVLEN=25377    GT  1/1
chr1S   754768762   svim_asm.DEL.34093  N   A   .   PASS    SVTYPE=DEL;END=754768819;SVLEN=-57  GT  1/1

Corresponding HACk bed coordinates:

chr1S   754740078   754740079
chr1S   754768762   754768819