davidebolo1993 / VISOR

VarIant SimulatOR for short, long and linked reads
GNU Lesser General Public License v3.0
41 stars 10 forks source link

How to know the genomic origin (position) of the simlulated reads? #11

Closed biozzq closed 4 years ago

biozzq commented 4 years ago

Dear @davidebolo1993

Sometimes, the simulated reads cannot be mapped to the reference genome exactly, thus, if we know the geonmic origin of the simulated reads, we can statistic the accuray of mapping. I found that the VISOR only kept the alignment file (BAM file) after simulation. Is it possible for us to calculate the accuray of mapping from the BAM file?

Best, Zheng Zhuqing

davidebolo1993 commented 4 years ago

Hi @biozzq,

For BAM files generated with SHORtS this is not a big issue, as wgsim stores the region used for simulating inside the read name. A quick bash one-liner can be:

samtools view in.bam | cut -f1,4 | awk -F"_" '$1=$1' OFS="\t" | awk '{if ($7 >= $2 && $7 <= $3)print "correct"; else print "wrong"}' OFS="\t" > results.txt

For LASeR, this is not that trivial, as PBSIM does not store such an information inside the read name. However, if the aim is to test the capability of the aligner (minimap2, for VISOR) to correctly map reads originating from a certain region, one can simply simulate reads from that region and then parse the final BAM to check how many reads map elsewhere. For instance, assuming you have a BAM file simulated with LASeR using the BED:

chr20   1   20000000    100.0   100.0

one can roughly check the mapping accuracy with:

while IFS=$'\t' read -r chrom start end purity capture; do 

    samtools view in.bam | cut -f3,4 | awk -v v1="$chrom" -v v2="$start" -v v3="$end" '{ if ("$1" == v1 && "$2">=v2 && "$2" <= v3) print "correct"; else print "wrong"}' > results.txt

done < in.bed

I hope this helps,

Davide

biozzq commented 4 years ago

Dear @davidebolo1993

I found that although PBSIM does not store the information inside the read name, but it will ouput the MAF file recording alignments between reference sequence (this reference has been modified by SVs) and simulated reads. We can use this information and the simulated SVs to infer the indeed positions on the reference genome (the genome used in the minimap2 mapping).

Best regards, Zheng Zhuqing