Closed biozzq closed 4 years ago
Hi @biozzq,
For BAM files generated with SHORtS this is not a big issue, as wgsim stores the region used for simulating inside the read name. A quick bash one-liner can be:
samtools view in.bam | cut -f1,4 | awk -F"_" '$1=$1' OFS="\t" | awk '{if ($7 >= $2 && $7 <= $3)print "correct"; else print "wrong"}' OFS="\t" > results.txt
For LASeR, this is not that trivial, as PBSIM does not store such an information inside the read name. However, if the aim is to test the capability of the aligner (minimap2, for VISOR) to correctly map reads originating from a certain region, one can simply simulate reads from that region and then parse the final BAM to check how many reads map elsewhere. For instance, assuming you have a BAM file simulated with LASeR using the BED:
chr20 1 20000000 100.0 100.0
one can roughly check the mapping accuracy with:
while IFS=$'\t' read -r chrom start end purity capture; do
samtools view in.bam | cut -f3,4 | awk -v v1="$chrom" -v v2="$start" -v v3="$end" '{ if ("$1" == v1 && "$2">=v2 && "$2" <= v3) print "correct"; else print "wrong"}' > results.txt
done < in.bed
I hope this helps,
Davide
Dear @davidebolo1993
I found that although PBSIM does not store the information inside the read name, but it will ouput the MAF file recording alignments between reference sequence (this reference has been modified by SVs) and simulated reads. We can use this information and the simulated SVs to infer the indeed positions on the reference genome (the genome used in the minimap2 mapping).
Best regards, Zheng Zhuqing
Dear @davidebolo1993
Sometimes, the simulated reads cannot be mapped to the reference genome exactly, thus, if we know the geonmic origin of the simulated reads, we can statistic the accuray of mapping. I found that the VISOR only kept the alignment file (BAM file) after simulation. Is it possible for us to calculate the accuray of mapping from the BAM file?
Best, Zheng Zhuqing