Error in soft-clipped location calculation

jeremymchacon commented 1 year ago

Hello,

I discovered a couple of bugs when working with this repo, in the perl scripts which detect the soft-clipped reads and output those as well as their position. The soft-clipped positions on the + strand are always off by a small amount, and the positions on the - strand are occassionally off by a variable amount. Below I describe the problem and what I think should fix it.

The problem is with generating the position of the soft-clipping (aka the poly-A location), in detect_from_soft_cliping_with_bam_file_polyT_beginning_based_on_S.pl, lines 59 and 117. These should not have +$index, because the position of the soft-clipped bases at the start of the read is the read POS from the sam file, which starts at the first MATCHED base (rather than the first matched base minus the number of preceding soft-clipped bases).

There is a related problem with detect_from_soft_cliping_with_bam_file_polyA_end_based_on_S.pl lines 57, 114. These currently treat the poly-A position as the position of the first match (the sam POS), plus the read length, minus the soft-clipped length at the end of the read. However, if there is soft-clipping at the beginning of the read, then using the total read length is misleading. Instead, a different way to get to the correct position is to add the sam POS to the number of matched bases plus the number of inserted bases (which should be rare anyways).

Hope this is useful! Thanks!

RENXI-NUS commented 1 year ago

Hi Jeremy,

Sorry for the late reply. As shown in the perl script, only the reads that are having at least three soft-clipping bases (in Cigar string if ($field1[5] =~ /^(([3-9]|[1-9]\d)S)/){ ) and three thymine (in the 5' end of read sequence if ($field1[9] =~ /^(T)\1{3,40}/){ ) would be kept in the soft-clipping read file. The polyT location (- strand) should be at the soft-clipping cleavage site. So you should have the +$index. That is similar for the polyA location (+ strand) prediction. Hope this clarifies.

jeremymchacon commented 1 year ago

Hello,

Thanks for the response. I think I may not have been clear, so please let me clarify.

In a sam file, the POS field (field 3) is the position of the first matched base in the sequence. Since the soft-clipped part of the sequence did not match, these bases are not included when calculating POS. Put differently, in the diagram pasted in your message showing the TTTTT and the X, since the TTTTT didn't match, they will not be part of the calculation, and the POS will be the actual location where you have put X. In this example, $index = 5, since there are 5 soft-clipped bases. By adding $index to $field1[3] (POS), the script is placing the soft-clipped position 5 bases to the right of the X, which is incorrect, at least by my understanding of the process. The location of the soft-clipping is simply $field1[3]. I have added this info to the diagram to hopefully clarify.

IPA_site_diagram

The related issue with poly-A sites occurs if there is additionally soft-clipping at the beginning of the read (not the poly-A side). These should not be included in the IPA site calculation, because they don't match the genome, and so the poly-A site position should be the first matched position (field 3) plus the amount of matched bases. This is the sequence length minus the length of soft-clipping on both sides.

I've written a python script which solves these problems, let me know if you'd like to use it.

Best, Jeremy

RENXI-NUS commented 1 year ago

Hi Jeremy,

Thanks a lot for pointing out these problems. Yes, you are right. I have revised the Perl scripts accordingly. Appreciate it very much.

RENXI-NUS / IPA-neoantigen-prediction

Error in soft-clipped location calculation #4