elzbth / jitterbug

Jitterbug is a bioinformatic software that predicts insertion sites of transposable elements in a sample sequenced by short paired-end reads with respect to an assembled reference.
17 stars 8 forks source link

No softclipped support #5

Closed choudhuryroy closed 7 years ago

choudhuryroy commented 8 years ago

Hi,

I am trying to run jitterbug by running: jitterbug.py -n 10 -b 50000000 -c 4 -q 35 -t Family myaligned.bam TE_annot.gff

After the run is over, I get the following files with the details of TE insertions: jitterbug.TE_insertions_paired_clusters.gff3 and jitterbug.TE_insertions_paired_clusters.supporting_clusters.table

In total, it finds 271 insertions, and I wanted to filter them based on cluster span and softclipped support.

However, looking at the gff file I found that none of these insertions have softclipped support.

CM002870.1      jitterbug       TE_insertion    6952053 6952271 .       .       .       supporting_fwd_reads=8; supporting_rev_reads=4; cluster_pair_ID=10; lib=None; Inserted_TE_tags_fwd=Clust3778, Clust2088, Clust1037; Inserted_TE_tags_rev=Clust3778; fwd_cluster_span=171; rev_cluster_span=8; softclipped_pos=(-1, -1); softclipped_support=0; het_core_reads=-1; zygosity=-1.000
CM002870.1      jitterbug       TE_insertion    9475320 9475381 .       .       .       supporting_fwd_reads=7; supporting_rev_reads=7; cluster_pair_ID=11; lib=None; Inserted_TE_tags_fwd=Clust2580; Inserted_TE_tags_rev=Clust2580; fwd_cluster_span=0; rev_cluster_span=57; softclipped_pos=(-1, -1); softclipped_support=0; het_core_reads=-1; zygosity=-1.000

I generated the bam files using bwa mem without the -M flag.

I wanted to be sure that the alignment tool and option are correct. Can you help me solve this issue? Thanks.

mbosio85 commented 8 years ago

Hi,

The code checking for softclipped read is the following:

def calc_anchor_is_softclipped(self, read1, read2):
    #if the read is softclipped, return position at which it is clipped. If not, return None
    if self.read1_is_TE: #then read2 is anchor
        if read2.is_reverse and read2.cigar[0][0] == 4: #if the read is rev and softclipped to the left, ie into the interval
            self.anchor_softclipped_pos = read2.pos
        elif read2.cigar[-1][0] == 4: # if the read is fwd an softclipped at the right, ie into the interval
            self.anchor_softclipped_pos = read2.aend
    if self.read2_is_TE: # then read2 is anchor
        if read1.is_reverse and read1.cigar[0][0] == 4:
            self.anchor_softclipped_pos = read1.pos
        elif read1.cigar[-1][0] == 4:
            self.anchor_softclipped_pos = read1.aend
    return None

So it depends on the CIGAR code. I am no expert in bwa but -M should give the secondary alignment as en extra field in the read and should not affect the soft-clipping markup. That being said try to run jitterbug with the default quality value of 15 and cluster size of 2 jitterbug.py -n 10 -b 50000000 -c 2 -q 15 -t Family myaligned.bam TE_annot.gff It should include reads with worse mapping quality and it could insert soft clipped reads too.

let me know if with more relaxed settings it includes soft clipped reads or not,

Mattia

choudhuryroy commented 8 years ago

I tried with the relaxed settings, but still I don't get the soft clipped reads.

I saw another function where the softclipped support is being set. But I cannot find from where this function is called:

  def calc_softclipped_support(self, bam_file): 
    #get softclipped reads that overlap my interval from the sofclipped bam file. these are concordant reads that have their tail softclipped
    reads = bam_file.fetch(self.chr, self.insertion_int_start, self.insertion_int_end)
    for read in reads:
        pos = softclipped_tail(read)
        if pos:
            self.softclipped_pos.append(pos)

    # to these, add the softclipped reads that were from discordant pairs actually forming the cluster
    softclipped_intervals = self.fwd_cluster.get_softclipped_pos() + self.rev_cluster.get_softclipped_pos() + self.softclipped_pos
    if len(softclipped_intervals) == 0:
        return None

    softclipped_intervals.sort()
    #for each softclipped position, count how many times it falls in the range within 4 bp more or less of another softclipped position
    pos_list = []
    #list of lists, each list containing tuples defining the st and end positions around the softclipped position.
    #each list is a set of mutually overlapping intervals
    pos_int_list = []

    current_int_list = []
    for position in softclipped_intervals:
        (st, end) = (position - 4, position + 5)

        if current_int_list == [] or st <= current_int_list[0][1]:
            current_int_list.append((st, end))
        else:
            pos_int_list.append(current_int_list)
            current_int_list = [(st, end)]
    pos_int_list.append(current_int_list)

    for int_list in pos_int_list:
        start = max(st for (st, end) in int_list)
        end = min(end for (st, end) in int_list)
        support = len(int_list)
        pos_list.append((start, end, support))
    #return list of (position, support) tuples, sorted by decreasing support
    pos_list.sort(key = lambda position : position[2], reverse=True)
    self.softclipped_support = pos_list

Could this be the reason for no softclipped support?

mbosio85 commented 8 years ago

Thanks for checking but I'm quite sure that the function doing the soft clip support check is calc_anchor_softclip.

It uses pysam for the job using their format:

cigar An alignment format string. In the python API, the cigar alignment is presented as a list of tuples (operation,length). For example, the tuple [ (0,3), (1,5), (0,2) ] refers to an alignment with 3 matches, 5 insertions and another 2 matches.

What it does it checks if the 1st or last of the tuple has a 4 as first element using this table:

M BAM_CMATCH 0 I BAM_CINS 1 D BAM_CDEL 2 N BAM_CREF_SKIP 3 S BAM_CSOFT_CLIP 4 <----- H BAM_CHARD_CLIP 5 P BAM_CPAD 6 = BAM_CEQUAL 7 X BAM_CDIFF 8

Basically it does not find any read pair which has softclipped reads on the borders of TEI. It could be related to

Checking a bit I found this : https://www.biostars.org/p/95113/

With BWA-MEM/BWA-SW, my tools are complaining about multiple primary alignments. Is it a bug? It is not. Multi-part alignments are possible in the presence of structural variations, gene fusion or reference misassembly. However, representing multi-part alignments in SAM has not been finalized. To make BWA work with your tools, please use option `-M' to flag extra hits as secondary.

  • My guess is that there is no soft-clipped read in the bam file. Or at least that the soft-clipped alignment for a specific read-pair is not the main one in the bam file.
choudhuryroy commented 8 years ago

Hi Mattia, Thanks for the suggestions. I already tried with and without the "-M" option in BWA-MEM. I was still not able to get the soft clipped support. I tried something different to find softclipped reads, I used "samtools view" on the bamfiles to convert into sam and with a simple grep I was able to find the softclipped reads within the CIGAR column of the samfiles. I will give the above method you mentioned a try.

But for me, the function "calc_anchor_is_softclipped" is working. I tried to debug it with some print statements in the "BamReader.py" where the function is actually being called, i.e. inside the function "select_read_pair_one_overlap_TE_annot"

            if read_pair.read1_is_TE and not read_pair.read2_is_TE and not is_mapped_mult_times(read2):
                if min_mapq:
                    if read2.mapq >= min_mapq:

                        read_pair.calculate_outside_interval(int_size, read1, read2)
                        read_pair.calc_anchor_is_softclipped(read1, read2)
                        print read_pair.anchor_softclipped_pos
                        ###### TODO: this is where the Aligned ReadPair objects are deleted

                        #read_pair.read1 = None
                        #read_pair.read2 = None

                        #print read_pair.read1

                        #############
                        #read_pairs_xor_overlap_TE.append(read_pair)
            tabname="[c%s_%d_%d_%s]"%(read_pair.interval_chr,bin_size*(int(read_pair.interval_start)/bin_size),bin_size*(1+int(read_pair.interval_start)/bin_size) ,read_pair.interval_direction  )
            #read_pair_database.insert(read_pair,'',tabname)
            read_pairs_xor_overlap_TE.append([read_pair,tabname])
                        ##overlap_TE_bam_file.write(read_pair.read1)
                        ##overlap_TE_bam_file.write(read_pair.read2)
                else:
                    read_pair.calculate_outside_interval(int_size, read1, read2)
                    read_pair.calc_anchor_is_softclipped(read1, read2)
                    print read_pair.anchor_softclipped_pos
                    #read_pairs_xor_overlap_TE.append(read_pair)
            tabname="[c%s_%d_%d_%s]"%(read_pair.interval_chr,bin_size*(int(read_pair.interval_start)/bin_size),bin_size*(1+int(read_pair.interval_start)/bin_size) ,read_pair.interval_direction  )
            #read_pair_database.insert(read_pair,'',tabname)
            read_pairs_xor_overlap_TE.append([read_pair,tabname])

            elif read_pair.read2_is_TE and not read_pair.read1_is_TE and not is_mapped_mult_times(read1):
                if min_mapq:
                    if read1.mapq >= min_mapq:

                        read_pair.calculate_outside_interval(int_size, read1, read2)
                        read_pair.calc_anchor_is_softclipped(read1, read2)
                        print read_pair.anchor_softclipped_pos
                        #read_pairs_xor_overlap_TE.append(read_pair)
            tabname="[c%s_%d_%d_%s]"%(read_pair.interval_chr,bin_size*(int(read_pair.interval_start)/bin_size),bin_size*(1+int(read_pair.interval_start)/bin_size) ,read_pair.interval_direction  )

            #read_pair_database.insert(read_pair,'',tabname)
            read_pairs_xor_overlap_TE.append([read_pair,tabname])
                        ##overlap_TE_bam_file.write(read_pair.read1)
                        ##overlap_TE_bam_file.write(read_pair.read2)
                else:
                    read_pair.calculate_outside_interval(int_size, read1, read2)
                    read_pair.calc_anchor_is_softclipped(read1, read2)
                    print read_pair.anchor_softclipped_pos
                    #read_pairs_xor_overlap_TE.append(read_pair)
            tabname="[c%s_%d_%d_%s]"%(read_pair.interval_chr,bin_size*(int(read_pair.interval_start)/bin_size),bin_size*(1+int(read_pair.interval_start)/bin_size) ,read_pair.interval_direction  )
            #read_pair_database.insert(read_pair,'',tabname)
            read_pairs_xor_overlap_TE.append([read_pair,tabname])

So, this should make an attribute "read_pair.anchor_softclipped_pos", which is there when I print it. It also creates "read_pairs_xor_overlap_TE", which I guess is being deleted in a few lines after it is inserted into "read_pair_database". The final length of "read_pairs_xor_overlap_TE" is 31110.

        print len(read_pairs_xor_overlap_TE)
        if len(read_pairs_xor_overlap_TE) >= 100000:
        read_pair_database.ingenerator(read_pairs_xor_overlap_TE,'read_pairs')
        #try to get the unique
        tmp = list()
        tmp = [bin_ for  pair,bin_ in read_pairs_xor_overlap_TE]
        bin_list.extend(list(set(tmp)))     
        del read_pairs_xor_overlap_TE
        read_pairs_xor_overlap_TE=list()

            #shift to next pair
            try:
                read1 = valid_discordant_bam.next()
                read2 = valid_discordant_bam.next()
            except StopIteration:
                break
    if len(read_pairs_xor_overlap_TE) >0:
        read_pair_database.ingenerator(read_pairs_xor_overlap_TE,'read_pairs')
        #try to get the unique
        tmp = list()
        tmp = [bin_ for  pair,bin_ in read_pairs_xor_overlap_TE]
        bin_list.extend(list(set(tmp)))
        del read_pairs_xor_overlap_TE
        read_pairs_xor_overlap_TE=list()
    bin_list = list(set(bin_list))
    read_pair_database.insert(bin_list,'','bin_list')
    print bin_list
        print "number discordant read pairs with exactly one read overlapping a TE: %d" % len(read_pairs_xor_overlap_TE)

I see that the bin_list is also being printed. But the "number discordant read pairs with exactly one read overlapping a TE" is always 0 as it is being deleted above.

I am worried about the function "calc_softclipped_support" as the function that writes the gff file requires an object created by it to print the softclipped support, i.e. "self.softclipped_support". I think the function is not being called and thus it is always taking "(st, end, num_support) = (-1, -1, 0)"

    def to_gff(self, cluster_pair_ID, library_name, TE_annot_tag):
        fwd_TE_tags = self.fwd_cluster.get_TE_tags(TE_annot_tag)
        rev_TE_tags = self.rev_cluster.get_TE_tags(TE_annot_tag)

        gff_tags = "supporting_fwd_reads=%d; supporting_rev_reads=%d; cluster_pair_ID=%d; lib=%s; Inserted_TE_tags_fwd=%s; Inserted_TE_tags_rev=%s; fwd_cluster_span=%d; rev_cluster_span=%d" \
                    % (self.fwd_cluster.num_reads, self.rev_cluster.num_reads, cluster_pair_ID, library_name, ", ".join(fwd_TE_tags), ", ".join(rev_TE_tags), self.fwd_cluster.span, self.rev_cluster.span)

        if self.softclipped_support:
            (st, end, num_support) = self.softclipped_support[0]
        else:
            (st, end, num_support) = (-1, -1, 0)
        gff_tags += "; softclipped_pos=(%d, %d); softclipped_support=%d; het_core_reads=%d; zygosity=%.3f" % (st, end, num_support, self.num_core_reads, self.zygosity)

        cluster_pair_chrom = self.fwd_cluster.chr

    #    #DEPRECATED: the coordinates of the region that the two clusters span, ie the UNION of the two clusters' prediction intervals
    #    #now use the insertion_interval coords calculated as the intersection
    #    cluster_pair_start = fwd_cluster[0].interval_start
    #    cluster_pair_end = rev_cluster[-1].interval_end

        (insertion_interval_start, insertion_interval_end) = self.calc_insertion_interval()

        coords = "%d\t%d" % (insertion_interval_start, insertion_interval_end)

        gff_line = "\t".join([cluster_pair_chrom, "jitterbug\tTE_insertion", coords, ".", ".", ".", gff_tags ])

        return gff_line
mbosio85 commented 8 years ago

Hi,

I spoke with Elizabeth the original developer of Jitterbug and she recommends to test a couple of things before going to debug

Anyways if you can check the 1st point it would be great to confirm if there's a bug. I think you might have found an issue, even if it's not with calc_softclipped_support() :)

In Run_TE_ID_reseq_streaming.py there's a call to a function to calculate the zygosity called cluster_pair.calc_zygosity(reads) which takes care to update the softlicpped_support value. [We never experienced the problem before probably because we use this one]

... 254 self.set_my_softclipped_support(pos_list) ...

It looks to me that this function call is not present in Run_TE_ID_reseq.py so that could explain the problem.

It might be faster for you to include this snippet at line 244 in Run_TE_ID_reseq.py and see if it solves the problem,[please check the indentation if you do it ]

  for (cluster_pairs, fwd, rev, string) in all_clusters:
for cluster_pair in cluster_pairs:
    try:
    reads = insertion_regions_reads_bam.fetch(cluster_pair.get_chr(), cluster_pair.get_insertion_int_start(), cluster_pair.get_insertion_int_end())
    cluster_pair.calc_zygosity(reads)
    except:
    print "error calculating zygosity of: "
    print cluster_pair
    raise
print "Done calculating zygosity of each cluster pair"  

If that's the case I will update the code in github

choudhuryroy commented 8 years ago

I included the snippet but it gives an error:

Traceback (most recent call last): File "/cluster/home/crimjhim/programs/jitterbug/jitterbug.py", line 129, in main(sys.argv[1:]) File "/cluster/home/crimjhim/programs/jitterbug/jitterbug.py", line 125, in main args.numCPUs, args.bin_size, args.minMAPQ, generate_test_bam, args.pre_filter, args.conf_lib_stats, mem, args.min_cluster_size) File "/cluster/home/crimjhim/programs/jitterbug/Run_TE_ID_reseq.py", line 247, in run_jitterbug reads = insertion_regions_reads_bam.fetch(cluster_pair.get_chr(), cluster_pair.get_insertion_int_start(), cluster_pair.get_insertion_int_end()) NameError: global name 'insertion_regions_reads_bam' is not defined

mbosio85 commented 8 years ago

hi from today till August 11 I'll be on holiday. I'll check it when I'm back. sorry to hear that it didn't work out smoothly

cherrs Mattia

On Tue, Aug 2, 2016, 12:18 choudhuryroy notifications@github.com wrote:

I included the snippet but it gives an error:

Traceback (most recent call last): File "/cluster/home/crimjhim/programs/jitterbug/jitterbug.py", line 129, in main(sys.argv[1:]) File "/cluster/home/crimjhim/programs/jitterbug/jitterbug.py", line 125, in main args.numCPUs, args.bin_size, args.minMAPQ, generate_test_bam, args.pre_filter, args.conf_lib_stats, mem, args.min_cluster_size) File "/cluster/home/crimjhim/programs/jitterbug/Run_TE_ID_reseq.py", line 247, in run_jitterbug

reads = insertion_regions_reads_bam.fetch(cluster_pair.get_chr(), cluster_pair.get_insertion_int_start(), cluster_pair.get_insertion_int_end())

NameError: global name 'insertion_regions_reads_bam' is not defined

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/elzbth/jitterbug/issues/5#issuecomment-236863691, or mute the thread https://github.com/notifications/unsubscribe-auth/AJeQMJr605pC5DWEdulsAm1mc9OaffQCks5qbxl7gaJpZM4JMqYC .

mbosio85 commented 8 years ago

Hi,

I'm back from holidays and yep there's no way it would have worked since there is no defined insertion_regions_bam file: lines 296 , 327- 357

So this part from Run_TE_ID_streaming is in charge of generating this file:

ins_regions_bam_name = output_prefix + ".insertion_regions.reads.bam"

construct list of args

args = ["samtools", "view", "-hb", "-L", bed_file_name, "-o", ins_regions_bam_name, psorted_bamfile_name]
# open subprocess 
int_bed_reads_select = subprocess.Popen(args)
# wait till it finishes
outcode = int_bed_reads_select.wait()    
if outcode  == 0:
    print "retrieving reads overlapping bed annots successful"
    # construct list of args 
    args = ["samtools", "index", ins_regions_bam_name]
    # open subprocess 
    int_bed_reads_index = subprocess.Popen(args)
    # wait till it finishes
    outcode = int_bed_reads_index.wait()
    if outcode == 0:
        print "indexing successful"
    else:
        print "indexing failed"
else:
    command = "\t".join(args)
    print "retrieving reads overlapping bed annots failed! command: %s " % (command)
    sys.exit(1)
insertion_regions_reads_bam = pysam.Samfile(ins_regions_bam_name, mode="rb")

Can you try to add this before the snippet and see if now it runs ?

choudhuryroy commented 8 years ago

Hello Mattia,

Sorry for the delay. So I added the following lines at 244 line number in the file Run_TE_ID_reseq.py. It now produces the softclipped supports and the zygosity. Thanks a lot.

    ins_regions_bam_name = output_prefix + ".insertion_regions.reads.bam"
    args = ["samtools", "view", "-hb", "-L", bed_file_name, "-o", ins_regions_bam_name, psorted_bamfile_name]
    # open subprocess 
    int_bed_reads_select = subprocess.Popen(args)
    # wait till it finishes
    outcode = int_bed_reads_select.wait()
    if outcode  == 0:
        print "retrieving reads overlapping bed annots successful"
        # construct list of args 
        args = ["samtools", "index", ins_regions_bam_name]
        # open subprocess 
        int_bed_reads_index = subprocess.Popen(args)
        # wait till it finishes
        outcode = int_bed_reads_index.wait()
        if outcode == 0:
                print "indexing successful"
        else:
                print "indexing failed"
    else:
        command = "\t".join(args)
        print "retrieving reads overlapping bed annots failed! command: %s " % (command)
        sys.exit(1)
    insertion_regions_reads_bam = pysam.Samfile(ins_regions_bam_name, mode="rb")
    for (cluster_pairs, fwd, rev, string) in all_clusters:
        for cluster_pair in cluster_pairs:
                try:
                        reads = insertion_regions_reads_bam.fetch(cluster_pair.get_chr(), cluster_pair.get_insertion_int_start(), cluster_pair.get_insertion_int_end())
                        cluster_pair.calc_zygosity(reads)
                except:
                        print "error calculating zygosity of: "
                        print cluster_pair
                        raise
    print "Done calculating zygosity of each cluster pair"
mbosio85 commented 7 years ago

Thanks for trying Now I added the code to the current version of jitterbug . Thanks for helping to solve this issue.

Mattia