Regarding nthreads, pvalue option, alignment parser

koriege commented 8 months ago

Hi,

I really appreciate your re-implementation, since the original version caused up to ~200Gb memory usage in my case. I know, that summit is really new, but I want to report some issues I stumbled across when trying it out. Regards.

1) summit.py -> Error. nthreads cannot be larger than environment variable "NUMEXPR_MAX_THREADS" (64)

Solution via OMP_NUM_THREADS variable

2) OMP_NUM_THREADS=28 summit.py -p 0.05 -> summit.py error: argument -p/--pval: invalid int value: '0.05'

Solution: argparse type set to float

3) OMP_NUM_THREADS=28 summit.py -c <bam> -b <bam> -o <out> -w 100 -m 500 -t 100 -l 50

Traceback (most recent call last):
  File "/misc/paras/data/programs/bashbone/conda/envs/gopeaks/bin/summit.py", line 485, in <module>
    ttracks, nreads = count_tracks(bam, bins, nreads = True)
  File "/misc/paras/data/programs/bashbone/conda/envs/gopeaks/bin/summit.py", line 178, in count_tracks
    if r1.is_forward & r2.is_reverse:
AttributeError: 'pysam.libcalignedsegment.AlignedSegment' object has no attribute 'is_forward'

Do I need to sort the alignments by name?

gartician commented 8 months ago

Hi @koriege,

I appreciate the enthusiasm for the increased efficiency of Summit! To get back to your suggestions...

Since Summit was developed without parallel processing, does setting OMP_NUM_THREADS really utilize your cores?
This will be fixed, thank you for the attention to detail!
This error means that a read was not designated as forward or reverse. Perhaps consider removing unmapped or reads without a mate from the BAM file before using Summit? I typically use base pair-sorted BAM files in my analysis rather than name-sorted, but I'm not sure it matters since that code is pulling and assessing read mates.

koriege commented 8 months ago

Thanks for the quick reply.

Since Summit was developed without parallel processing, does setting OMP_NUM_THREADS really utilize your cores?

No, it does not. But one of your libraries, that uses nthreads infers number of cores, which is 128 in my case. Thus setting OMP_NUM_THREADS to anything <=64 avoids conflicts.

This error means that a read was not designated as forward or reverse. Perhaps consider removing unmapped or reads without a mate from the BAM file before using Summit?

I did. I always work with unique alignments. The only flag which was removed from the BAM in some cases, due to region blacklisting, is the "proper_pair" flag, although both mates exist (one reverse, the other mate_reverse flagged). I will re-add the proper_pair flag, try again and report back. Thanks for pointing me into this direction.

koriege commented 8 months ago

This error means that a read was not designated as forward or reverse. Perhaps consider removing unmapped or reads without a mate from the BAM file before using Summit?

After having extensively tested, it seems that your implementation does not work. The pysam object simply does not have an attribute is_forward. According to the pysam docs: is_forward - true if read is mapped to forward strand (implemented in terms of is_reverse)

see this tiny example

NB501242:295:HV7G2BGXV:4:11610:20759:10780      99      chr1    10074   9       4S27=1X4=1X3=   =       10392   345     CCCTAACCCTAACCCTAACCCTAACCCTAACACTAAGCCA        AAAAA/EEAEEAEAEEAEE<EEE/EEEE//E/EEAE//<<        MD:Z:27C4C3     NH:i:1H
I:i:0   NM:i:2  UQ:i:28 YZ:Z:0  RG:Z:A1 MQ:i:6  MC:Z:4=1D22=5S
NB501242:295:HV7G2BGXV:4:11610:20759:10780      147     chr1    10392   6       4=1D22=5S       =       10074   -345    CCTACCCCTAACCCTAACCCTAACCCTAACC E//E/EAEEAEEEEAEEEEE6EEEEEAAAAA MD:Z:4^A22      NH:i:1  HI:i:0  NM:i:1  UQ:i:0  YZ:Z:0R
G:Z:A1  MQ:i:9  MC:Z:4S27=1X4=1X3=

here r1 attributes are

['__class__', '__copy__', '__deepcopy__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__ne__', '__new__', '__pyx_vta
ble__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', 'aend', 'alen', 'aligned_pairs', 'bin', 'blocks', 'cigar', 'cigarstring', 'cigartuples', 'compare', 'flag', 'from_dict', 'fromstring', 'get_aligned_pairs', 'get_blocks', 'get_cigar_stats', 'get_forward_qualities', 'get_forward_sequence', 'get_overlap', 'get_reference_positions', 'get_reference_sequence', 'get_tag', 'get_tags', 'has_tag', 'header', '
infer_query_length', 'infer_read_length', 'inferred_length', 'is_duplicate', 'is_paired', 'is_proper_pair', 'is_qcfail', 'is_read1', 'is_read2', 'is_reverse', 'is_secondary', 'is_supplementary', 'is_unmapped', 'isize', 'mapping_quality', '
mapq', 'mate_is_reverse', 'mate_is_unmapped', 'mpos', 'mrnm', 'next_reference_id', 'next_reference_name', 'next_reference_start', 'opt', 'overlap', 'pnext', 'pos', 'positions', 'qend', 'qlen', 'qname', 'qqual', 'qstart', 'qual', 'query', '
query_alignment_end', 'query_alignment_length', 'query_alignment_qualities', 'query_alignment_sequence', 'query_alignment_start', 'query_length', 'query_name', 'query_qualities', 'query_sequence', 'reference_end', 'reference_id', 'reference_length', 'reference_name', 'reference_start', 'rlen', 'rname', 'rnext', 'seq', 'setTag', 'set_tag', 'set_tags', 'tags', 'template_length', 'tid', 'tlen', 'to_dict', 'to_string', 'tostring']

So the solution is for line 178 and following

        if not r1.is_reverse and r2.is_reverse:
            r_start, r_end = r1.reference_start, r2.reference_end
        elif r1.is_reverse and not r2.is_reverse:
            r_start, r_end = r2.reference_start, r1.reference_end
        else:
            exit(f"Read pair R1 {r1} and R2 {r2} are not complementary! Either both are forward or reverse reads!")

Btw. is_reverse attribute is of type bool, whereas the & operator is made for bitwise comparision. You may change it to and

Plase also consider to add a proper shebang line to your script.

#!/usr/bin/env python3

gartician / summit

Regarding nthreads, pvalue option, alignment parser #1