caleblareau / mgatk

mgatk: mitochondrial genome analysis toolkit
http://caleblareau.github.io/mgatk
MIT License
98 stars 25 forks source link

sumstatsBP_overlap.py failed to handle paired-end reads with unequal R1 and R2 lengths #85

Open Qirong-Lin opened 5 months ago

Qirong-Lin commented 5 months ago

Describe the bug Hi Caleb, thanks for developing this amazing tool! I'm running a atac-seq dataset with unequal R1 and R2 read lengths due as I trimmed the low-quality area of R1 after sequencing. It throws out the error after running:

Traceback (most recent call ast):
File"/home/qirong/anaconda3/envs/mgatk py38/ib/python3.8/site-packages/mgatk/bin/python/sumstatsBP overlap.py",line 107,in <module>
fwd overlap use idx = np.where(fwd overlap quality > rev overlap quality)[0]ValueError: operands could not be broadcast together with shapes (53,) (52,)

After checking the code, this is because R1 covers the entire sequence or R2, or vice versa, which perhaps causes the error when extracting the overlapped area. i.e. R1 contains R1 only, overlapped region, R1 only area, while the entire R2 is overlapped with R1.

Additional context I modified the sumstatsBP_overlap.py by adding codes to detect whether such overlap happens and extract the overlapped regions and exclusive regions by R1_only_5prime, overlapped, R1_only_3prime, or vice versa. sumstatsBP_overlap.zip I'm not 100% sure whether this is the proper way and looking forward to discussing more about this.

Best, Qirong