adamewing / tldr

Identify and annotate TE-mediated insertions in long-read sequence data
MIT License

"NoneType" TypeError when clustering reads #27

Open malonematt opened 2 years ago

malonematt commented 2 years ago

Hi Adam,

Thanks for all of the help you've given me using your software.

I ran into the following error:

```
2022-05-24 14:49:35,184 tldr started with command: /home1/malonema/.local/bin/tldr -b bams/OF1_sorted_mappings.bam -r resources/Masked_Genome_061021.fa -e none -p 20 -o results/OF1.tldr --detail_output --extend_consensus 2000
2022-05-24 14:49:35,184 output basename: results/OF1.tldr
2022-05-24 14:49:35,636 "None" passed to -e/--elts, running without TE reference
2022-05-24 14:49:36,409 writing clusters to results/OF1.tldr/JAAVVJ010000099.1.pickle
2022-05-24 14:49:37,158 writing clusters to results/OF1.tldr/JAAVVJ010009971.1.pickle
2022-05-24 14:49:39,252 writing clusters to results/OF1.tldr/CM025019.1.pickle
...
2022-05-24 14:52:07,881 writing clusters to results/OF1.tldr/JAAVVJ010009963.1.pickle
2022-05-24 14:52:08,399 loaded 504 clusters from results/OF1.tldr/CM025008.1.pickle
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/spack/apps2/linux-centos7-x86_64/gcc-11.2.0/python-3.9.6-5amy32qig2nbj7ti7ehht3y2vbmdc2j7/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home1/malonema/.local/bin/tldr", line 1525, in process_cluster
    qual = qual[::-1]
TypeError: 'NoneType' object is not subscriptable
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home1/malonema/.local/bin/tldr", line 2128, in <module>
    main(args)
  File "/home1/malonema/.local/bin/tldr", line 1907, in main
    processed_clusters.append(res.get())
  File "/spack/apps2/linux-centos7-x86_64/gcc-11.2.0/python-3.9.6-5amy32qig2nbj7ti7ehht3y2vbmdc2j7/lib/python3.9/multiprocessing/pool.py", line 771, in get
    raise self._value
TypeError: 'NoneType' object is not subscriptable
```

Any idea what might be causing this?

malonematt commented 2 years ago

From what I can tell, this might be happening because on line 1521 `qual` is set to `read.qual`:

```python
for read in bam.fetch(cluster.chrom(), out_start, out_end):
    if not read.is_secondary and not read.is_supplementary:
        seq = read.seq
        qual = read.qual

        if read.is_reverse:
            seq = rc(seq)
            qual = qual[::-1]
```

The only other mention of `read.qual` is on line 1650, when `ins_read` is being defined:

```python
ins_read = InsRead(bam.filename.decode(), read.reference_name, q_start, q_end, r_start, r_end, read.qname, read.seq, read.qual, read.mapq, is_ins, is_clip, clip_end, phase)
```

I'm still very new to Python. Is this issue caused because `read.qual` has not been defined yet?

malonematt commented 2 years ago

I checked my BAM with `samtools view file.bam`, and the reads have quality scores (or at least some do; I haven't checked whether every aligned read does yet). Could it be that some of the reads don't have quality scores?
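
In the meantime, here's a rough check I put together to count primary alignments missing a sequence or quality string (a minimal sketch assuming pysam; the BAM path is a placeholder):

```python
import pysam

# Minimal check: flag primary alignments with no sequence or base qualities.
# The path below is a placeholder for the real BAM.
bam = pysam.AlignmentFile('bams/OF1_sorted_mappings.bam')

missing = 0
for read in bam.fetch(until_eof=True):
    if read.is_secondary or read.is_supplementary:
        continue
    if read.query_sequence is None or read.query_qualities is None:
        missing += 1
        print(read.query_name)

print('%d reads missing seq and/or qual' % missing)
```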

adamewing commented 2 years ago

Hi, sorry for the delay. It's likely one or more read alignment records are missing quality scores (and sequences), as I've seen this come up in other software with minimap2 .bams.

I've pushed a fix that will skip the offending alignments at that point and complain about it a bit so you can track down the read if you like: ae3cdb8

It's possible you'll hit this elsewhere in the code though so let me know if it comes up again.

Regarding your question about read.qual: that's set by pysam when the read is parsed into its AlignedSegment class (if I have the name right).
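
The guard is roughly along these lines (a sketch, not the exact committed code; `rc` and the surrounding names are from the block you quoted, and the logging call is a paraphrase):

```python
# Sketch of the guard, not the exact committed code (see ae3cdb8).
# 'logger' and 'rc' are assumed from the surrounding tldr code.
for read in bam.fetch(cluster.chrom(), out_start, out_end):
    if read.is_secondary or read.is_supplementary:
        continue

    if read.seq is None or read.qual is None:
        logger.info('skipped a read without seq/qual: %s' % read.query_name)
        continue

    seq = read.seq
    qual = read.qual

    if read.is_reverse:
        seq = rc(seq)      # reverse-complement the sequence
        qual = qual[::-1]  # reverse the quality string to match
```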

malonematt commented 2 years ago

I just ran it after double-checking that the code was updated with your fix and it still threw the same error:

```
2022-05-25 18:14:48,448 loaded 504 clusters from results/OF1.tldr/CM025008.1.pickle
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/spack/apps2/linux-centos7-x86_64/gcc-11.2.0/python-3.9.6-5amy32qig2nbj7ti7ehht3y2vbmdc2j7/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home1/malonema/.local/bin/tldr", line 1525, in process_cluster
    qual = qual[::-1]
TypeError: 'NoneType' object is not subscriptable
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home1/malonema/.local/bin/tldr", line 2128, in <module>
    main(args)
  File "/home1/malonema/.local/bin/tldr", line 1907, in main
    processed_clusters.append(res.get())
  File "/spack/apps2/linux-centos7-x86_64/gcc-11.2.0/python-3.9.6-5amy32qig2nbj7ti7ehht3y2vbmdc2j7/lib/python3.9/multiprocessing/pool.py", line 771, in get
    raise self._value
TypeError: 'NoneType' object is not subscriptable
```

So I guess it wasn't that problem?

malonematt commented 2 years ago

Although, now that I'm going through it, that `qual = qual[::-1]` line isn't on line 1525 in the updated code... it's on 1529.

So maybe it's just still running the old tldr.

malonematt commented 2 years ago

Yup, the one in my actual conda directory didn't update, which I find confusing since I did a fresh install. I'll just update it by hand and let you know if that fixes the problem.

malonematt commented 2 years ago

So that fix resolved the original error, but now it generates a different one:

```
2022-05-26 02:48:27,604 skipped a read without seq/qual: ed5106b4-53fa-4f23-85ed-f1720a965b0d
2022-05-26 02:48:27,604 skipped a read without seq/qual: dbfb5709-60e4-4cbd-abfc-0971901cbcaf
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/spack/apps2/linux-centos7-x86_64/gcc-11.2.0/python-3.9.6-5amy32qig2nbj7ti7ehht3y2vbmdc2j7/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home1/malonema/.local/bin/tldr", line 1536, in process_cluster
    cluster.spanning_non_supporting_reads(int(args.wiggle), int(args.min_te_len))
  File "/home1/malonema/.local/bin/tldr", line 399, in spanning_non_supporting_reads
    for r in bam.fetch(self.chrom(), te_ins_start, te_ins_end):
  File "pysam/libcalignmentfile.pyx", line 1091, in pysam.libcalignmentfile.AlignmentFile.fetch
  File "pysam/libchtslib.pyx", line 690, in pysam.libchtslib.HTSFile.parse_region
ValueError: start out of range (-271)
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home1/malonema/.local/bin/tldr", line 2132, in <module>
    main(args)
  File "/home1/malonema/.local/bin/tldr", line 1911, in main
    processed_clusters.append(res.get())
  File "/spack/apps2/linux-centos7-x86_64/gcc-11.2.0/python-3.9.6-5amy32qig2nbj7ti7ehht3y2vbmdc2j7/lib/python3.9/multiprocessing/pool.py", line 771, in get
    raise self._value
ValueError: start out of range (-271)
```

It looks like these are unrelated, so let me know if you'd like me to open a separate issue for it.

malonematt commented 2 years ago

I found a somewhat similar issue here: https://github.com/ComputationalSystemsBiology/ExoProfiler/issues/6

In that one, a region near the start of a chromosome is extended past position zero, resulting in a negative coordinate, which upsets pysam.

Do you think --extend_consensus could be causing the same problem?
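
For what it's worth, pysam rejects negative fetch starts directly, so the failure mode is easy to reproduce in isolation (a minimal sketch; the BAM path and contig are placeholders):

```python
import pysam

# Reproduce the failure mode: fetch with a negative start coordinate.
# The path and contig name are placeholders.
bam = pysam.AlignmentFile('bams/OF1_sorted_mappings.bam')

try:
    for r in bam.fetch('CM025008.1', -271, 2000):
        pass
except ValueError as e:
    print(e)  # start out of range (-271)
```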

malonematt commented 2 years ago

Looks like you ran into this problem in https://github.com/adamewing/tldr/issues/8. I'll go look at your fix to see if it helps me figure things out.

malonematt commented 2 years ago

I did try another run without `--extend_consensus` (but still with `--detail_output`) and it threw the same error.

malonematt commented 2 years ago

I think what's happening is in this section of code:

```python
for bampath in set([read.bampath for read in self.reads if read.useable]):
    bamname = '.'.join(os.path.basename(bampath).split('.')[:-1])
    bam = pysam.AlignmentFile(bampath)

    te_ins_start = int(self.breakpoints[0])
    te_ins_end   = int(self.breakpoints[1])

    for r in bam.fetch(self.chrom(), te_ins_start, te_ins_end):
        if r.is_secondary or r.is_supplementary:
            continue
```

where I think `te_ins_start` is getting assigned that negative number. Just a guess.
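
If that's right, clamping the window to the contig bounds before the fetch should avoid the ValueError. A minimal sketch of the idea (not tldr's actual fix; `get_reference_length` is pysam's lookup for a contig's length):

```python
# Sketch: clamp the fetch window to valid coordinates before fetching.
# Not tldr's actual fix, just the general idea.
chrom_len = bam.get_reference_length(self.chrom())

te_ins_start = max(0, int(self.breakpoints[0]))
te_ins_end   = min(int(self.breakpoints[1]), chrom_len)

for r in bam.fetch(self.chrom(), te_ins_start, te_ins_end):
    if r.is_secondary or r.is_supplementary:
        continue
```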