PacificBiosciences / kineticsTools

Tools for detecting DNA modifications from single molecule, real-time sequencing data
19 stars 21 forks source link

BAM issue #58

Closed weedcentipede closed 6 years ago

weedcentipede commented 6 years ago

Hey, I got the following error while I was running ipdSummary, I used samtools(v1.7) to generate the .sorted.bam file (I'm using the sorted.bam because apparently the software needs of an indexed .bam and a normal .bam cannot be indexed). In this matters, I used Minimap2 to generate the .sam file.

This is the error,

Traceback (most recent call last):
  File "/home/bio/Desktop/ASM_NGS/Pacbio/apps/kineticsTools/src/pbcommand/pbcommand/cli/core.py", line 137, in _pacbio_main_runner
    return_code = exe_main_func(*args, **kwargs)
  File "/home/bio/anaconda3/envs/Pacbio/lib/python2.7/site-packages/kineticsTools-0.6.1-py2.7-linux-x86_64.egg/kineticsTools/ipdSummary.py", line 699, in args_runner
    return kt.start()
  File "/home/bio/anaconda3/envs/Pacbio/lib/python2.7/site-packages/kineticsTools-0.6.1-py2.7-linux-x86_64.egg/kineticsTools/ipdSummary.py", line 412, in start
    return self.run()
  File "/home/bio/anaconda3/envs/Pacbio/lib/python2.7/site-packages/kineticsTools-0.6.1-py2.7-linux-x86_64.egg/kineticsTools/ipdSummary.py", line 475, in run
    ret = self._mainLoop()
  File "/home/bio/anaconda3/envs/Pacbio/lib/python2.7/site-packages/kineticsTools-0.6.1-py2.7-linux-x86_64.egg/kineticsTools/ipdSummary.py", line 603, in _mainLoop
    self.loadSharedAlignmentSet(self.args.alignment_set)
  File "/home/bio/anaconda3/envs/Pacbio/lib/python2.7/site-packages/kineticsTools-0.6.1-py2.7-linux-x86_64.egg/kineticsTools/ipdSummary.py", line 579, in loadSharedAlignmentSet
    referenceFastaFname=self.args.reference)
  File "/home/bio/Desktop/ASM_NGS/Pacbio/apps/kineticsTools/src/pbcore/pbcore/io/dataset/DataSetIO.py", line 2696, in __init__
    super(AlignmentSet, self).__init__(*files, **kwargs)
  File "/home/bio/Desktop/ASM_NGS/Pacbio/apps/kineticsTools/src/pbcore/pbcore/io/dataset/DataSetIO.py", line 1960, in __init__
    super(ReadSet, self).__init__(*files, **kwargs)
  File "/home/bio/Desktop/ASM_NGS/Pacbio/apps/kineticsTools/src/pbcore/pbcore/io/dataset/DataSetIO.py", line 450, in __init__
    self.updateCounts()
  File "/home/bio/Desktop/ASM_NGS/Pacbio/apps/kineticsTools/src/pbcore/pbcore/io/dataset/DataSetIO.py", line 2514, in updateCounts
    self.assertIndexed()
  File "/home/bio/Desktop/ASM_NGS/Pacbio/apps/kineticsTools/src/pbcore/pbcore/io/dataset/DataSetIO.py", line 2344, in assertIndexed
    self._assertIndexed((IndexedBamReader, CmpH5Reader))
  File "/home/bio/Desktop/ASM_NGS/Pacbio/apps/kineticsTools/src/pbcore/pbcore/io/dataset/DataSetIO.py", line 1917, in _assertIndexed
    self._openFiles()
  File "/home/bio/Desktop/ASM_NGS/Pacbio/apps/kineticsTools/src/pbcore/pbcore/io/dataset/DataSetIO.py", line 2041, in _openFiles
    resource = IndexedBamReader(location)
  File "/home/bio/Desktop/ASM_NGS/Pacbio/apps/kineticsTools/src/pbcore/pbcore/io/align/BamIO.py", line 366, in __init__
    super(IndexedBamReader, self).__init__(fname, referenceFastaFname)
  File "/home/bio/Desktop/ASM_NGS/Pacbio/apps/kineticsTools/src/pbcore/pbcore/io/align/BamIO.py", line 175, in __init__
    self._checkFileCompatibility()
  File "/home/bio/Desktop/ASM_NGS/Pacbio/apps/kineticsTools/src/pbcore/pbcore/io/align/BamIO.py", line 164, in _checkFileCompatibility
    checkedVersion = self.version
  File "/home/bio/Desktop/ASM_NGS/Pacbio/apps/kineticsTools/src/pbcore/pbcore/io/align/BamIO.py", line 252, in version
    return self.peer.header["HD"]["pb"]
KeyError: 'pb'
2018-06-29 19:25:45,565 [ERROR] 'pb'

Can anybody please help me, Thanks in advance, Cheers, Luis Alfonso.

weedcentipede commented 6 years ago

The previous error apparently was for using a mapper different of blasr, after mapping all the sequences with blasr, I ran again ipdSummary and get the following error:

'FRAMERATEHZ'
Traceback (most recent call last):
  File "/home/bio/Desktop/ASM_NGS/Pacbio/apps/kineticsTools/src/pbcommand/pbcommand/cli/core.py", line 137, in _pacbio_main_runner
    return_code = exe_main_func(*args, **kwargs)
  File "/home/bio/anaconda3/envs/Pacbio/lib/python2.7/site-packages/kineticsTools-0.6.1-py2.7-linux-x86_64.egg/kineticsTools/ipdSummary.py", line 699, in args_runner
    return kt.start()
  File "/home/bio/anaconda3/envs/Pacbio/lib/python2.7/site-packages/kineticsTools-0.6.1-py2.7-linux-x86_64.egg/kineticsTools/ipdSummary.py", line 412, in start
    return self.run()
  File "/home/bio/anaconda3/envs/Pacbio/lib/python2.7/site-packages/kineticsTools-0.6.1-py2.7-linux-x86_64.egg/kineticsTools/ipdSummary.py", line 475, in run
    ret = self._mainLoop()
  File "/home/bio/anaconda3/envs/Pacbio/lib/python2.7/site-packages/kineticsTools-0.6.1-py2.7-linux-x86_64.egg/kineticsTools/ipdSummary.py", line 603, in _mainLoop
    self.loadSharedAlignmentSet(self.args.alignment_set)
  File "/home/bio/anaconda3/envs/Pacbio/lib/python2.7/site-packages/kineticsTools-0.6.1-py2.7-linux-x86_64.egg/kineticsTools/ipdSummary.py", line 579, in loadSharedAlignmentSet
    referenceFastaFname=self.args.reference)
  File "/home/bio/Desktop/ASM_NGS/Pacbio/apps/kineticsTools/src/pbcore/pbcore/io/dataset/DataSetIO.py", line 2696, in __init__
    super(AlignmentSet, self).__init__(*files, **kwargs)
  File "/home/bio/Desktop/ASM_NGS/Pacbio/apps/kineticsTools/src/pbcore/pbcore/io/dataset/DataSetIO.py", line 1960, in __init__
    super(ReadSet, self).__init__(*files, **kwargs)
  File "/home/bio/Desktop/ASM_NGS/Pacbio/apps/kineticsTools/src/pbcore/pbcore/io/dataset/DataSetIO.py", line 450, in __init__
    self.updateCounts()
  File "/home/bio/Desktop/ASM_NGS/Pacbio/apps/kineticsTools/src/pbcore/pbcore/io/dataset/DataSetIO.py", line 2514, in updateCounts
    self.assertIndexed()
  File "/home/bio/Desktop/ASM_NGS/Pacbio/apps/kineticsTools/src/pbcore/pbcore/io/dataset/DataSetIO.py", line 2344, in assertIndexed
    self._assertIndexed((IndexedBamReader, CmpH5Reader))
  File "/home/bio/Desktop/ASM_NGS/Pacbio/apps/kineticsTools/src/pbcore/pbcore/io/dataset/DataSetIO.py", line 1917, in _assertIndexed
    self._openFiles()
  File "/home/bio/Desktop/ASM_NGS/Pacbio/apps/kineticsTools/src/pbcore/pbcore/io/dataset/DataSetIO.py", line 2041, in _openFiles
    resource = IndexedBamReader(location)
  File "/home/bio/Desktop/ASM_NGS/Pacbio/apps/kineticsTools/src/pbcore/pbcore/io/align/BamIO.py", line 366, in __init__
    super(IndexedBamReader, self).__init__(fname, referenceFastaFname)
  File "/home/bio/Desktop/ASM_NGS/Pacbio/apps/kineticsTools/src/pbcore/pbcore/io/align/BamIO.py", line 178, in __init__
    self._loadReadGroupInfo()
  File "/home/bio/Desktop/ASM_NGS/Pacbio/apps/kineticsTools/src/pbcore/pbcore/io/align/BamIO.py", line 92, in _loadReadGroupInfo
    rgFrameRate = ds["FRAMERATEHZ"]
KeyError: 'FRAMERATEHZ'

If anyone knows why this is happening, I would appreciate your help, Thanks in advance, Luis Alfonso

natechols commented 6 years ago

What is the output of running samtools view -H on your input BAM file?

weedcentipede commented 6 years ago

Hi Nate, this is the output for samtools view -H

(Pacbio) root@biocomp04:~/Desktop/ASM_NGS/Pacbio# samtools view -H aln_blasr.sorted.bam
@HD VN:1.5  SO:coordinate   pb:3.0.1
@SQ SN:contig1  LN:4012368  M5:8a7a4ce43d27b52117e19c93a0704880
@RG ID:6a355473 PL:PACBIO   DS:READTYPE=SUBREAD;DeletionQV=dq;DeletionTag=dt;InsertionQV=iq;MergeQV=mq;SubstitutionQV=sq;BINDINGKIT=100372700;SEQUENCINGKIT=100612400;BASECALLERVERSION=2.3 PU:m171222_144738_42244_c101415012550000001823309602281861_s1_p0    PM:SEQUEL
@PG ID:1    PN:BLASR    VN:5.3.2    CL:blasr --bam --out aln_blasr.bam --nproc 32 ./RawData/Analysis_Results/m171222_144738_42244_c101415012550000001823309602281861_s1_p0.bas.h5 ./ref.fasta 
(Pacbio) root@biocomp04:~/Desktop/ASM_NGS/Pacbio# samtools view -H aln_blasr.bam
@HD VN:1.5  SO:UNKNOWN  pb:3.0.1
@SQ SN:contig1  LN:4012368  M5:8a7a4ce43d27b52117e19c93a0704880
@RG ID:6a355473 PL:PACBIO   DS:READTYPE=SUBREAD;DeletionQV=dq;DeletionTag=dt;InsertionQV=iq;MergeQV=mq;SubstitutionQV=sq;BINDINGKIT=100372700;SEQUENCINGKIT=100612400;BASECALLERVERSION=2.3 PU:m171222_144738_42244_c101415012550000001823309602281861_s1_p0    PM:SEQUEL
@PG ID:1    PN:BLASR    VN:5.3.2    CL:blasr --bam --out aln_blasr.bam --nproc 32 ./RawData/Analysis_Results/m171222_144738_42244_c101415012550000001823309602281861_s1_p0.bas.h5 ./ref.fasta 

Is everything OK?

weedcentipede commented 6 years ago
@HD VN:1.5  SO:UNKNOWN  pb:3.0.1
@SQ SN:contig1  LN:4012368  M5:8a7a4ce43d27b52117e19c93a0704880
@RG ID:6a355473 PL:PACBIO   DS:READTYPE=SUBREAD;DeletionQV=dq;DeletionTag=dt;InsertionQV=iq;MergeQV=mq;SubstitutionQV=sq;Ipd:CodecV1=ip;PulseWidth:CodecV1=pw;BINDINGKIT=100372700;SEQUENCINGKIT=100612400;BASECALLERVERSION=2.3.0.4.187002;FRAMERATEHZ=75.000000   PU:m171222_144738_42244_c101415012550000001823309602281861_s1_p0    PM:RS
@PG ID:2    PN:BLASR    VN:5.3.2    CL:blasr --bam --out baxxxxreads.bam --nproc 32 ./baxxxreads.bam ./ref.fasta 
@PG ID:bax2bam-0.0.8    PN:bax2bam  VN:0.0.8    DS:bax2bam converts the legacy PacBio basecall format (bax.h5) into the BAM basecall format.    CL:bax2bam RawData/Analysis_Results/m171222_144738_42244_c101415012550000001823309602281861_s1_p0.1.bax.h5 RawData/Analysis_Results/m171222_144738_42244_c101415012550000001823309602281861_s1_p0.2.bax.h5 RawData/Analysis_Results/m171222_144738_42244_c101415012550000001823309602281861_s1_p0.3.bax.h5 

Is this the .bam that I need for the analysis?

natechols commented 6 years ago

That looks correct, yes. I'm surprised that Blasr can output a BAM from a bax.h5 input without complaining, but obviously it's not doing a complete conversion like bax2bam is.

weedcentipede commented 6 years ago

After sorting and indexing that .bam

I got this:

Process KineticWorkerProcess-1:
Traceback (most recent call last):
  File "/home/bio/anaconda3/envs/Pacbio/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
    self.run()
  File "/home/bio/anaconda3/envs/Pacbio/lib/python2.7/site-packages/kineticsTools-0.6.1-py2.7-linux-x86_64.egg/kineticsTools/WorkerProcess.py", line 179, in run
    self._run()
  File "/home/bio/anaconda3/envs/Pacbio/lib/python2.7/site-packages/kineticsTools-0.6.1-py2.7-linux-x86_64.egg/kineticsTools/WorkerProcess.py", line 158, in _run
    result = self.onChunk(datum)
  File "/home/bio/anaconda3/envs/Pacbio/lib/python2.7/site-packages/kineticsTools-0.6.1-py2.7-linux-x86_64.egg/kineticsTools/KineticWorker.py", line 113, in onChunk
    perSiteResults = self._summarizeReferenceRegion((padStart, padEnd), self.options.methylFraction, self.options.identify)
  File "/home/bio/anaconda3/envs/Pacbio/lib/python2.7/site-packages/kineticsTools-0.6.1-py2.7-linux-x86_64.egg/kineticsTools/KineticWorker.py", line 207, in _summarizeReferenceRegion
    (caseChunks, capValue) = self._fetchChunks(caseReferenceGroupId, targetBounds, self.caseCmpH5)
  File "/home/bio/anaconda3/envs/Pacbio/lib/python2.7/site-packages/kineticsTools-0.6.1-py2.7-linux-x86_64.egg/kineticsTools/KineticWorker.py", line 372, in _fetchChunks
    max(start, 0), end)
  File "/home/bio/anaconda3/envs/Pacbio/lib/python2.7/site-packages/pbcore-1.5.0-py2.7.egg/pbcore/io/dataset/DataSetIO.py", line 3482, in <genexpr>
    return (read for read in self._readsInRange(refName, start, end,
  File "/home/bio/anaconda3/envs/Pacbio/lib/python2.7/site-packages/pbcore-1.5.0-py2.7.egg/pbcore/io/dataset/DataSetIO.py", line 148, in wrapper
    for read in generator(dset, *args, **kwargs):
  File "/home/bio/anaconda3/envs/Pacbio/lib/python2.7/site-packages/pbcore-1.5.0-py2.7.egg/pbcore/io/dataset/DataSetIO.py", line 3493, in _readsInRange
    sampleSize=sampleSize):
  File "/home/bio/anaconda3/envs/Pacbio/lib/python2.7/site-packages/pbcore-1.5.0-py2.7.egg/pbcore/io/dataset/DataSetIO.py", line 3334, in _pbiReadsInRange
    passes = self._indexReadsInRange(refName, start, end, justIndices=True)
  File "/home/bio/anaconda3/envs/Pacbio/lib/python2.7/site-packages/pbcore-1.5.0-py2.7.egg/pbcore/io/dataset/DataSetIO.py", line 3306, in _indexReadsInRange
    (self.index.tStart < end) &
  File "/home/bio/anaconda3/envs/Pacbio/lib/python2.7/site-packages/numpy/core/records.py", line 452, in __getattribute__
    raise AttributeError("recarray has no attribute %s" % attr)
AttributeError: recarray has no attribute tId
Child process exited with exitcode=1.  Aborting.
2018-06-30 01:54:36,412 [ERROR] Child process exited with exitcode=1.  Aborting.

Any idea of what it's happening? Thanks!

PS, in that directory I kept the .bam.pdi file

weedcentipede commented 6 years ago

Apparently after indexing that .bam with pbindex all worked... thanks @natechols for the guidance. PS, Is there a way to run it in multithread or MPI?

Cheers

rhallPB commented 6 years ago

Can you post the ipdSummary command used, some of the output options are not valid without a very specific path to generate the input. Try specifying --gff <out gff> --csv <out csv> instead of the generic --out parameter.

The --numWorkers parameter controls the number of threads used. For splitting into bigger chunks, you will need to split the aligned bam into chunks.

jgonzalez10 commented 5 years ago

Hi, I'm having trouble with ipdSummary too. The error I'm getting it the following:

Traceback (most recent call last): File "/usr/lib/python2.7/dist-packages/pbcommand/cli/core.py", line 136, in _pacbio_main_runner return_code = exe_main_func(*args, kwargs) File "/usr/lib/python2.7/dist-packages/kineticsTools/ipdSummary.py", line 719, in args_runner return kt.start() File "/usr/lib/python2.7/dist-packages/kineticsTools/ipdSummary.py", line 393, in start return self.run() File "/usr/lib/python2.7/dist-packages/kineticsTools/ipdSummary.py", line 455, in run ret = self._mainLoop() File "/usr/lib/python2.7/dist-packages/kineticsTools/ipdSummary.py", line 626, in _mainLoop self.loadSharedAlignmentSet(self.args.alignment_set) File "/usr/lib/python2.7/dist-packages/kineticsTools/ipdSummary.py", line 602, in loadSharedAlignmentSet referenceFastaFname=self.args.reference) File "/usr/lib/python2.7/dist-packages/pbcore/io/dataset/DataSetIO.py", line 2551, in init super(AlignmentSet, self).init(*files, *kwargs) File "/usr/lib/python2.7/dist-packages/pbcore/io/dataset/DataSetIO.py", line 1876, in init super(ReadSet, self).init(files, kwargs) File "/usr/lib/python2.7/dist-packages/pbcore/io/dataset/DataSetIO.py", line 470, in init self.updateCounts() File "/usr/lib/python2.7/dist-packages/pbcore/io/dataset/DataSetIO.py", line 2392, in updateCounts self.assertIndexed() File "/usr/lib/python2.7/dist-packages/pbcore/io/dataset/DataSetIO.py", line 2193, in assertIndexed self._assertIndexed((IndexedBamReader, CmpH5Reader)) File "/usr/lib/python2.7/dist-packages/pbcore/io/dataset/DataSetIO.py", line 1833, in _assertIndexed self._openFiles() File "/usr/lib/python2.7/dist-packages/pbcore/io/dataset/DataSetIO.py", line 1953, in _openFiles resource = IndexedBamReader(location) File "/usr/lib/python2.7/dist-packages/pbcore/io/align/BamIO.py", line 385, in init super(IndexedBamReader, self).init(fname, referenceFastaFname) File "/usr/lib/python2.7/dist-packages/pbcore/io/align/BamIO.py", line 198, in init self._loadReferenceInfo() File "/usr/lib/python2.7/dist-packages/pbcore/io/align/BamIO.py", line 73, in _loadReferenceInfo refMD5s = [r["M5"] for r in refRecords] KeyError: 'M5' 2019-05-15 15:55:15,219 [ERROR] 'M5' Traceback (most recent call last): File "/usr/lib/python2.7/dist-packages/pbcommand/cli/core.py", line 136, in _pacbio_main_runner return_code = exe_main_func(*args, kwargs) File "/usr/lib/python2.7/dist-packages/kineticsTools/ipdSummary.py", line 719, in args_runner return kt.start() File "/usr/lib/python2.7/dist-packages/kineticsTools/ipdSummary.py", line 393, in start return self.run() File "/usr/lib/python2.7/dist-packages/kineticsTools/ipdSummary.py", line 455, in run ret = self._mainLoop() File "/usr/lib/python2.7/dist-packages/kineticsTools/ipdSummary.py", line 626, in _mainLoop self.loadSharedAlignmentSet(self.args.alignment_set) File "/usr/lib/python2.7/dist-packages/kineticsTools/ipdSummary.py", line 602, in loadSharedAlignmentSet referenceFastaFname=self.args.reference) File "/usr/lib/python2.7/dist-packages/pbcore/io/dataset/DataSetIO.py", line 2551, in init super(AlignmentSet, self).init(*files, *kwargs) File "/usr/lib/python2.7/dist-packages/pbcore/io/dataset/DataSetIO.py", line 1876, in init super(ReadSet, self).init(files, kwargs) File "/usr/lib/python2.7/dist-packages/pbcore/io/dataset/DataSetIO.py", line 470, in init self.updateCounts() File "/usr/lib/python2.7/dist-packages/pbcore/io/dataset/DataSetIO.py", line 2392, in updateCounts self.assertIndexed() File "/usr/lib/python2.7/dist-packages/pbcore/io/dataset/DataSetIO.py", line 2193, in assertIndexed self._assertIndexed((IndexedBamReader, CmpH5Reader)) File "/usr/lib/python2.7/dist-packages/pbcore/io/dataset/DataSetIO.py", line 1833, in _assertIndexed self._openFiles() File "/usr/lib/python2.7/dist-packages/pbcore/io/dataset/DataSetIO.py", line 1953, in _openFiles resource = IndexedBamReader(location) File "/usr/lib/python2.7/dist-packages/pbcore/io/align/BamIO.py", line 385, in init super(IndexedBamReader, self).init(fname, referenceFastaFname) File "/usr/lib/python2.7/dist-packages/pbcore/io/align/BamIO.py", line 198, in init self._loadReferenceInfo() File "/usr/lib/python2.7/dist-packages/pbcore/io/align/BamIO.py", line 73, in _loadReferenceInfo refMD5s = [r["M5"] for r in refRecords] KeyError: 'M5'

Does somebody know what's happening??? Heeelp please!!! Thanks!!!!

weedcentipede commented 5 years ago

Without further information of the dataset or command lines, I would say that your issue might be raised by an uncompatible .bam, I would recommend you to open a new issue, Cheers, Luis Alfonso.

PD, this is the pipeline (more or less) I used a year ago haha:


bax2bam ./m171222_102849_42244_c101415012550000001823309602281860_s1_p0.{1,2,3}.bax.h5

I concatenated the output of bax2bam into baxxxreads.bam and I did the following:

blasr --bam --out baxxxxreads.bam --nproc 16 ./baxxxreads.bam ./ref.fasta                                                                                                                         
samtools sort -o baxxreads.sorted.bam  baxxxxreads.bam                                                                                                                                               
samtools index  baxxreads.sorted.bam                                                                                                                                                                
pbindex baxxreads.sorted.bam                                                                                                                                                                                         
ipdSummary baxxreads.sorted.bam --reference ref.fasta --identify m6A,m4C,m5C_TET --numWorkers 16 --gff basemod.gff 

Good luck, Feel free to email.