PacificBiosciences / kineticsTools

Tools for detecting DNA modifications from single molecule, real-time sequencing data
19 stars 21 forks source link

IPD not loaded in cmp.h5 #9

Closed jayhesselberth closed 9 years ago

jayhesselberth commented 9 years ago

I have a big PB data set that I am trying to run through kineticsTools.

I run pbalign and get the cmp.h5 file without errors:

2015-06-24 09:13:43,114 [INFO] pbalign version: 0.2.0.141024
2015-06-24 09:13:43,215 [INFO] BlasrService: Align reads to references using blasr.
2015-06-24 09:35:20,179 [INFO] FilterService: Filter alignments using samFilter.
2015-06-24 09:36:12,172 [INFO] OutputService: Genearte the output CMP.H5 file using samtoh5.
2015-06-24 09:37:12,820 [INFO] ForQuiverService: Sort.
2015-06-24 09:37:12,830 [INFO] SortService: Sort a cmp.h5 file using cmph5tools.py.
2015-06-24 09:37:23,926 [INFO] ForQuiverService: LoadPulses.
2015-06-24 09:37:23,935 [INFO] LoadPulsesService: Load pulses using loadPulses.
2015-06-24 09:38:39,966 [INFO] ForQuiverService: LoadChemistry.
2015-06-24 09:38:39,975 [INFO] LoadChemistryService: Load pulses using loadChemistry.py.
2015-06-24 09:38:43,011 [INFO] ForQuiverService: Repack.
2015-06-24 09:38:43,020 [INFO] RepackService: Repack a cmp.h5 file using h5repack.
2015-06-24 09:39:40,582 [INFO] Total time: 1557.47 s.

I then run ipdSummary.py, and it starts off OK:

2015-06-24 09:40:52,984 [INFO] Using Chemistry matched IPD model: /vol1/software/modules-python/python/2.7.8/lib/python2.7/site-packages/kineticsTools-0.5.1-py2.7-linux-x86_64.egg/kineticsTools/resources/P6-C4.h5
2015-06-24 09:41:23,542 [INFO] Available CPUs: 24
2015-06-24 09:41:23,543 [INFO] Requested worker processes: 12
2015-06-24 09:41:23,643 [INFO] Worker KineticWorkerProcess-1 (PID=1926) started running
2015-06-24 09:41:23,660 [INFO] Worker KineticWorkerProcess-2 (PID=1927) started running
2015-06-24 09:41:23,672 [INFO] Worker KineticWorkerProcess-3 (PID=1928) started running
2015-06-24 09:41:23,683 [INFO] Worker KineticWorkerProcess-4 (PID=1929) started running
2015-06-24 09:41:23,698 [INFO] Worker KineticWorkerProcess-5 (PID=1930) started running
2015-06-24 09:41:23,710 [INFO] Worker KineticWorkerProcess-6 (PID=1931) started running
2015-06-24 09:41:23,725 [INFO] Worker KineticWorkerProcess-7 (PID=1932) started running
2015-06-24 09:41:23,737 [INFO] Worker KineticWorkerProcess-8 (PID=1933) started running
2015-06-24 09:41:23,752 [INFO] Worker KineticWorkerProcess-9 (PID=1934) started running
2015-06-24 09:41:23,764 [INFO] Worker KineticWorkerProcess-10 (PID=1935) started running
2015-06-24 09:41:23,781 [INFO] Worker KineticWorkerProcess-11 (PID=1936) started running
2015-06-24 09:41:23,786 [INFO] Launched worker processes.
2015-06-24 09:41:23,797 [INFO] Worker KineticWorkerProcess-12 (PID=1937) started running
2015-06-24 09:41:23,809 [INFO] Launched result collector process.
2015-06-24 09:41:23,819 [INFO] Process KineticsWriter-13 (PID=1938) started running
2015-06-24 09:41:23,967 [INFO] Creating IpdRatio dataset w/ name: ref000001, Size: 1465500
2015-06-24 09:41:23,968 [INFO] Creating IpdRatio dataset w/ name: ref000002, Size: 1456112

... lot more

But then throws these errors:

Process KineticWorkerProcess-1:
Traceback (most recent call last):
  File "/vol1/software/modules-python/python/2.7.8/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/vol1/software/modules-python/python/2.7.8/lib/python2.7/site-packages/kineticsTools-0.5.1-py2.7-linux-x86_64.egg/kineticsTools/WorkerProcess.py", line 109, in run
    self._run()
  File "/vol1/software/modules-python/python/2.7.8/lib/python2.7/site-packages/kineticsTools-0.5.1-py2.7-linux-x86_64.egg/kineticsTools/WorkerProcess.py", line 88, in _run
    result = self.onChunk(datum)
  File "/vol1/software/modules-python/python/2.7.8/lib/python2.7/site-packages/kineticsTools-0.5.1-py2.7-linux-x86_64.egg/kineticsTools/KineticWorker.py", line 197, in onChunk
    result = self._summarizeReferenceRegion((start, end), self.options.methylFraction, self.options.identify)
  File "/vol1/software/modules-python/python/2.7.8/lib/python2.7/site-packages/kineticsTools-0.5.1-py2.7-linux-x86_64.egg/kineticsTools/KineticWorker.py", line 216, in _summarizeReferenceRegion
    (caseChunks, capValue) = self._fetchChunks(caseReferenceGroupId, targetBounds, self.caseCmpH5)
  File "/vol1/software/modules-python/python/2.7.8/lib/python2.7/site-packages/kineticsTools-0.5.1-py2.7-linux-x86_64.egg/kineticsTools/KineticWorker.py", line 403, in _fetchChunks
    rawIpds = self._loadRawIpds(hits, start, end, factor)
  File "/vol1/software/modules-python/python/2.7.8/lib/python2.7/site-packages/kineticsTools-0.5.1-py2.7-linux-x86_64.egg/kineticsTools/KineticWorker.py", line 434, in _loadRawIpds
    rawIpd = downsampleFrames(aln.IPD()) * factor
  File "/vol1/software/modules-python/python/2.7.8/lib/python2.7/site-packages/pbcore-1.0.0-py2.7.egg/pbcore/io/align/CmpH5IO.py", line 192, in f
    return self.pulseFeature(featureName, aligned, orientation)
  File "/vol1/software/modules-python/python/2.7.8/lib/python2.7/site-packages/pbcore-1.0.0-py2.7.egg/pbcore/io/align/CmpH5IO.py", line 549, in pulseFeature
    pulseDataset = self._alignmentGroup[featureName]
KeyError: 'IPD'

indicating to me that I am somehow failing to load the IPD data.

Questions:

Thanks in advance.

jrharting commented 9 years ago

Hi Jay,

All datasets run on the RS have ipd values. Have you tried creating the cmph5 file using the --forQuiver flag? That should load the right values.

Best, John

Sent from my iPhone

On Jun 24, 2015, at 9:07 AM, Jay Hesselberth notifications@github.com<mailto:notifications@github.com> wrote:

I have a big PB data set that I am trying to run through kineticsTools.

I run pbalign and get the cmp.h5 file without errors:

2015-06-24 09:13:43,114 [INFO] pbalign version: 0.2.0.141024 2015-06-24 09:13:43,215 [INFO] BlasrService: Align reads to references using blasr. 2015-06-24 09:35:20,179 [INFO] FilterService: Filter alignments using samFilter. 2015-06-24 09:36:12,172 [INFO] OutputService: Genearte the output CMP.H5 file using samtoh5. 2015-06-24 09:37:12,820 [INFO] ForQuiverService: Sort. 2015-06-24 09:37:12,830 [INFO] SortService: Sort a cmp.h5 file using cmph5tools.py. 2015-06-24 09:37:23,926 [INFO] ForQuiverService: LoadPulses. 2015-06-24 09:37:23,935 [INFO] LoadPulsesService: Load pulses using loadPulses. 2015-06-24 09:38:39,966 [INFO] ForQuiverService: LoadChemistry. 2015-06-24 09:38:39,975 [INFO] LoadChemistryService: Load pulses using loadChemistry.py. 2015-06-24 09:38:43,011 [INFO] ForQuiverService: Repack. 2015-06-24 09:38:43,020 [INFO] RepackService: Repack a cmp.h5 file using h5repack. 2015-06-24 09:39:40,582 [INFO] Total time: 1557.47 s.

I then run ipdSummary.py, and it starts off OK:

2015-06-24 09:40:52,984 [INFO] Using Chemistry matched IPD model: /vol1/software/modules-python/python/2.7.8/lib/python2.7/site-packages/kineticsTools-0.5.1-py2.7-linux-x86_64.egg/kineticsTools/resources/P6-C4.h5 2015-06-24 09:41:23,542 [INFO] Available CPUs: 24 2015-06-24 09:41:23,543 [INFO] Requested worker processes: 12 2015-06-24 09:41:23,643 [INFO] Worker KineticWorkerProcess-1 (PID=1926) started running 2015-06-24 09:41:23,660 [INFO] Worker KineticWorkerProcess-2 (PID=1927) started running 2015-06-24 09:41:23,672 [INFO] Worker KineticWorkerProcess-3 (PID=1928) started running 2015-06-24 09:41:23,683 [INFO] Worker KineticWorkerProcess-4 (PID=1929) started running 2015-06-24 09:41:23,698 [INFO] Worker KineticWorkerProcess-5 (PID=1930) started running 2015-06-24 09:41:23,710 [INFO] Worker KineticWorkerProcess-6 (PID=1931) started running 2015-06-24 09:41:23,725 [INFO] Worker KineticWorkerProcess-7 (PID=1932) started running 2015-06-24 09:41:23,737 [INFO] Worker KineticWorkerProcess-8 (PID=1933) started running 2015-06-24 09:41:23,752 [INFO] Worker KineticWorkerProcess-9 (PID=1934) started running 2015-06-24 09:41:23,764 [INFO] Worker KineticWorkerProcess-10 (PID=1935) started running 2015-06-24 09:41:23,781 [INFO] Worker KineticWorkerProcess-11 (PID=1936) started running 2015-06-24 09:41:23,786 [INFO] Launched worker processes. 2015-06-24 09:41:23,797 [INFO] Worker KineticWorkerProcess-12 (PID=1937) started running 2015-06-24 09:41:23,809 [INFO] Launched result collector process. 2015-06-24 09:41:23,819 [INFO] Process KineticsWriter-13 (PID=1938) started running 2015-06-24 09:41:23,967 [INFO] Creating IpdRatio dataset w/ name: ref000001, Size: 1465500 2015-06-24 09:41:23,968 [INFO] Creating IpdRatio dataset w/ name: ref000002, Size: 1456112

... lot more

But then throws these errors:

Process KineticWorkerProcess-1: Traceback (most recent call last): File "/vol1/software/modules-python/python/2.7.8/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap self.run() File "/vol1/software/modules-python/python/2.7.8/lib/python2.7/site-packages/kineticsTools-0.5.1-py2.7-linux-x86_64.egg/kineticsTools/WorkerProcess.py", line 109, in run self._run() File "/vol1/software/modules-python/python/2.7.8/lib/python2.7/site-packages/kineticsTools-0.5.1-py2.7-linux-x86_64.egg/kineticsTools/WorkerProcess.py", line 88, in _run result = self.onChunk(datum) File "/vol1/software/modules-python/python/2.7.8/lib/python2.7/site-packages/kineticsTools-0.5.1-py2.7-linux-x86_64.egg/kineticsTools/KineticWorker.py", line 197, in onChunk result = self._summarizeReferenceRegion((start, end), self.options.methylFraction, self.options.identify) File "/vol1/software/modules-python/python/2.7.8/lib/python2.7/site-packages/kineticsTools-0.5.1-py2.7-linux-x86_64.egg/kineticsTools/KineticWorker.py", line 216, in _summarizeReferenceRegion (caseChunks, capValue) = self._fetchChunks(caseReferenceGroupId, targetBounds, self.caseCmpH5) File "/vol1/software/modules-python/python/2.7.8/lib/python2.7/site-packages/kineticsTools-0.5.1-py2.7-linux-x86_64.egg/kineticsTools/KineticWorker.py", line 403, in _fetchChunks rawIpds = self._loadRawIpds(hits, start, end, factor) File "/vol1/software/modules-python/python/2.7.8/lib/python2.7/site-packages/kineticsTools-0.5.1-py2.7-linux-x86_64.egg/kineticsTools/KineticWorker.py", line 434, in _loadRawIpds rawIpd = downsampleFrames(aln.IPD()) * factor File "/vol1/software/modules-python/python/2.7.8/lib/python2.7/site-packages/pbcore-1.0.0-py2.7.egg/pbcore/io/align/CmpH5IO.py", line 192, in f return self.pulseFeature(featureName, aligned, orientation) File "/vol1/software/modules-python/python/2.7.8/lib/python2.7/site-packages/pbcore-1.0.0-py2.7.egg/pbcore/io/align/CmpH5IO.py", line 549, in pulseFeature pulseDataset = self._alignmentGroup[featureName] KeyError: 'IPD'

indicating to me that I am somehow failing to load the IPD data.

Questions:

Thanks in advance.

— Reply to this email directly or view it on GitHubhttps://github.com/PacificBiosciences/kineticsTools/issues/9.

jayhesselberth commented 9 years ago

Yes, I use the --forQuiver flag.

For reference, here is how I run pbalign:

h5file=AGv3.run1.cmp.h5
fasta=$HOME/db/AG.v3.fa
pbalign -v --nproc 12 --forQuiver run1.fofn $fasta $h5file

where run1.fofn contains:

../run1/Analysis_Results/m150428_021012_42134_c100799582550000001823174309091590_s1_p0.1.bax.h5
../run1/Analysis_Results/m150428_021012_42134_c100799582550000001823174309091590_s1_p0.2.bax.h5
../run1/Analysis_Results/m150428_021012_42134_c100799582550000001823174309091590_s1_p0.3.bax.h5

And the output of this run is:

+ h5file=AGv3.run1.cmp.h5
+ fasta=/vol1/home/jianbin/db/AG.v3.fa
+ [[ ! -f AGv3.run1.cmp.h5 ]]
+ pbalign -v --nproc 12 --forQuiver run1.fofn /vol1/home/jianbin/db/AG.v3.fa AGv3.run1.cmp.h5
2015-06-24 11:27:39,024 [INFO] pbalign version: 0.2.0.141024
2015-06-24 11:27:39,224 [INFO] BlasrService: Align reads to references using blasr.
2015-06-24 11:53:45,270 [INFO] FilterService: Filter alignments using samFilter.
2015-06-24 11:54:48,212 [INFO] OutputService: Genearte the output CMP.H5 file using samtoh5.
2015-06-24 11:56:00,043 [INFO] ForQuiverService: Sort.
2015-06-24 11:56:00,076 [INFO] SortService: Sort a cmp.h5 file using cmph5tools.py.
2015-06-24 11:56:20,979 [INFO] ForQuiverService: LoadPulses.
2015-06-24 11:56:20,992 [INFO] LoadPulsesService: Load pulses using loadPulses.
2015-06-24 11:59:19,042 [INFO] ForQuiverService: LoadChemistry.
2015-06-24 11:59:19,064 [INFO] LoadChemistryService: Load pulses using loadChemistry.py.
2015-06-24 11:59:23,529 [INFO] ForQuiverService: Repack.
2015-06-24 11:59:23,539 [INFO] RepackService: Repack a cmp.h5 file using h5repack.
2015-06-24 12:00:28,648 [INFO] Total time: 1969.62 s.

It seems to be a valid cmp.h5 file, but doesn't have the IPD group loaded:

IPython 3.2.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: from pbcore.io.align import CmpH5Reader

In [2]: h5 = CmpH5Reader('AGv3.run1.cmp.h5')

In [3]: h5.pulseFeaturesAvailable()
Out[3]: [u'DeletionTag', u'MergeQV', u'SubstitutionQV', u'InsertionQV', u'DeletionQV']

But I can see the IPD group in the lambda H5 provided in the kineticsTools test dir:

In [4]: h5_lambda = CmpH5Reader('/vol3/home/jhessel/src/kineticsTools/test/data/c2-c2-lambda-mod-decode.cmp.h5')
In [5]: h5_lambda.pulseFeaturesAvailable()
Out[5]: [u'PulseWidth', u'IPD', u'QualityValue', u'InsertionQV', u'DeletionQV']

I think it would be helpful to see how the lambda cmp.h5 files were created in the test directory. Would you happen to have cmds that created that?

Thanks

dalexander commented 9 years ago

IIRC --forQuiver does not give you the IPD. @ylipacbio, what is the correct call to make to pbalign?

lhon commented 9 years ago

The other option is to load the IPD and other values separately from pbalign using loadPulses, such as through the following:

loadChemistry.py input.fofn aligned_reads0.cmp.h5 loadPulses input.fofn aligned_reads0.cmp.h5 -metrics DeletionQV,IPD,InsertionQV,PulseWidth,QualityValue,MergeQV,SubstitutionQV,DeletionTag -byread cmph5tools.py -vv sort --deep --inPlace aligned_reads0.cmp.h5


From: David Alexander [notifications@github.com] Sent: Wednesday, June 24, 2015 2:18 PM To: PacificBiosciences/kineticsTools Subject: Re: [kineticsTools] IPD not loaded in cmp.h5 (#9)

IIRC --forQuiver does not give you the IPD. @ylipacbiohttps://github.com/ylipacbio, what is the correct call to make to pbalign?

— Reply to this email directly or view it on GitHubhttps://github.com/PacificBiosciences/kineticsTools/issues/9#issuecomment-115015650.

ylipacbio commented 9 years ago

You may use pbalign with --metrics such as

--forQuiver --metrics IPD,DeletionQV,InsertionQV,blabla,blabla

ylipacbio commented 9 years ago

You may also refer to this pbalign's wiki page: how to load additional QVs.

jayhesselberth commented 9 years ago

Great, the new wiki page clarifies this issue.