Open voyn opened 1 year ago
Interesting. Can you try running each step of longbow individually to see where exactly the error is happening (you can specify the output to go to a file rather than stdout)? That will help get to the bottom of this. Also, were you running this on a local disk or a distributed filesystem (e.g. nfs)?
As for speed, right now it does take a while to run. You can try running with an explicit number of threads set (-t
) - it should be using most of them by default, though. On my todo list is to test with a newer version of pomegranate
(torchegranate
) that is supposed to be significantly faster.
It's running on local disk. Based on my troubleshooting below, it occurs when piping the input into either longbow filter or longbow extract. Thank you!
Annotate alone works:
(venv) root@7b52f75fcedf:/home/ccs# longbow annotate -m mas_15+sc_10x5p mas15_test_input.bam > test_annotate.bam
[INFO 2023-03-29 21:22:45 annotate] Invoked via: longbow annotate -m mas_15+sc_10x5p mas15_test_input.bam
[INFO 2023-03-29 21:22:45 annotate] Running with 5 worker subprocess(es)
[INFO 2023-03-29 21:22:47 annotate] Using mas_15+sc_10x5p: 15-element MAS-ISO-seq array, single-cell 10x 5' kit
[INFO 2023-03-29 21:22:47 annotate] Annotating 8 reads
Progress: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:05<00:00, 1.38 read/s]
[INFO 2023-03-29 21:22:53 annotate] Annotated 8 reads with 895 total sections.
[INFO 2023-03-29 21:22:53 annotate] Done. Elapsed time: 7.78s. Overall processing rate: 1.03 reads/s.
filter alone works:
(venv) root@7b52f75fcedf:/home/ccs# longbow filter -m mas_15+sc_10x5p test_annotate.bam -o test_filter2.bam
[INFO 2023-03-29 21:27:38 filter] Invoked via: longbow filter -m mas_15+sc_10x5p test_annotate.bam -o test_filter2.bam
[E::idx_find_and_load] Could not retrieve index file for 'test_annotate.bam'
[INFO 2023-03-29 21:27:40 filter] Using mas_15+sc_10x5p: 15-element MAS-ISO-seq array, single-cell 10x 5' kit
[INFO 2023-03-29 21:27:40 filter] Writing reads that conform to the model to: test_filter2.bam
[INFO 2023-03-29 21:27:40 filter] Writing reads that do not conform to the model to: /dev/null
[INFO 2023-03-29 21:27:40 filter] Filtering according to mas_15+sc_10x5p model ordered key adapters: A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P
Progress: 8 read [00:00, 325.81 read/s]
[INFO 2023-03-29 21:27:40 filter] Done. Elapsed time: 1.8968s.
[INFO 2023-03-29 21:27:40 filter] Total Reads Processed: 8
[INFO 2023-03-29 21:27:40 filter] # Reads Passing Model Filter: 8/8 (100.0000%)
[INFO 2023-03-29 21:27:40 filter] # Reads Failing Model Filter: 0/8 (0.0000%)
[INFO 2023-03-29 21:27:40 filter] Total # correctly ordered key adapters in passing reads: 110
[INFO 2023-03-29 21:27:40 filter] Total # correctly ordered key adapters in failing reads: 0
[INFO 2023-03-29 21:27:40 filter] Avg # correctly ordered key adapters per passing read: 13.7500 [16]
[INFO 2023-03-29 21:27:40 filter] Avg # correctly ordered key adapters per failing read: 0.0000 [16]
Piping annotate into filter fails:
(venv) root@7b52f75fcedf:/home/ccs# longbow annotate -m mas_15+sc_10x5p mas15_test_input.bam | longbow filter > test_filter.bam
[INFO 2023-03-29 21:24:02 annotate] Invoked via: longbow annotate -m mas_15+sc_10x5p mas15_test_input.bam
[INFO 2023-03-29 21:24:02 annotate] Running with 5 worker subprocess(es)
[INFO 2023-03-29 21:24:02 filter] Invoked via: longbow filter
Traceback (most recent call last):
File "/longbow/venv/bin/longbow", line 11, in <module>
load_entry_point('maslongbow', 'console_scripts', 'longbow')()
File "/longbow/venv/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/longbow/venv/lib/python3.7/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/longbow/venv/lib/python3.7/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/longbow/venv/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/longbow/venv/lib/python3.7/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/longbow/src/longbow/commands/filter.py", line 93, in main
lb_model = bam_utils.load_model(model, input_bam)
File "/longbow/src/longbow/utils/bam_utils.py", line 235, in load_model
with pysam.AlignmentFile(input_bam_path, "rb", check_sq=False, require_index=False) as bam_file:
File "pysam/libcalignmentfile.pyx", line 751, in pysam.libcalignmentfile.AlignmentFile.__cinit__
File "pysam/libcalignmentfile.pyx", line 950, in pysam.libcalignmentfile.AlignmentFile._open
FileNotFoundError: [Errno 2] could not open alignment file `<stdin>`: No such file or directory
[INFO 2023-03-29 21:24:04 annotate] Using mas_15+sc_10x5p: 15-element MAS-ISO-seq array, single-cell 10x 5' kit
[INFO 2023-03-29 21:24:04 annotate] Annotating 8 reads
Progress: 38%|█████████████████████████████████████████████████▏ | 3/8 [00:03<00:05, 1.16s/ read]
Process Process-7:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/usr/local/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/longbow/src/longbow/commands/annotate.py", line 277, in _write_thread_fn
bam_utils.write_annotated_read(read, ppath, is_rc, logp, lb_model, out_bam_file)
File "/longbow/src/longbow/utils/bam_utils.py", line 380, in write_annotated_read
out_bam_file.write(read)
File "pysam/libcalignmentfile.pyx", line 1709, in pysam.libcalignmentfile.AlignmentFile.write
File "pysam/libcalignmentfile.pyx", line 1741, in pysam.libcalignmentfile.AlignmentFile.write
OSError: sam_write1 failed with error code -1
[INFO 2023-03-29 21:24:10 annotate] Annotated 3 reads with 324 total sections.
[INFO 2023-03-29 21:24:10 annotate] Done. Elapsed time: 7.83s. Overall processing rate: 0.38 reads/s.
segment alone works:
(venv) root@7b52f75fcedf:/home/ccs# longbow segment -m mas_15+sc_10x5p test_filter2.bam -o test_segment.bam
[INFO 2023-03-29 21:31:02 segment] Invoked via: longbow segment -m mas_15+sc_10x5p test_filter2.bam -o test_segment.bam
[INFO 2023-03-29 21:31:02 segment] Running with 5 worker subprocess(es)
[INFO 2023-03-29 21:31:02 segment] Using simple splitting mode.
[E::idx_find_and_load] Could not retrieve index file for 'test_filter2.bam'
[INFO 2023-03-29 21:31:04 segment] Using mas_15+sc_10x5p: 15-element MAS-ISO-seq array, single-cell 10x 5' kit
Progress: 0 read [00:00, ? read/s][INFO 2023-03-29 21:31:04 segment] Model has UMI annotation.
[INFO 2023-03-29 21:31:04 segment] Model has Cell Barcode annotation.
[WARNING 2023-03-29 21:31:04 segment] Read m64020_201213_022403/1000000023003/ccs has no CBC. Ignoring.
[INFO 2023-03-29 21:31:05 segment] Segmented 8 reads with 113 total segments.
[INFO 2023-03-29 21:31:05 segment] MAS-seq gain factor: 14.12x
[INFO 2023-03-29 21:31:05 segment] Done. Elapsed time: 2.16s.
Piping filter into segment works:
(venv) root@7b52f75fcedf:/home/ccs# longbow filter -m mas_15+sc_10x5p test_annotate.bam | longbow segment -m mas_15+sc_10x5p > test_segment2.bam
[INFO 2023-03-29 21:34:27 filter] Invoked via: longbow filter -m mas_15+sc_10x5p test_annotate.bam
[E::idx_find_and_load] Could not retrieve index file for 'test_annotate.bam'
[INFO 2023-03-29 21:34:28 segment] Invoked via: longbow segment -m mas_15+sc_10x5p
[INFO 2023-03-29 21:34:28 segment] Running with 5 worker subprocess(es)
[INFO 2023-03-29 21:34:28 segment] Using simple splitting mode.
[INFO 2023-03-29 21:34:29 filter] Using mas_15+sc_10x5p: 15-element MAS-ISO-seq array, single-cell 10x 5' kit
[INFO 2023-03-29 21:34:29 filter] Writing reads that conform to the model to: -
[INFO 2023-03-29 21:34:29 filter] Writing reads that do not conform to the model to: /dev/null
[INFO 2023-03-29 21:34:29 filter] Filtering according to mas_15+sc_10x5p model ordered key adapters: A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P
Progress: 8 read [00:00, 361.86 read/s]
[INFO 2023-03-29 21:34:29 filter] Done. Elapsed time: 1.8928s.
[INFO 2023-03-29 21:34:29 filter] Total Reads Processed: 8
[INFO 2023-03-29 21:34:29 filter] # Reads Passing Model Filter: 8/8 (100.0000%)
[INFO 2023-03-29 21:34:29 filter] # Reads Failing Model Filter: 0/8 (0.0000%)
[INFO 2023-03-29 21:34:29 filter] Total # correctly ordered key adapters in passing reads: 110
[INFO 2023-03-29 21:34:29 filter] Total # correctly ordered key adapters in failing reads: 0
[INFO 2023-03-29 21:34:29 filter] Avg # correctly ordered key adapters per passing read: 13.7500 [16]
[INFO 2023-03-29 21:34:29 filter] Avg # correctly ordered key adapters per failing read: 0.0000 [16]
[INFO 2023-03-29 21:34:30 segment] Using mas_15+sc_10x5p: 15-element MAS-ISO-seq array, single-cell 10x 5' kit
[INFO 2023-03-29 21:34:30 segment] Model has UMI annotation.
[INFO 2023-03-29 21:34:30 segment] Model has Cell Barcode annotation.
[WARNING 2023-03-29 21:34:30 segment] Read m64020_201213_022403/1000000023003/ccs has no CBC. Ignoring.
[INFO 2023-03-29 21:34:30 segment] Segmented 8 reads with 113 total segments.
[INFO 2023-03-29 21:34:30 segment] MAS-seq gain factor: 14.12x
[INFO 2023-03-29 21:34:30 segment] Done. Elapsed time: 2.19s.
Piping segment into extract fails:
(venv) root@7b52f75fcedf:/home/ccs# longbow segment -m mas_15+sc_10x5p test_filter3.bam | longbow extract -o test_extract.bam
[INFO 2023-03-29 21:36:04 segment] Invoked via: longbow segment -m mas_15+sc_10x5p test_filter3.bam
[INFO 2023-03-29 21:36:04 segment] Running with 5 worker subprocess(es)
[INFO 2023-03-29 21:36:04 segment] Using simple splitting mode.
[E::idx_find_and_load] Could not retrieve index file for 'test_filter3.bam'
[INFO 2023-03-29 21:36:05 extract] Invoked via: longbow extract -o test_extract.bam
[INFO 2023-03-29 21:36:05 extract] Writing extracted read segments to: test_extract.bam
[INFO 2023-03-29 21:36:05 extract] Including 2 flanking bases.
Traceback (most recent call last):
File "/longbow/venv/bin/longbow", line 11, in <module>
load_entry_point('maslongbow', 'console_scripts', 'longbow')()
File "/longbow/venv/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/longbow/venv/lib/python3.7/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/longbow/venv/lib/python3.7/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/longbow/venv/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/longbow/venv/lib/python3.7/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/longbow/src/longbow/commands/extract.py", line 145, in main
lb_model = bam_utils.load_model(model, input_bam)
File "/longbow/src/longbow/utils/bam_utils.py", line 235, in load_model
with pysam.AlignmentFile(input_bam_path, "rb", check_sq=False, require_index=False) as bam_file:
File "pysam/libcalignmentfile.pyx", line 751, in pysam.libcalignmentfile.AlignmentFile.__cinit__
File "pysam/libcalignmentfile.pyx", line 950, in pysam.libcalignmentfile.AlignmentFile._open
FileNotFoundError: [Errno 2] could not open alignment file `<stdin>`: No such file or directory
[INFO 2023-03-29 21:36:06 segment] Using mas_15+sc_10x5p: 15-element MAS-ISO-seq array, single-cell 10x 5' kit
Progress: 0 read [00:00, ? read/s][INFO 2023-03-29 21:36:06 segment] Model has UMI annotation.
[INFO 2023-03-29 21:36:06 segment] Model has Cell Barcode annotation.
[WARNING 2023-03-29 21:36:06 segment] Read m64020_201213_022403/1000000023003/ccs has no CBC. Ignoring.
Process Process-7:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/usr/local/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/longbow/src/longbow/commands/segment.py", line 290, in _sub_process_write_fn
model_annotates_umi
File "/longbow/src/longbow/commands/segment.py", line 560, in _write_segmented_read
sri
File "/longbow/src/longbow/commands/segment.py", line 614, in _write_split_array_element
bam_out.write(a)
File "pysam/libcalignmentfile.pyx", line 1709, in pysam.libcalignmentfile.AlignmentFile.write
File "pysam/libcalignmentfile.pyx", line 1741, in pysam.libcalignmentfile.AlignmentFile.write
OSError: sam_write1 failed with error code -1
[INFO 2023-03-29 21:36:07 segment] Segmented 2 reads with 28 total segments.
[INFO 2023-03-29 21:36:07 segment] MAS-seq gain factor: 14.00x
[INFO 2023-03-29 21:36:07 segment] Done. Elapsed time: 2.08s.
Extract alone works:
(venv) root@7b52f75fcedf:/home/ccs# longbow extract -m mas_15+sc_10x5p test_segment2.bam -o test_extract.bam
[INFO 2023-03-29 21:37:50 extract] Invoked via: longbow extract -m mas_15+sc_10x5p test_segment2.bam -o test_extract.bam
[INFO 2023-03-29 21:37:50 extract] Writing extracted read segments to: test_extract.bam
[INFO 2023-03-29 21:37:50 extract] Including 2 flanking bases.
[E::idx_find_and_load] Could not retrieve index file for 'test_segment2.bam'
[INFO 2023-03-29 21:37:52 extract] Using mas_15+sc_10x5p: 15-element MAS-ISO-seq array, single-cell 10x 5' kit
[INFO 2023-03-29 21:37:52 extract] Extracting coding region from model mas_15+sc_10x5p: cDNA
Progress: 112 read [00:00, 930.07 read/s]
[INFO 2023-03-29 21:37:52 extract] Done. Elapsed time: 1.96s.
[INFO 2023-03-29 21:37:52 extract] Total # Reads Processed: 112
[INFO 2023-03-29 21:37:52 extract] # Reads Containing Extracted Segments: 112/112 (100.0000%)
[INFO 2023-03-29 21:37:52 extract] Total # Segments Extracted: 112
[INFO 2023-03-29 21:37:52 extract] Total # Segments Skipped: 0
[INFO 2023-03-29 21:37:52 extract] # Segments extracted per read: 1.00
I think I see what is happening, the issue is here. The type of stdin is not str
, and so it is using name
which is <stdin>
. This is because pysam doesn't want to get an actual pipe, it wants the str '-'
. edit: Actually, I imagine that specific code was trying to make this work...did the name of the stdin object change?
I thought we had fixed this type of thing recently but perhaps introduced a different bug instead.
Ah...I was mistaken in terms of what was broken. The thing that changed is that we now try to read the model before we start processing the file, and this causes a problem when filter
is reading from stdin
, because it wants to read the stream twice. In 0.5.37
it opened the file once, read the model from the header (if needed) and then kept going.
The simple fix right now is to pass the model again to each command, e.g.
longbow annotate -m mas_15+sc_10x5p mas15_test_input.bam | tee ann.bam | longbow filter -m mas_15+sc_10x5p | ... etc
But obviously this is kind of clunky. The better fix is to go back to the old ordering, which will take a little bit of work but should be doable.
There seems to be some issue in the 0.6.12 related to pysam, I've copied the output of the small test below. The same issue was observed both in the docker version and when building from source.
The 0.5.37 version on PyPI is working. Also curious what processing rates to expect? I'm seeing ~12reads/sec rate on a 64 core 256Gb machine. Thanks!