churchill-lab / alntools

Preprocess bam file into a compressed alignment incidence matrix (equivalence class)
https://churchill-lab.github.io/alntools/
MIT License

bam2ec bam_utils.py array/list issue #7

Closed. averydavisbell closed this issue 3 years ago

averydavisbell commented 3 years ago

I am just getting started with alntools and ran into an issue with bam2ec that I was able to remedy by modifying one line of code; I thought I'd let you know in case this is an issue others might have! (I don't feel confident enough to try to contribute to the codebase.)

Call: Using Anaconda 2.1 (I couldn't get EMASE dependencies to resolve with newer versions of conda) on Linux (high-performance compute cluster), with the alntools environment created and loaded as specified in https://churchill-lab.github.io/alntools/. The BAM was aligned with bowtie2 and has ~3M reads.

alntools bam2ec -t /storage/home/hcoda1/2/abell65/scratch/testdiptranscripts/firsttest/N2ws263_JU1088/emase.pooled.transcripts.info -c 1 --verbose AP01-88_alignedN2ws263_JU1088.bam AP01-88.bin

Output including error:

[alntools] [12/02/2020 12:19:08 PM] Sample not supplied, using filename: AP01-88_alignedN2ws263_JU1088.bam
[alntools] [12/02/2020 12:19:08 PM] Parsing file information ...
[alntools] [12/02/2020 12:19:09 PM] File parsed in 00:00:00.82, total time: 00:00:00.82
[alntools] [12/02/2020 12:19:09 PM] Calculating 1 chunks
[alntools] [12/02/2020 12:19:09 PM] 1 chunks calculated in 00:00:00.03, total time: 00:00:00.84
[alntools] [12/02/2020 12:19:09 PM] Starting 1 processes ...
[alntools] [12/02/2020 12:20:08 PM] DONE Process ID: 0, File: /storage/scratch1/2/abell65/testemase/work/d3/03f099aa022605de44db6fc7a20a38/_bam2ec.0.bam, 19,626,295 valid alignments processed out of 19,658,429, with 28,333 equivalence classes
[alntools] [12/02/2020 12:20:11 PM] Process 1 done out of 1, combining result
[alntools] [12/02/2020 12:20:11 PM] All results combined in 00:01:01.78, total time: 00:01:02.62
[alntools] [12/02/2020 12:20:11 PM] # Valid Alignments: 19,626,295
[alntools] [12/02/2020 12:20:11 PM] # Main Targets: 183,501
[alntools] [12/02/2020 12:20:11 PM] # Haplotypes: 2
[alntools] [12/02/2020 12:20:11 PM] # Equivalence Classes: 28,333
[alntools] [12/02/2020 12:20:11 PM] # Unique Reads: 3,224,924
[alntools] [12/02/2020 12:20:11 PM] Constructing temp APM structure...
Traceback (most recent call last):
  File "/storage/coda1/p-apaaby3/0/abell65/software/anaconda2.1.0/envs/alntools/bin/alntools", line 4, in <module>
    __import__('pkg_resources').run_script('alntools==0.1.1', 'alntools')
  File "/storage/coda1/p-apaaby3/0/abell65/software/anaconda2.1.0/envs/alntools/lib/python2.7/site-packages/pkg_resources/__init__.py", line 666, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/storage/coda1/p-apaaby3/0/abell65/software/anaconda2.1.0/envs/alntools/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1462, in run_script
    exec(code, namespace, namespace)
  File "/storage/coda1/p-apaaby3/0/abell65/software/anaconda2.1.0/envs/alntools/lib/python2.7/site-packages/alntools-0.1.1-py2.7.egg/EGG-INFO/scripts/alntools", line 29, in <module>
    cli()
  File "/storage/coda1/p-apaaby3/0/abell65/software/anaconda2.1.0/envs/alntools/lib/python2.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/storage/coda1/p-apaaby3/0/abell65/software/anaconda2.1.0/envs/alntools/lib/python2.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/storage/coda1/p-apaaby3/0/abell65/software/anaconda2.1.0/envs/alntools/lib/python2.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/storage/coda1/p-apaaby3/0/abell65/software/anaconda2.1.0/envs/alntools/lib/python2.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/storage/coda1/p-apaaby3/0/abell65/software/anaconda2.1.0/envs/alntools/lib/python2.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/storage/coda1/p-apaaby3/0/abell65/software/anaconda2.1.0/envs/alntools/lib/python2.7/site-packages/alntools-0.1.1-py2.7.egg/alntools/cli.py", line 66, in bam2ec
    methods.bam2ec(bam_file, ec_file, chunks, directory, number_processes, rangefile, sample, targets)
  File "/storage/coda1/p-apaaby3/0/abell65/software/anaconda2.1.0/envs/alntools/lib/python2.7/site-packages/alntools-0.1.1-py2.7.egg/alntools/methods.py", line 33, in bam2ec
    bam_utils.convert(bam_filename, ec_filename, None, num_chunks=chunks, number_processes=number_processes, temp_dir=directory, range_filename=range_filename, sample=sample, target_filename=target_filename)
  File "/storage/coda1/p-apaaby3/0/abell65/software/anaconda2.1.0/envs/alntools/lib/python2.7/site-packages/alntools-0.1.1-py2.7.egg/alntools/bam_utils.py", line 828, in convert
    read_names=ec_ids.astype(str),

Fix: I modified line 828 of bam_utils.py to convert ec_ids to a numpy array before calling .astype, so that the line would work: read_names=np.array(ec_ids).astype(str),
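
For anyone hitting the same thing, here is a minimal sketch of why the one-line change helps. It assumes `ec_ids` arrives as a plain Python list, as the issue title suggests; the traceback above does not show the final error message. The values in `ec_ids` below are a hypothetical stand-in, not data from alntools.

```python
import numpy as np

# Hypothetical stand-in for ec_ids; the real values are built inside bam_utils.convert.
ec_ids = [0, 1, 2]

# A plain Python list has no astype method, so calling it raises AttributeError.
try:
    ec_ids.astype(str)
    list_has_astype = True
except AttributeError:
    list_has_astype = False

# Wrapping the list in a NumPy array first makes astype available.
read_names = np.array(ec_ids).astype(str)

print(list_has_astype)
print(read_names.tolist())
```

If `ec_ids` can sometimes already be an array, `np.asarray(ec_ids)` would probably serve equally well here, since it skips the copy when the input is already an ndarray of a matching type.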

After this change, bam2ec completes:

[alntools] [12/02/2020 12:33:05 PM] Sample not supplied, using filename: AP01-88_alignedN2ws263_JU1088.bam
[alntools] [12/02/2020 12:33:05 PM] Parsing file information ...
[alntools] [12/02/2020 12:33:06 PM] File parsed in 00:00:00.92, total time: 00:00:00.92
[alntools] [12/02/2020 12:33:06 PM] Calculating 1 chunks
[alntools] [12/02/2020 12:33:06 PM] 1 chunks calculated in 00:00:00.03, total time: 00:00:00.94
[alntools] [12/02/2020 12:33:06 PM] Starting 1 processes ...
[alntools] [12/02/2020 12:34:03 PM] DONE Process ID: 0, File: /storage/scratch1/2/abell65/testemase/work/d3/03f099aa022605de44db6fc7a20a38/_bam2ec.0.bam, 19,626,295 valid alignments processed out of 19,658,429, with 28,333 equivalence classes
[alntools] [12/02/2020 12:34:06 PM] Process 1 done out of 1, combining result
[alntools] [12/02/2020 12:34:06 PM] All results combined in 00:01:00.03, total time: 00:01:00.97
[alntools] [12/02/2020 12:34:06 PM] # Valid Alignments: 19,626,295
[alntools] [12/02/2020 12:34:06 PM] # Main Targets: 183,501
[alntools] [12/02/2020 12:34:06 PM] # Haplotypes: 2
[alntools] [12/02/2020 12:34:06 PM] # Equivalence Classes: 28,333
[alntools] [12/02/2020 12:34:06 PM] # Unique Reads: 3,224,924
[alntools] [12/02/2020 12:34:06 PM] Constructing temp APM structure...
[alntools] [12/02/2020 12:34:07 PM] APM Created in 00:00:00.95, total time: 00:01:01.92
[alntools] [12/02/2020 12:34:07 PM] Matrix created in 00:00:00.08, total time: 00:01:02.00
[alntools] [12/02/2020 12:34:07 PM] Generating BIN file...
[alntools] [12/02/2020 12:34:07 PM] FORMAT: 2
[alntools] [12/02/2020 12:34:07 PM] NUMBER OF HAPLOTYPES: 2
[alntools] [12/02/2020 12:34:07 PM] NUMBER OF TARGETS: 183,501
[alntools] [12/02/2020 12:34:07 PM] FILTERED CRS: 1
[alntools] [12/02/2020 12:34:07 PM] Determining mappings...
[alntools] [12/02/2020 12:34:07 PM] A MATRIX: INDPTR LENGTH 28,334
[alntools] [12/02/2020 12:34:07 PM] A MATRIX: NUMBER OF NON ZERO: 115,786
[alntools] [12/02/2020 12:34:07 PM] A MATRIX: LENGTH INDPTR: 28,334
[alntools] [12/02/2020 12:34:07 PM] A MATRIX: LENGTH INDICES: 115,786
[alntools] [12/02/2020 12:34:07 PM] A MATRIX: LENGTH DATA: 115,786
[alntools] [12/02/2020 12:34:07 PM] N MATRIX: NUMBER OF EQUIVALENCE CLASSES: 28,333
[alntools] [12/02/2020 12:34:07 PM] N MATRIX: LENGTH INDPTR: 2
[alntools] [12/02/2020 12:34:07 PM] N MATRIX: NUMBER OF NON ZERO: 28,333
[alntools] [12/02/2020 12:34:07 PM] N MATRIX: LENGTH INDPTR: 2
[alntools] [12/02/2020 12:34:07 PM] N MATRIX: LENGTH INDICES: 28,333
[alntools] [12/02/2020 12:34:07 PM] N MATRIX: LENGTH DATA: 28,333
[alntools] [12/02/2020 12:34:07 PM] /storage/scratch1/2/abell65/testemase/work/d3/03f099aa022605de44db6fc7a20a38/AP01-88.bin created in 00:00:00.78, total time: 00:01:02.78

kbchoi-jax commented 3 years ago

Hi Avery! Thank you very much for your note. alntools was originally developed in py2, and I am currently porting it to py3. There may be some more hiccups along these lines. I am embarrassed that newer conda versions no longer support some of the older dependencies that EMASE had to use. The installation used to be a one-liner, but now it is broken, as you pointed out. I will get to it soon. Anyway, I see you are performing allele-specific expression analysis. Please feel free to contact me at kb.choi@jax.org if you run into any other issues running EMASE.