EGA-archive / ont2cram

Oxford Nanopore HDF/Fast5 to CRAM conversion tool
Apache License 2.0

TypeError: a bytes-like object is required, not 'str' #15

Closed jkbonfield closed 4 years ago

jkbonfield commented 4 years ago

cram2ont sometimes fails if the byte array being decoded contains string objects:

Traceback (most recent call last):
  File "./cram2ont", line 5, in <module>
    sys.exit(cram2ont.main())
  File "/home/ubuntu/ont2cram/cram2ont.py", line 178, in main
    run( args.inputfile, args.outputdir )
  File "/home/ubuntu/ont2cram/cram2ont.py", line 171, in run
    cram_to_fast5( input_file, output_dir )
  File "/home/ubuntu/ont2cram/cram2ont.py", line 137, in cram_to_fast5
    dtype=[(col_name, a.type)] 
TypeError: a bytes-like object is required, not 'str'

In this case the relevant code is:

                        if col_name=="noname":
                            #print(f"path={a.path}, val={tag_val[:11]}")
                            dset.append(tag_val)
                        else:
                                dset.append(
                                np.array( 
                                    list(tag_val.split('\x03')) if a.type.startswith(('S','U')) else tag_val, 
                                    dtype=[(col_name, a.type)] 
                                )
                             )

Here a.type is 'S5', so the code takes the list(tag_val.split('\x03')) branch, but that produces a list of str objects:

['CTTTC', 'GTTTC', 'GTTTC', 'GTTTC', 'GTTTC', 'GTTTC', 'GTTTC', 'GTTTC', 'GTTTC', 'GTTTC', 'GTTTC', 'GTTTC', 'GTTTC', 'GTTTC', 'GTTTC', 'GTTTC', 'GTTTC', 'GTTTC', 'GTTTC', 'GTTTC', 'GTTTC', 'GTTTC', 'GTTTC', 'GTTTC', 'TTTTC', 'TTTTC', 'TTTTC', 'CCCTA', 'CCCTA', 'CCCTA', 'CCCTA', 'CCCTA', 'CCCGC', 'CCCTA', 'CCCTA', ... ]
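The difference between the two element types can be seen in isolation. This is only a minimal sketch: the `tag_val` below is a made-up two-element value, not data taken from line_br.fast5.

```python
# Hypothetical tag value: two 5-mers joined by the \x03 separator
# that cram2ont splits on.
tag_val = "CTTTC\x03GTTTC"

# str.split() on a str always returns str elements.
parts = tag_val.split("\x03")
print(parts)           # ['CTTTC', 'GTTTC']
print(type(parts[0]))  # <class 'str'>

# Encoding each element yields the bytes objects an 'S5' field stores.
encoded = [p.encode("utf-8") for p in parts]
print(encoded)         # [b'CTTTC', b'GTTTC']
```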

It appears np.array requires bytes here, i.e. a list like [b'CTTTC', b'GTTTC', ...]. The patch below appears to fix this: h5dump now shows the same data in the Events column as before. That said, I'm rather fumbling in the dark when it comes to Python.

The input data here was test_data/single-read-1/line_br.fast5. Without this change a round-trip fails with the error above.

diff --git a/cram2ont.py b/cram2ont.py
index 0acf964..f5c5092 100755
--- a/cram2ont.py
+++ b/cram2ont.py
@@ -131,12 +131,13 @@ def cram_to_fast5(cram_filename, output_dir):
                             #print(f"path={a.path}, val={tag_val[:11]}")
                             dset.append(tag_val)
                         else:
-                               dset.append(
-                                np.array( 
-                                    list(tag_val.split('\x03')) if a.type.startswith(('S','U')) else tag_val, 
-                                    dtype=[(col_name, a.type)] 
-                                )
-                             )
+                            if a.type.startswith(('S', 'U')):
+                                tag_split=tag_val.split('\x03')
+                                for i, x in enumerate(tag_split):
+                                    tag_split[i]=x.encode('utf-8')
+                                dset.append(np.array(tag_split, dtype=[(col_name, a.type)]))
+                            else:
+                                dset.append(np.array(tag_val, dtype=[(col_name, a.type)]))
                 for dset_name,columns in DSETS.items():
                     d = columns[0] if len(columns)==1 else rfn.merge_arrays(columns, flatten=True, usemask=False)
                     f.create_dataset( dset_name, data=d )
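For comparison, the same idea can be written as a small standalone helper. This is a sketch, not the project's code: the function name `split_to_structured`, the column name `model_state` and the sample value are all hypothetical. Unlike the patch, it encodes only for 'S' types and leaves 'U' columns as str, which is an assumption about what those columns need. It also fills the structured array one field at a time, rather than passing the list straight to np.array with a structured dtype.

```python
import numpy as np


def split_to_structured(tag_val, col_name, np_type, sep="\x03"):
    """Split a separator-joined string into a one-column structured array.

    Hypothetical helper for illustration; not part of cram2ont.
    """
    parts = tag_val.split(sep)
    if np_type.startswith("S"):
        # 'S' (byte string) fields hold bytes, so encode each str element.
        parts = [p.encode("utf-8") for p in parts]
    # Build a plain array first, then copy it into the named field.
    column = np.array(parts, dtype=np_type)
    out = np.empty(len(column), dtype=[(col_name, np_type)])
    out[col_name] = column
    return out


# Made-up input resembling the k-mer column described above.
events = split_to_structured("CTTTC\x03GTTTC\x03TTTTC", "model_state", "S5")
print(events.dtype.names)
print(events["model_state"].tolist())
```

Filling the field by name avoids depending on how a particular NumPy version converts non-tuple elements when given a structured dtype.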