Illumina / BeadArrayFiles

Python library to parse file formats related to Illumina bead arrays

Error on calling TOP_strand genotypes #28

Closed zihhuafang closed 3 years ago

zihhuafang commented 3 years ago

Hi, thanks for providing this Python library for array genotype calling.

I wanted to include TOP-strand genotypes in the final report by adding top_strand_genotypes = gtc.get_base_calls() to gtc_final_report.py, as shown below.

for gtc_file in samples:
        sys.stderr.write("Processing " + gtc_file + "\n")
        gtc_file = os.path.join(args.gtc_directory, gtc_file)
        gtc = GenotypeCalls(gtc_file)
        genotypes = gtc.get_genotypes()
        top_strand_genotypes = gtc.get_base_calls()
        plus_strand_genotypes = gtc.get_base_calls_plus_strand(manifest.snps, manifest.ref_strands)
        forward_strand_genotypes = gtc.get_base_calls_forward_strand(manifest.snps, manifest.source_strands)
        normalized_intensities = gtc.get_normalized_intensities(manifest.normalization_lookups)
        b_allele_freq = gtc.get_ballele_freqs()
        logr_ratio = gtc.get_logr_ratios()

        assert len(genotypes) == len(manifest.names)
        for (name, chrom, map_info, genotype, top_strand_genotype, ref_strand_genotype, source_strand_genotype, (x_norm, y_norm), b_freq, log_r_ratio) in zip(manifest.names, manifest.chroms, manifest.map_infos, genotypes, top_strand_genotypes, plus_strand_genotypes, forward_strand_genotypes, normalized_intensities, b_allele_freq, logr_ratio):
            output_handle.write(delim.join([name, os.path.basename(gtc_file)[:-4], chrom, str(map_info), code2genotype[genotype], top_strand_genotype, ref_strand_genotype, source_strand_genotype, str(x_norm), str(y_norm), str(b_freq), str(log_r_ratio)])  + "\n")

However, I encountered the issue below.

Traceback (most recent call last):
  File "gtc_gp2_final_report.py", line 57, in <module>
    output_handle.write(delim.join([name, os.path.basename(gtc_file)[:-4], chrom, str(map_info), code2genotype[genotype], top_strand_genotype, ref_strand_genotype, source_strand_genotype, str(x_norm), str(y_norm), str(b_freq), str(log_r_ratio)])  + "\n")
TypeError: sequence item 5: expected str instance, bytes found

When I removed the parts related to top_strand_genotype, the script worked. I am not sure what went wrong or how to fix it. If anyone has experience calling TOP genotypes, I would appreciate your input on this matter.

Thanks in advance. Zih-Hua

jjzieve commented 3 years ago

@zihhuafang I was able to reproduce this. Looks like it can return a list of byte arrays on https://github.com/Illumina/BeadArrayFiles/blob/develop/module/GenotypeCalls.py#L399 vs. a list of strings from https://github.com/Illumina/BeadArrayFiles/blob/dc4eb370fa97582db3857680cbf3071cde9a6ec5/module/GenotypeCalls.py#L307

For a quick fix, I think you can just cast the top genotypes to a string when you write them, i.e. str(top_strand_genotype) in your example.

zihhuafang commented 3 years ago

@jjzieve Thanks a lot for the quick fix. I ended up using str(top_strand_genotype, 'UTF-8') so that the genotypes print without the b'XX' wrapper. Thanks again!
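
For anyone who hits the same TypeError, here is a minimal sketch of the difference between the two casts discussed in this thread. It does not use the library at all: the sample list is made up for illustration, and to_text is a hypothetical helper, not part of BeadArrayFiles. It assumes the base calls are ASCII/UTF-8 encoded bytes, as the b'XX' output above suggests.

```python
def to_text(call, encoding="utf-8"):
    """Return call as str, decoding it first if it is bytes (hypothetical helper)."""
    if isinstance(call, (bytes, bytearray)):
        return call.decode(encoding)
    return call


# Made-up sample values standing in for what get_base_calls() might return;
# the mix of bytes and str is only there to show that both are handled.
top_strand_genotypes = [b"AA", b"AG", "--"]

# str() with no encoding returns the repr of the bytes object,
# which is where the b'..' wrapper comes from.
wrapped = str(b"AG")

# str() with an encoding (or bytes.decode) returns the decoded text.
decoded = str(b"AG", "utf-8")

# Joining works once every item is a str.
row = "\t".join(to_text(call) for call in top_strand_genotypes)

print(wrapped)
print(decoded)
print(row)
```

In the final report loop, wrapping the value as to_text(top_strand_genotype) inside delim.join([...]) should have the same effect as str(top_strand_genotype, 'UTF-8'), while also tolerating values that are already strings.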