I can't tell whether this always happens (i.e., in all unmerged loci within an assembly) or only sometimes. I have a PE dataset, and in some of the loci in the final output files the 'nnnn' separator is still there. It gets `upper()`'d to 'NNNN' in the loci and phy output files (maybe others), but the lowercase form is still retained in the alleles file:
I discovered this because the lowercase 'nnnn' are also still preserved in the seqs hdf5 array as ascii values of `110`. I found this with `locus_extracter.get_locus()`. If you pass `as_df=False`, you get the seqs back as a string of bases that have been `.upper()`'d, but if you pass `as_df=True` you get a df of ascii values per base, which includes `110`s where the lowercase n's are.
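To illustrate the difference between the two views, here is a minimal sketch using a made-up sequence and plain Python lists rather than the actual seqs hdf5 array. The helper `find_lowercase_n` is hypothetical and not part of any library; it only shows why the separator is visible as `110` in the ascii values but hidden once the string has been uppercased.

```python
LOWER_N = ord("n")  # ascii value 110, the lowercase separator base

def find_lowercase_n(codes):
    """Return the positions whose ascii value is 110 (lowercase 'n')."""
    return [i for i, code in enumerate(codes) if code == LOWER_N]

# Made-up example: two reads joined by the 'nnnn' separator.
seq = "ACGTnnnnTGCA"

# Per-base ascii values, analogous to what the as_df=True view exposes.
codes = [ord(base) for base in seq]

# Uppercased string, analogous to what the as_df=False view returns.
as_string = "".join(chr(code) for code in codes).upper()

separator_positions = find_lowercase_n(codes)
```

In the uppercased string the separator is indistinguishable from ordinary 'N' bases, so the leftover lowercase n's only show up when looking at the raw ascii values.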
Oops! I will look more into why these aren't being removed.