dereneaton / ipyrad

Interactive assembly and analysis of RAD-seq data sets
http://ipyrad.readthedocs.io
GNU General Public License v3.0

PE 'nnnn' separator #478

isaacovercast closed this issue 2 years ago

isaacovercast commented 2 years ago

I can't tell whether this happens in every unmerged locus within an assembly or only in some. I have a PE dataset, and in some of the loci in the final output files the 'nnnn' separator is still there. It gets upper()'d to 'NNNN' in the loci and phy output files (maybe others), but it's retained in lowercase in the alleles file:

JAL_2476_0     ATTCCTGTCGAACCTCATGCATGTACTCCGGTCCGCCGCTCGCACTCCTGTGCGCCGCTCGCACTCCTGTGCGCCGTTCGCAGTCCTGTCCGCCGCTCGCACTCCTGTCTGCCGCTCGCACTCCTGTCTGCCACTCGCAC---nnnnTCC>
JAL_2476_1     ATTCCTGTCGAACCTCATGCATGTACTCCGGTCCGCCGCTCGCACTCCTGTGCGCCGCTCGCACTCCTGTGCGCCGTTCGCAGTCCTGTCCGCCGCTCGCACTCCTGTCTGCCGCTCGCACTCCTGTCTGCCACTCGCAC---nnnnTCC>
JAL_2477_0     ATTCCTGTCGAACCTCATGCATGTACTCCGGTCCGCCGCTCGCACTCCTGTGCGCCGCTCGCACTCCTGTGCGCCGTTCGCAGTCCTGTCCGCCGCTCGCACTCCTGTCTGCCGCTCGCACTCCTGTCTGCCACTCGCAC---nnnnTCC>
JAL_2477_1     ATTCCTGTCGAACCTCATGCATGTACTCCGGTCCGCCGCTCGCACTCCTGTGCGCCGCTCGCACTCCTGTGCGCCGTTCGCAGTCCTGTCCGCCGCTCGCACTCCTGTCTGCCGCTCGCACTCCTGTCTGCCACTCGCAC---nnnnTCC>
JAL_2488_0     ATTCCTGTCGG-CCTCATGCATGTACTCCTGTCCGCCGCTCGCACTCCTGTGCGCCGCTCGCACTCCTGTGCGCCGCTCGCAGTCCTGTCCGCCGCTCGCACTCCTGTCCGCCGCTCGCACTCCTGTCTGCCGCTCGCAC---nnnnTCC>
JAL_2488_1     ATTCCTGTCGG-CCTCATGCATGTACTCCTGTCCGCCGCTCGCACTCCTGTGCGCCGCTCGCACTCCTGTGCGCCGCTCGCAGTCCTGTCCGCCGCTCGCACTCCTGTCCGCCGCTCGCACTCCTGTCTGCCGCTCGCAC---nnnnTCC>
JAL_2490_0     ATTCCTGTCGAACCTCATGCATGTACTCCTGTCCGCCGCTCGCACTCCTGTGCGCCGCTCGCACTCCTGTGCGCCGCTCGCAGTCCTGTCCGCCGCTCGCACTCCTGTCCGCCGCTCGCACTCCTGTCTGCCACTCGCACT--nnnnTCC>
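
For anyone wanting to check their own alleles output for this, here is a minimal sketch. It assumes each line is a sample name, whitespace, then the sequence, as in the excerpt above; `find_separator` and `split_pair` are hypothetical helper names and not part of ipyrad.

```python
def find_separator(line, sep="nnnn"):
    """Return (sample_name, index of `sep` within the sequence).

    The index is -1 when the separator is absent. The search is
    case-sensitive, so an upper-cased 'NNNN' is not matched.
    Assumes the line is 'name<whitespace>sequence'.
    """
    name, seq = line.split(None, 1)
    return name, seq.strip().find(sep)


def split_pair(seq, sep="nnnn"):
    """Split a sequence at the first separator into (left, right).

    If no separator is found, the whole sequence is returned as the
    left part and the right part is an empty string.
    """
    left, found, right = seq.partition(sep)
    return (left, right) if found else (seq, "")


# Made-up example line, much shorter than a real locus.
example = "sample_0     ACGT--nnnnTCC"
name, pos = find_separator(example)
left, right = split_pair(example.split(None, 1)[1])
```

Because the search is case-sensitive, the same helper returns -1 for the loci and phy files, where the separator has been upper-cased.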

I noticed this because the lowercase 'nnnn' is also preserved in the seqs hdf5 array as ASCII value 110, which I found using locus_extracter.get_locus(). If you tell it as_df=False you get the seqs back as strings of bases that have been .upper()'d, but if you use as_df=True you get a df of ASCII values per base, which includes 110s where the lowercase n's are.
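
To make the 110 concrete: it is the ASCII code for lowercase 'n', while uppercase 'N' is 78. Below is a small sketch of the difference between the two views, using a plain list of integers as a stand-in for one row of ASCII codes. The row is made up, and `decode_row` and `separator_columns` are hypothetical helpers, not ipyrad functions.

```python
LOWER_N = ord("n")  # 110
UPPER_N = ord("N")  # 78


def decode_row(codes):
    """Turn a sequence of ASCII codes (ints 0-255) into a string."""
    return bytes(codes).decode("ascii")


def separator_columns(codes):
    """Return the column indices holding a lowercase 'n' (110)."""
    return [i for i, code in enumerate(codes) if code == LOWER_N]


# Made-up row: 'A', 'C', '-', four lowercase 'n', 'T'.
row = [65, 67, 45, 110, 110, 110, 110, 84]

as_string = decode_row(row)       # case preserved
as_upper = as_string.upper()      # what an upper()'d view would show
sep_cols = separator_columns(row)
```

Upper-casing the decoded string hides the separator's case, whereas the raw codes still distinguish 110 from 78.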

Oops! I will look more into why these aren't being removed.

isaacovercast commented 2 years ago

Never mind. This is intentional. I was just being stupid.