dereneaton / ipyrad

Interactive assembly and analysis of RAD-seq data sets
http://ipyrad.readthedocs.io
GNU General Public License v3.0

Barcode file not being read, possibly due to formatting issue on my end? #433

Closed konradtaube closed 3 years ago

konradtaube commented 3 years ago

This may be a little long, and I apologize! I just want to make sure I am being thorough in describing my issue. Basically, I am getting stuck in ipyrad right away at step 1, and I believe it is due to my barcodes text file, but I am not sure.

Here is the error message I am receiving with the following command:

ipyrad -p params-masters-thesis.txt -s 1 -r

ipyrad.assemble.utils.IPyradError: One or more barcodes contain invalid IUPAC nucleotide code characters. Barcodes must contain only characters from this list "RKSYWMCATG". Doublecheck your barcodes file is properly formatted.

Now, when I look at my R1 and R2 files, I get the following for R1:

gunzip -c ./201105_AHLVWJDSXY/McMahan-GBS_S29_L001_R1_001.fastq.gz | head -n 12
@A00589:212:HLVWJDSXY:1:1101:2049:1000 1:N:0:CGATGACCTC+TCGTCTCACG
ATCTGACTTGCAGTGATTAACAAAGCTTATCCCGCAGGAGGGCGGTTATTGTGCCTTTAAAGTGTAATATTTCCAGCCGAGGGTCCGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGCGATGACCTCATCTCGTATGCCGTCTT
+
FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFF:FF,FFF:F:FFFFFFF
@A00589:212:HLVWJDSXY:1:1101:2302:1000 1:N:0:CGATGACCTC+TCGTCTCACG
TATAAGCAGTGCAGTAACGTGGTGGTGTTTGGTTATGTGTGCGATTTACGACAGCACAGTAAACACACCCGTCATGGCAGAGAGGGACGACTGCTACAGTTCACAAGTTCAGAGGAAATGATCATTTACAGGCTGAAAAAATTATTATTTT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFF:
@A00589:212:HLVWJDSXY:1:1101:2555:1000 1:N:0:CGATGACCTC+TCGTCTCACG
AAGGCCAACTTGCAGCCACTTCCTTCACAGTCCGATTCATGTACACAGCATCGTTTTTCAGCACATCTTCATTTGTAAGTACACTTTAATAAGAAATATTAACATTTCACTTTCATTTTAATCAGCCACATCGTAACCCAGCTTGAATTAT
+
FFFF,FFFFFFFF:FFFFFFFF:F:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFF,FFF:FFFFFFFFFFFFF:FFFFFFFFF::FFFFFFFFFFFFFFFFFFFFFFF:FFFFF,FFFFFFFFFFFF,

And for R2:

gunzip -c ./201105_AHLVWJDSXY/McMahan-GBS_S29_L001_R2_001.fastq.gz | head -n 12
@A00589:212:HLVWJDSXY:1:1101:2049:1000 2:N:0:CGATGACCTC+TCGTCTCACG
CGGACCCTCGGCTGGAAATATTACACTTTAAAGGCACAATAACCGCCCTCCTGCGGGATAAGCTTTGTTAATCACTGCAAGTCAGATAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTCGTGGGTTGTGGGGTGGGGGGGGGGGGGGGG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFF,F,,F,F:F,,F,,F,FFF,FF::FF,
@A00589:212:HLVWJDSXY:1:1101:2302:1000 2:N:0:CGATGACCTC+TCGTCTCACG
CGGTTTCTTCCCAAAGTCTCAGGTTTTCTGCACAGTCGAGCCAAACTGACAGAATCACATCTTGTCGCAGTTCTGTGTTTCTTTTCAACCTCATGAGGTCCTCCTCTCCCTTCTTGTCACAACCCAACAGGTGAACTTTTCACCTCCATTT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFF:FFFFFFF
@A00589:212:HLVWJDSXY:1:1101:2555:1000 2:N:0:CGATGACCTC+TCGTCTCACG
CGGGTGTTGGTTCTTGAAGAGTTACTTCTGTTTCTTTGTTGTATAATTCAAGCTGGGTTACGATGTGGCTGATTAAAATGAAAGTGAAATGTTAATATTTCTTATTAAAGTGTACTTACAAATGAAGATGTGCTGAAAAACGATGCTGTGT
+
FFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFF:F

My thought is that it has to do with my barcodes text file. Here is the exact file I am using, pasted from the .txt file that was given to me by the sequencing facility (UW-Madison):

CUTTER=TGCA

CTCTCCAG 384C_A01
TAATTG 384C_B01
ATCTCGT 384C_C01
GACAACT 384C_D01
CTCGCAA 384C_E01
TGGACACT 384C_F01
TGTCAAT 384C_G01
TCCTGCT 384C_H01
GAACTT 384C_A02
ATGCT 384C_B02
ATTCCAA 384C_C02
GACACACT 384C_D02
CGCGT 384C_E02
CATACGCG 384C_F02
CTATCACT 384C_G02
CTGAACCA 384C_H02
TCTCCGT 384C_A03
TGTACA 384C_B03
AAGCAACT 384C_C03
ACCGA 384C_D03
GTAAG 384C_E03
TGATCGCT 384C_F03
TGCGG 384C_G03
ACTAA 384C_H03
GAGGTCCT 384C_A04
TAGCTAT 384C_B04
CAGCGCAAGA 384C_C04
GCTCGCCAT 384C_D04
TGTACCAG 384C_E04
TGTACGCA 384C_F04
TTGGCGCT 384C_G04
GTTCACA 384C_H04
CATGG 384C_A05
ACTACAAT 384C_B05
GACTAACT 384C_C05
ATGGTGA 384C_D05
TATTGCAG 384C_E05
ATCTGACT 384C_F05
GTCACGA 384C_G05
AACGACCACA 384C_H05
CGCCTCAT 384C_A06
CTTATG 384C_B06
TAGAG 384C_C06
GGCAT 384C_D06
CCGACG 384C_E06
TGGTCAAG 384C_F06
ACCAAG 384C_G06
CCATCCAA 384C_H06
GTTCGGT 384C_A07
GCCGCAAT 384C_B07
CATAAG 384C_C07
TTGAGACAG 384C_D07
ACCGTCCAT 384C_E07
GCGTGCCAGA 384C_F07
CCGAT 384C_G07
TCCTCCA 384C_H07
ACACG 384C_A08
CGCAAGA 384C_B08
ACACAACA 384C_C08
ATATT 384C_D08
GTCTCAACG 384C_E08
CCGCA 384C_F08
TCGTGACAGT 384C_G08
AATTG 384C_H08
TCCGT 384C_A09
TATAAGCAG 384C_B09
ATTCA 384C_C09
ACATGCCAG 384C_D09
TGCCTA 384C_E09
AAGGCCAACT 384C_F09
ACTCCACG 384C_G09
GGTTG 384C_H09
TTCTCA 384C_A10
CTGCCGT 384C_B10
TTCCA 384C_C10
GAGCGCT 384C_D10
TAATTAA 384C_E10
TGTGAGG 384C_F10
TGTTGACG 384C_G10
TACCT 384C_H10
CCAGGA 384C_A11
GGATGA 384C_B11
ACAGAAT 384C_C11
ATACTGAG 384C_D11
CTCCAA 384C_E11
TTAGGA 384C_F11
CCAAGACAGT 384C_G11
CATTGA 384C_H11
TCATT 384C_A12
GAATAGA 384C_B12
TTCTG 384C_C12
ACCTAA 384C_D12
GCGTAG 384C_E12
CGTAGCAACA 384C_F12
AAGCAGA 384C_G12
CAATTGCT 384C_H12

For comparison, here is the tutorial barcodes .txt file (cat ./ipsimdata/rad_example_barcodes.txt), which contains:

1A_0 CATCAT
1B_0 AGTGAT
1C_0 ATGGTA
1D_0 GTAGGA
2E_0 AAAGTG
2F_0 GATATA
2G_0 GAGGAG
2H_0 GGGATT
3I_0 TAATTA
3J_0 TGAGGG
3K_0 TGTAGT
3L_0 GTGTGT

There does seem to be some difference between my file and the tutorial file. If my guess is correct and the error IS with my own text file, is there an easy way to make it acceptable to the program? And if not, what might be causing this issue? I can show the contents of my params-masters-thesis.txt file if that might help.

isaacovercast commented 3 years ago

Hello,

More info is always better than less when troubleshooting, so don't worry about the length of the post.

Yes, there is one very big difference between your file and the example: the order of sample names and barcode sequences is reversed. Your file lists the barcode first and the sample name second, while the example lists the sample name first and the barcode second. If you swap the sample names and the barcode sequences in your file, it should work fine. It would probably also be a good idea to remove that first line (#CUTTER=TGCA).
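The column swap suggested above can be sketched in a few lines of Python. This is only an illustration, not part of ipyrad: the function name `convert_barcodes` and the sample text are made up for the example, and it assumes the facility file has exactly two whitespace-separated columns (barcode, then sample name) plus an optional CUTTER header line. It also assumes that tab-separated output is acceptable to ipyrad, in line with the two-column tutorial file.

```python
# Characters listed as valid in the ipyrad error message quoted in this thread.
VALID_BARCODE_CHARS = set("RKSYWMCATG")


def convert_barcodes(text):
    """Turn 'barcode name' lines into 'name<TAB>barcode' lines.

    Blank lines and a header such as CUTTER=TGCA (with or without a
    leading '#') are dropped. This helper is hypothetical, not an ipyrad API.
    """
    out = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.lstrip("#").upper().startswith("CUTTER"):
            continue  # skip blank lines and the cutter header
        barcode, name = line.split()  # expects exactly two columns
        if not set(barcode.upper()) <= VALID_BARCODE_CHARS:
            raise ValueError(f"unexpected characters in barcode: {barcode!r}")
        out.append(f"{name}\t{barcode}")
    return "\n".join(out) + "\n"


# First three entries of the facility file from this thread.
facility_text = """CUTTER=TGCA

CTCTCCAG 384C_A01
TAATTG 384C_B01
ATCTCGT 384C_C01
"""

print(convert_barcodes(facility_text), end="")
```

To use something like this on the real file, read the facility file's contents, pass them through the function, write the result to a new text file, and point the barcodes path in the params file at that new file. The character check is there so that a file already in sample-name-first order raises an error instead of being swapped a second time.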