Closed littletiger311 closed 3 years ago
Hello littletiger311,
Yes, the authors removed the barcodes before uploading, and they also concatenated all samples into one file. However, we can separate the samples by the second part of the read name, which I guess is the flowcell id. I have attached a file generated with

    zcat SRP055007.fastq.gz | awk '{if(NR%4==1) print substr($2,1,3)}' | uniq

This extracts the first 3 characters of the flowcell id (or whatever that field is) from each read header and collapses consecutive duplicates. Each sample should have 3_1 and 3_2. You might also notice some 7_1 and 7_2; this is because they have top-ups for 17 individuals. See their paper here.
To summarize, you need to find the boundaries between the individual samples in the FASTQ file, which should be the positions where the flowcell id changes from 3_2 to 3_1 (or from 7_2 to 3_1 for those 17 samples).
You will end up with 62 individuals; remember that there are two parents. To work out which sample is which, you can go back to their paper and check the read numbers.
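In case it helps, here is a minimal Python sketch of the boundary search described above. It is not part of polyGembler; the function names `read_prefixes` and `sample_boundaries` are made up for illustration. It assumes a standard 4-line-per-record FASTQ, and that within each sample the prefixes appear in the order 3_1, 3_2 (then 7_1, 7_2 for the top-ups), so a new sample starts wherever the prefix returns to 3_1.

```python
import gzip
from itertools import islice


def read_prefixes(path, width=3):
    """Yield the first `width` characters of the second header field of each read.

    Assumes 4 lines per FASTQ record (hypothetical helper, mirrors the awk command).
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        # Header lines are lines 1, 5, 9, ... (index 0, step 4).
        for header in islice(handle, 0, None, 4):
            yield header.split()[1][:width]


def sample_boundaries(prefixes, start="3_1"):
    """Return a list of (first_read_index, read_count), one entry per sample.

    A new sample is assumed to begin whenever the prefix switches back to
    `start` from anything else (3_2, or 7_2 for the top-up samples).
    """
    samples = []
    begin = 0
    previous = None
    count = 0
    for i, prefix in enumerate(prefixes):
        if previous is not None and prefix == start and previous != start:
            samples.append((begin, i - begin))
            begin = i
        previous = prefix
        count = i + 1
    if count:
        # Close the last sample at the end of the file.
        samples.append((begin, count - begin))
    return samples
```

Calling `sample_boundaries(read_prefixes("SRP055007.fastq.gz"))` should then give one entry per individual, and the read counts in the second column can be compared against the numbers reported in the paper. If the list does not have 62 entries, the ordering assumption above probably does not hold for this file.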
Best, Chenxi
Hi Dr. Zhou, thank you very much for the instruction. It's really helpful, and saved my day.
No problem. Thanks for using PolyGembler. I am going to close this issue now.
Best, Chenxi
Hi Dr. Zhou, I'm trying to use the Z. japonica F1 RAD-seq data to get familiar with polyGembler, as described in the NG paper. Unfortunately, it seems that the barcodes have been lost in the multiplexed F1 RAD-seq data downloaded from NCBI (SRP055007) (see a few reads below). I’m wondering whether you have encountered a similar problem and how you solved it. Thank you very much.
The reads below seem to start with the enzyme cutting site (TGCAT), not the barcode sequences.
@SRR1804247.23 3_1101_7443_1668_1/1
TGCATGGTGATCTGTTGTTCGTTCGCATGTCACTGGTCTCTGCTGTTCAGTTCTTCACCCACGCGCCTAGCCATTTGCATCCTGCATGATTGAGG
+
FFFHHHHFGHJJJJJJJJJJJJJJIJJJJIJJJJJJJJJJJJJJJIJJJJIJJJJJIJIIJIJHHFDDDDDDDDDEEDEEDDDDDDDDDCEDDDD
@SRR1804247.24 3_1101_7286_1679_1/1
TGCATTTAGTGCATCTACAGCCCATTTTCGCTCTGTTTTTTCATGGTGACCAAAACATAACATAGATGTGGAAATATGAATTTGGAGTCCACCGG
+
FFFHHHHHIIJJJJJJJJJJJJJJJJJJIJJJJJGIJJJJJJJJJJJJJJJJJIJJJJJGHHHHHHHFFFFFFEEEFEEDEEDDDDDDDDDDDDD
@SRR1804247.25 3_1101_7373_1708_1/1
TGCATTTAACGCATCGGGAGTTCAGGCATTGAGCCCCTTACGTGCGTTCAAGCACATCGAGCACTTTTCGTTGCTCCTCCTTAGATCAATAGCTG
+
FFFHHHHHJJJJJJJJJJJJHIJJJJJJJJIIGIIJJJJJJJJJJJGIJHHHHHFFFFFEDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDEDDDD
@SRR1804247.26 3_1101_7440_1728_1/1
TGCATTCTATTGGTCTAGTAGATAAGCAGTCGCCGCCTTTTATGGGATGATGAGGTCCATTCGCTAGTTTTACATTTTCTAACAGTCTTTAGATC
+
FFFHHHHHJJJJJIIJIJIJJJJJJJJJJIJIJJJJJJJJJJJJJJIIJJJJIJJHIJIHHHHHFFFEEEEECEEFEEEDEEDDDDDDEDDDDDD
@SRR1804247.27 3_1101_7618_1675_1/1
TGCATGATATTTTGGTTTGAAACATTGCGTTAGTCGTAATCCTACTTGTTCTTAGGTATTTATATTTATGTCAGTAACTTGCTGCATAAATATGA
+
FFFHHHHHJJJJJJJIIJJJJJJJJJJJJJJIJIJJJJJJJJJJJJJJJJJJJJJJHIJJJJJJJJJJHFHHHHHFFFFFFFEEEDDEDDEEFEE
@SRR1804247.28 3_1101_8258_1676_1/1
TGCATTGACCCAAGGAAAGGAAAGCACGTCGAAAAATTCTGGGTTCATCATCTTCCAACTCAGTCTCTTTAGATCGGAAGAGCACACGTCTGAAC
+
FFFHHHHHJJJJJJJJJJJJJJIJJJIJIJJJJJJJJJJJJIJHIIJJJJJIJJJJJHHHHHHFFFFFFEEEEEEDDDDDDDDDDDDDBDDDDDD
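To check that this holds beyond a handful of reads, a rough sketch like the following could count how many reads start with the cut site. The function name `cut_site_fraction` and the `max_reads` limit are only illustrative, and it assumes a standard 4-line-per-record FASTQ.

```python
import gzip
from itertools import islice


def cut_site_fraction(path, site="TGCAT", max_reads=100000):
    """Return the fraction of the first `max_reads` reads that start with `site`."""
    opener = gzip.open if path.endswith(".gz") else open
    total = 0
    hits = 0
    with opener(path, "rt") as handle:
        # Sequence lines are lines 2, 6, 10, ... (index 1, step 4).
        for seq in islice(handle, 1, 4 * max_reads, 4):
            total += 1
            if seq.startswith(site):
                hits += 1
    return hits / total if total else 0.0
```

A value close to 1.0 for `cut_site_fraction("SRP055007.fastq.gz")` would suggest that the barcodes were stripped from essentially all reads, not only from the few shown here.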