Closed niemasd closed 3 years ago
Thanks for your testing of pTrimmer!
Could you please provide a little bit of primer information (BED format) and corresponding reads, so that I can figure out what to do then?
Xiaolong Zhang
在 2021年6月3日,22:58,Niema Moshiri @.***> 写道:
I'm very interested in using pTrimmer for my trimming needs for high-throughput SARS-CoV-2 amplicon sequencing, but we were given the primer information as a BED file with respect to the reference SARS-CoV-2 genome. Is there any way to easily convert between BED + reference genome FASTA to the amplicon_file.txt format that pTrimmer needs?
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or unsubscribe.
Thanks for the response! Here's a directory with some example files:
https://github.com/niemasd/ViReflow/tree/main/demo
NC_045512.2.fas
is the reference genome sequencesarscov2_v2_primers_swift.bed
is the BED filetest_R1.fastq
is the read1 FASTQtest_R2.fastq
is the read2 FASTQI have checked your BED file and FASTQ files, but I found that: (1) It's hard to construct amplicon_file.txt with the given BED file, because it's not easy to determine which two primer sequence belong to a primer pair. (2) I am not sure whether there is something wrong with the primer file (sarscov2_v2_primers_swift.bed), following is an example
# sarscov2_v2_primers_swift.bed
NC_045512.2 182 200 covid19genome_200-29703_s7490_U_88F
NC_045512.2 301 322 covid19genome_200-29703_s7490_U_88R
# construct amplicon.txt with the above primer record
CTGCAGGCTGCTTACGGTT CACGTCCAACTCAGTTTGCCTG 100
# test_R1.fastq
@A00953:232:HY3GLDRXX:1:2101:22480:18505 1:N:0:CCATACCGCA+TTATCCGCCG
CTGCAGGCTGCTTACGGTTTCGTCCGTGTTGCAGCCGATCATCAGCACATCTAGGTTTTGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGCCTAGATCGGAAGA
CTGCAGGCTGCTTACGGTT CACGTCCAACTCAGTTTGCCTG
Assume that s7490_U_88F and s7490_U_88R is forward and reverse primer sequence, we can extract the primer sequence from FASTA file and get the insert length: 301-200+1=100. However, it is strange that the forward primer sequence (CACGTCCAACTCAGTTTGCCTG) is not 'reverse complement' in your read sequence (because 151nt > 100nt)
Therefore, I hope you can double check your BED file or figure out whether the amplicon is in the condition of 'read-through'
Thank you for investigating! I reached out to my sequencing collaborators to double check that my information is correct.
Regarding the BED file, we are sure it is correct, as it was given to us directly from the sequencing protocol company (Swift BioSciences), and it has been working well using iVar Trim to trim
I believe you are correct that ...F
denotes the forward primer sequence and that ...R
denotes the reverse primer sequence. I'm not sure why the reverse primer sequence is not the reverse complement, but I imagine they designed it that way to make trimming easier(?), or at least that's what my sequencing collaborators suggested
Yeah, you are right! The BED file is correct. Because the primer record with a marker of 'R' should be convert to reverse complement sequence. Therefore, the correct amplicon record should be:
# construct amplicon.txt with the above primer record
CTGCAGGCTGCTTACGGTT CAGGCAAACTGAGTTGGACGTG 102 covid19genome_200-29703_s7490_U_88
Then, I write a Python script (Bed2Amplicon.py) to convert BED file to Amplicon file (input of pTrimmer)
Attention:
(1) index file of reference.fasta.fai could be obtained with 'samtools faidx'
(2) coordinate should be 0-based reference position (like your BED file)
(3) a marker of 'R' (reverse primer) or 'F' (forward primer) is necessary in the end of 4-th field (also like your BED file)
By the way, I run with your FASTQ file and found only 80% of the reads are properly trimmed. For the untrimmed reads, most of them have a tail of 'GGGGG...'. In addition, part of the read sequence is different from reference genome.
Awesome, thank you so much! I'm excited to try this out!
Regarding the read trimming issues, thanks for looking into that! We're in the process of trying to optimize our sequencing workflow (which is why I'm interested in pTrimmer), so we're exploring strange things like that to see what's causing them and how to fix them. I appreciate the help!
I made a small update to the Bed2Amplicon.py
script's shebang so that it supports a python
executable anywhere in the PATH
(rather than specifically in /usr/bin/python
):
I'm very interested in using pTrimmer for my trimming needs for high-throughput SARS-CoV-2 amplicon sequencing, but we were given the primer information as a BED file with respect to the reference SARS-CoV-2 genome. Is there any way to easily convert between BED + reference genome FASTA to the
amplicon_file.txt
format that pTrimmer needs?