Conversion from BED + reference genome sequence to amplicon_file.txt format?

niemasd commented 3 years ago

I'm very interested in using pTrimmer for my trimming needs for high-throughput SARS-CoV-2 amplicon sequencing, but we were given the primer information as a BED file with respect to the reference SARS-CoV-2 genome. Is there any way to easily convert between BED + reference genome FASTA to the amplicon_file.txt format that pTrimmer needs?

XLZH commented 3 years ago

Thanks for your testing of pTrimmer!

Could you please provide a little bit of primer information (BED format) and corresponding reads, so that I can figure out what to do then?

Xiaolong Zhang

在 2021年6月3日，22:58，Niema Moshiri @.***> 写道：

I'm very interested in using pTrimmer for my trimming needs for high-throughput SARS-CoV-2 amplicon sequencing, but we were given the primer information as a BED file with respect to the reference SARS-CoV-2 genome. Is there any way to easily convert between BED + reference genome FASTA to the amplicon_file.txt format that pTrimmer needs?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or unsubscribe.

niemasd commented 3 years ago

Thanks for the response! Here's a directory with some example files:

https://github.com/niemasd/ViReflow/tree/main/demo

NC_045512.2.fas is the reference genome sequence
sarscov2_v2_primers_swift.bed is the BED file
test_R1.fastq is the read1 FASTQ
test_R2.fastq is the read2 FASTQ

XLZH commented 3 years ago

I have checked your BED file and FASTQ files, but I found that: (1) It's hard to construct amplicon_file.txt with the given BED file, because it's not easy to determine which two primer sequence belong to a primer pair. (2) I am not sure whether there is something wrong with the primer file (sarscov2_v2_primers_swift.bed), following is an example

# sarscov2_v2_primers_swift.bed
NC_045512.2     182     200     covid19genome_200-29703_s7490_U_88F
NC_045512.2     301     322     covid19genome_200-29703_s7490_U_88R

# construct amplicon.txt with the above primer record
CTGCAGGCTGCTTACGGTT     CACGTCCAACTCAGTTTGCCTG      100

# test_R1.fastq
@A00953:232:HY3GLDRXX:1:2101:22480:18505 1:N:0:CCATACCGCA+TTATCCGCCG
CTGCAGGCTGCTTACGGTTTCGTCCGTGTTGCAGCCGATCATCAGCACATCTAGGTTTTGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGCCTAGATCGGAAGA
CTGCAGGCTGCTTACGGTT                                                                                                    CACGTCCAACTCAGTTTGCCTG

Assume that s7490_U_88F and s7490_U_88R is forward and reverse primer sequence, we can extract the primer sequence from FASTA file and get the insert length: 301-200+1=100. However, it is strange that the forward primer sequence (CACGTCCAACTCAGTTTGCCTG) is not 'reverse complement' in your read sequence (because 151nt > 100nt)

Therefore, I hope you can double check your BED file or figure out whether the amplicon is in the condition of 'read-through'

niemasd commented 3 years ago

Thank you for investigating! I reached out to my sequencing collaborators to double check that my information is correct.

Regarding the BED file, we are sure it is correct, as it was given to us directly from the sequencing protocol company (Swift BioSciences), and it has been working well using iVar Trim to trim

I believe you are correct that ...F denotes the forward primer sequence and that ...R denotes the reverse primer sequence. I'm not sure why the reverse primer sequence is not the reverse complement, but I imagine they designed it that way to make trimming easier(?), or at least that's what my sequencing collaborators suggested

XLZH commented 3 years ago

Yeah, you are right! The BED file is correct. Because the primer record with a marker of 'R' should be convert to reverse complement sequence. Therefore, the correct amplicon record should be:

# construct amplicon.txt with the above primer record
CTGCAGGCTGCTTACGGTT     CAGGCAAACTGAGTTGGACGTG    102    covid19genome_200-29703_s7490_U_88

Then, I write a Python script (Bed2Amplicon.py) to convert BED file to Amplicon file (input of pTrimmer)

Attention:
    (1) index file of reference.fasta.fai could be obtained with 'samtools faidx'
    (2) coordinate should be 0-based reference position (like your BED file)
    (3) a marker of 'R' (reverse primer) or 'F' (forward primer) is necessary in the end of 4-th field (also like your BED file)

By the way, I run with your FASTQ file and found only 80% of the reads are properly trimmed. For the untrimmed reads, most of them have a tail of 'GGGGG...'. In addition, part of the read sequence is different from reference genome.

niemasd commented 3 years ago

Awesome, thank you so much! I'm excited to try this out!

Regarding the read trimming issues, thanks for looking into that! We're in the process of trying to optimize our sequencing workflow (which is why I'm interested in pTrimmer), so we're exploring strange things like that to see what's causing them and how to fix them. I appreciate the help!

I made a small update to the Bed2Amplicon.py script's shebang so that it supports a python executable anywhere in the PATH (rather than specifically in /usr/bin/python):

https://github.com/DMU-lilab/pTrimmer/pull/16

DMU-lilab / pTrimmer

Conversion from BED + reference genome sequence to amplicon_file.txt format? #15