Output replaced by N's - Githubissues

MNM-TB commented 3 years ago

Hi!

Thank you for the great work on this tool. I'm evaluating it for extracting molecular barcodes from NGS amplicon sequencing. In general it looks very promising, but I get one strange outcome: for many of my reads, the trimmed sequence is replaced with N's. Below are two consecutive reads, one that is trimmed as expected and one that is replaced with N's. (For your information, the first 20 bases of every read are a random sequence in the primer, which allows for high diversity and clean clustering on the Illumina NextSeq.)

Input:

@NB502004:151:H7VF2BGXH:1:11101:7494:1075 1:N:0:ATCTCAGG+NTCCTTAC
TTTCGGGGTGTCTATACCCCCATTTCAGGTGTCGTGACCATAAAGGCATCCTTCCAGCTCGACGGCTACGTCAA
+
AAAAAEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEE<EEEEEEEEEAEEEEEEAEEEEE/EEEE<6EEEEAE
@NB502004:151:H7VF2BGXH:1:11101:10444:1078 1:N:0:ATCTCAGG+NTCCTTAC
AGATGGCGAGTTGTAAGGGCCATTTCAGGTGTCGTGATTTTCATTAGATCTGTGTGTTGGCTGTCTCTTATACAC
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE/6EEEEAEEEEEEEEEEEE<</EEEEEAEEEEEEEEAA

Output:


@NB502004:151:H7VF2BGXH:1:11101:7494:1075 1:N:0:ATCTCAGG+NTCCTTAC
ACCATAAAGGCATCCTTCCAG
+
EE<EEEEEEEEEAEEEEEEAE
@NB502004:151:H7VF2BGXH:1:11101:10444:1078 1:N:0:ATCTCAGG+NTCCTTAC
NNNNNNNNNNNNNNNNNNNN
+
!!!!!!!!!!!!!!!!!!!!

There are two issues here. First, the left primer is set to `CCATTTCAGGTGTCGTGA`, i.e., it ends with an "A". However, the first trimmed read still contains this A, and I see this in many cases: our molecular barcodes come out as 21 bp even though they are 20 bp long.

Second, the other read is replaced entirely by N's.
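The off-by-one in the first issue can be checked by hand. The sketch below is a hypothetical helper, not part of pTrimmer: it assumes an exact forward-primer match (the real tool tolerates mismatches) and simply takes the 20 bases that follow the primer, using the insert length from the amplicon file.

```python
# Hypothetical check, not pTrimmer code: what should the 20-bp insert be?
FORWARD_PRIMER = "CCATTTCAGGTGTCGTGA"
INSERT_LENGTH = 20  # InserLength column of the amplicon file


def expected_insert(read, primer=FORWARD_PRIMER, insert_len=INSERT_LENGTH):
    """Return the insert_len bases after the first exact primer match, or None."""
    start = read.find(primer)
    if start == -1:
        return None
    end = start + len(primer)
    return read[end:end + insert_len]


# First example read and the sequence pTrimmer reported for it.
read_1 = "TTTCGGGGTGTCTATACCCCCATTTCAGGTGTCGTGACCATAAAGGCATCCTTCCAGCTCGACGGCTACGTCAA"
observed = "ACCATAAAGGCATCCTTCCAG"

insert = expected_insert(read_1)
print(insert, len(insert))      # the 20-bp insert
print(observed[1:] == insert)   # True: the output carries one extra leading base
```

For the first read, the reported sequence equals the expected insert with the primer's final A prepended, which matches the 21 bp versus 20 bp discrepancy described above.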

We use the following parameters: `--seqtype single --mismatch 3 --kmer 8 --minqual 20`, with version V1.3.3 of pTrimmer.

Thank you for the help!

MNM-TB commented 3 years ago

The amplicon file I use for matching is:

#FowrdPrim  ReversePrim InserLength AuxInfo
CCATTTCAGGTGTCGTGA  TTATTGACGTAGCCGTCGAG    20  LV-Lib1BC_rev

XLZH commented 3 years ago

The primer sequence is usually at the beginning of the read. In your case, however, each read begins with a 20-bp random barcode sequence, not the primer sequence, so pTrimmer fails to trim the primer from many of your reads.

To make pTrimmer compatible with your sequencing data, you need to modify a variable in the code (query.c, line 8):

  #define BLEN 6  --->  #define BLEN 30

Then recompile the code.
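Before picking a value for `BLEN`, it may help to measure where the forward primer actually starts in the reads. The sketch below is a rough illustration with assumptions: `best_primer_hit` is a hypothetical helper, a plain Hamming-distance scan stands in for pTrimmer's k-mer based search, and it assumes `BLEN` bounds how many leading bases may precede the primer (the reply above does not spell out its exact meaning).

```python
# Hypothetical helper, not pTrimmer code: find where the forward primer starts.
FORWARD_PRIMER = "CCATTTCAGGTGTCGTGA"


def best_primer_hit(read, primer=FORWARD_PRIMER, max_mismatches=3):
    """Return (offset, mismatches) of the best window within max_mismatches, or None.

    offset is 0-based; ties are resolved in favour of the leftmost window.
    """
    best = None
    for offset in range(len(read) - len(primer) + 1):
        window = read[offset:offset + len(primer)]
        mismatches = sum(a != b for a, b in zip(window, primer))
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (offset, mismatches)
    return best


reads = [
    "TTTCGGGGTGTCTATACCCCCATTTCAGGTGTCGTGACCATAAAGGCATCCTTCCAGCTCGACGGCTACGTCAA",
    "AGATGGCGAGTTGTAAGGGCCATTTCAGGTGTCGTGATTTTCATTAGATCTGTGTGTTGGCTGTCTCTTATACAC",
]
for read in reads:
    print(best_primer_hit(read))  # (19, 0) for both example reads
```

In both example reads the primer starts at 0-based offset 19, which is beyond 6 and within 30. Running the same scan over a larger sample of reads would show whether 30 leaves enough margin.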

As for your second read (NB502004:151:H7VF2BGXH:1:11101:10444:1078), pTrimmer can't locate the reverse primer sequence. That is why it failed to trim the primer sequence and output a series of N's instead.

  @NB502004:151:H7VF2BGXH:1:11101:7494:1075 1:N:0:ATCTCAGG+NTCCTTAC
  TTTCGGGGTGTCTATACCCCCATTTCAGGTGTCGTGACCATAAAGGCATCCTTCCAGCTCGACGGCTACGTCAA
                     CCATTTCAGGTGTCGTGA                    CTCGACGGCTACGTCAATAA

  @NB502004:151:H7VF2BGXH:1:11101:10444:1078 1:N:0:ATCTCAGG+NTCCTTAC
  AGATGGCGAGTTGTAAGGGCCATTTCAGGTGTCGTGATTTTCATTAGATCTGTGTGTTGGCTGTCTCTTATACAC
                     CCATTTCAGGTGTCGTGA

In the first read (7494:1075), pTrimmer can locate both the forward primer and the reverse complement of your reverse primer.
In the second read (10444:1078), it can't locate the reverse complement of your reverse primer.
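The comparison above can be reproduced with a few lines of Python. This is only a sketch under stated assumptions: `find_reverse_primer` and its `min_overlap` parameter are hypothetical, matching is exact, and a truncated primer is accepted only at the very end of the read. pTrimmer's own matching is k-mer based and tolerates mismatches, so its behavior may differ in detail.

```python
# Hypothetical check, not pTrimmer code: look for the reverse primer in a read.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (uppercase A/C/G/T/N)."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_reverse_primer(read, reverse_primer, min_overlap=10):
    """Return (offset, matched_length) for the reverse-complemented primer, or None.

    A full-length exact match is tried first. Otherwise the longest prefix of
    the reverse complement (at least min_overlap bases) that ends the read is used.
    """
    target = reverse_complement(reverse_primer)
    pos = read.find(target)
    if pos != -1:
        return pos, len(target)
    for length in range(len(target) - 1, min_overlap - 1, -1):
        if read.endswith(target[:length]):
            return len(read) - length, length
    return None


REVERSE_PRIMER = "TTATTGACGTAGCCGTCGAG"
read_1 = "TTTCGGGGTGTCTATACCCCCATTTCAGGTGTCGTGACCATAAAGGCATCCTTCCAGCTCGACGGCTACGTCAA"
read_2 = "AGATGGCGAGTTGTAAGGGCCATTTCAGGTGTCGTGATTTTCATTAGATCTGTGTGTTGGCTGTCTCTTATACAC"

print(reverse_complement(REVERSE_PRIMER))           # CTCGACGGCTACGTCAATAA
print(find_reverse_primer(read_1, REVERSE_PRIMER))  # (57, 17): truncated at the read end
print(find_reverse_primer(read_2, REVERSE_PRIMER))  # None
```

In the first read, the first 17 bases of the reverse-complemented primer sit directly after the 20-bp insert and run off the end of the read. The second read contains no such match, which is consistent with the N-masked output.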

DMU-lilab / pTrimmer

Output replaced by N's #10