DMU-lilab / pTrimmer

Used to trim off the primer sequence from mutiplex amplicon sequencing
GNU General Public License v3.0
21 stars 5 forks source link

zero-nucleotide trimmed reads are included in the trimmed FASTQ file. #11

Closed MNM-TB closed 3 years ago

MNM-TB commented 3 years ago

I have an issue which I don't know how to get around.

I'm trying to utilize the pTrimmer trim 20bp molecules barcodes out from NGS sequenced amplicons.

The amplicon file I use looks like this:

#FowrdPrim  ReversePrim InserLength AuxInfo
CCATTTCAGGTGTCGTGA  TAAAGATCTTTTATTGACGTAGCCGTCGAG  20  LV-Lib1BC_rev

In a very small fraction f our amplicons, we lack an inserted barcode. The alignment of those two "primer" sequences results in a 0-bp read. This is however still included in the output file but then just like this:

@NB502004:151:H7VF2BGXH:1:12301:2621:3426 1:N:0:ATCTCAGG+CTCCTTAC
+

All my downstream tools choke on such lines, many just crashes.

I'm wondering how to discard all 0-bp reads. I would also wish to be able to only include those in a length range e.g., 18-22 bases. Can this be possible?

Thanks again for a great tool!

XLZH commented 3 years ago

For your purpose, you can modify the code to add the filter rules.

query.c/PrimTrim [line 304]

void PrimTrim(fastq_t *fq, query_t *Q, arginfo_t *arg)
{
    read_t *read;
    char *seq, *qual;
    int discard=0;

    read = &fq->cache[fq->bufnum++];
    seq = fq->read.seq; qual = fq->read.qual;

    strcpy(read->name, fq->read.name); 
    strcpy(read->mark, fq->read.mark);

    if (Q->isfind) { // find the primer sequence 
        strcpy(read->seq, seq+Q->pstart);
        strcpy(read->qual, qual+Q->pstart);
        if (Q->pend) {
            /* read through */
            int inlen = Q->pend - Q->pstart +1;
            strcpy(read->seq+inlen, "\n");
            strcpy(read->qual+inlen, "\n");
        }
        float q = MeanQuality(read->qual, arg->phred);
        if (q < arg->args->minqual) {
            Q->badqual = 1;
            if (arg->args->keep) {
                strcpy(read->seq, fq->read.seq);
                strcpy(read->qual, fq->read.qual);
            }
            else discard = 1;
        }
        if (strlen(read->seq) < 19 || strlen(read->seq) > 23) discard = 1; // ADD THIS LINE
    }
    else { // can't find the primer sequence
        if (arg->args->keep) {
            strcpy(read->seq, fq->read.seq);
            strcpy(read->qual, fq->read.qual);
        }
        else discard = 1;
    }
    if (discard) {
        strcpy(read->seq, "NNNNNNNNNNNNNNNNNNNN\n");
        strcpy(read->qual, "!!!!!!!!!!!!!!!!!!!!\n");
    }

    return ;
}

Then recompile the code!