epi2me-labs / pychopper

cDNA read preprocessing
Other
61 stars 9 forks source link

length filtering not working #33

Closed alexyfyf closed 1 year ago

alexyfyf commented 1 year ago

Hi team,

I found the -z length filtering is not working as expected. I have run some data using command

pychopper -t $num_cores -Q 7 -z 50 \
    -r ${base}_report.pdf -S ${base}_statistics.tsv \
    -w ${base}_rescued.fastq \
    ${raw_reads} ${base}_fulllength.fastq

Bu the resulting fulllength.fastq still contain reads less than 50bp, some even 1 bp

file                                              format  type  num_seqs      sum_len  min_len  avg_len  max_len
SGNex_A549_cDNA_replicate1_run2_fulllength.fastq  FASTQ   DNA    999,113  677,733,559        1    678.3    8,138

An example read from fulllength.fastq file is this

@908:909|5b4d0fcf-b6da-453a-ace1-acb40f9af935 runid=66f131dcf22294c54437ce411861e84a8335277c read=15350 ch=51 start_time=2018-06-29T10:56:22Z strand=-
C
+
"

I see the pychopper log also reporting in the statistic tsv file

ReadStats   LenFail 0.0

and stdout log

Finished processing file: /home/users/allstaff/yan.a/lab_davidson/yan.a/dataset/sgnex/fastq_ont/SGNex_A549_cDNA_replicate1_run2/SGNex_A549_cDNA_replicate1_run2.fastq.gz
Input reads failing mean quality filter (Q < 7.0): 375012 (13.11%)
Output fragments failing length filter (length < 50): 0

Both seems suggesting the filter is not working as expected.

Could you have a look at this issue?

Cheers, Alex

nrhorner commented 1 year ago

Hi @alexyfyf

It appears that length filtering with -z only works if you also use -l <path_to_len_filtered.fastq> Will make it so -l is not required for the next release.

Cheers,

Neil

alexyfyf commented 1 year ago

Thank you Neil. If I apply -l I can see the filtering been done, but the filtered out number is not correct. The code I use is

pychopper -t $num_cores -Q 7 -z $zvalue \
    -r ${base}_report.pdf -S ${base}_statistics.tsv \
    -l ${base}_lengthfail.fastq \
    -w ${base}_rescued.fastq \
    ${raw_reads} ${base}_fulllength.fastq

For example if I gave -Q 7 -z 50 I have

Finished processing file: /home/users/fastq_ont/SGNex_K562_directcDNA_replicate2_run2/SGNex_K562_directcDNA_replicate2_run2.fastq.gz
Input reads failing mean quality filter (Q < 7.0): 127132 (45.37%)
Output fragments failing length filter (length < 50): 4862

and if I gave -Q 7 -z 150 I have

Finished processing file: /home/users/fastq_ont/SGNex_K562_directcDNA_replicate2_run2/SGNex_K562_directcDNA_replicate2_run2.fastq.gz
Input reads failing mean quality filter (Q < 7.0): 222838 (79.53%)
Output fragments failing length filter (length < 150): 3428

These are running on exactly same input. I also manually calculated the quality score with seqkit and found indeed 222838 sequences have q<7.

seqkit seq --max-qual 7 $raw_reads | seqkit fx2tab | wc -l
222838

But I'm not sure how the first result came out.

Could you have a look?

Thank you so much.

Cheers, Alex

nrhorner commented 1 year ago

@alexyfyf

A bug was recently introduce that calculated mean quality scores of the input reads incorrectly. This has now been rectified in V2.7.7

Please get in touch again if you have any more problems.

Thanks,

Neil