cihga39871 / Atria

An accurate and ultra-fast adapter and quality trimming program for Illumina Next-Generation Sequencing (NGS) data.

Metrics for numbers (or percentages) of reads trimmed #10

Closed kalavattam closed 1 year ago

kalavattam commented 1 year ago

Hi, thank you for the very useful and fast-performing tool. I am running it now and examining the output; I am confused as to where I can find metrics on the trimming and processing of the reads—for example, the numbers/percentages of reads trimmed, etc. This information is not in the *.log and *.log.json files. I am running the tool with non-simulated, "real" fastq files from different NGS experiments.

I invoke atria like this:

atria \
    -t "${threads}" \
    -r "${r1_pro}" \
    -R "${r3_pro}" \
    -o "${outdir}" \
    --no-length-filtration

However, do I need to include the argument --stats to see this information? For example,

atria \
    -t "${threads}" \
    -r "${r1_pro}" \
    -R "${r3_pro}" \
    -o "${outdir}" \
    --no-length-filtration \
    --stats

The documentation for --stats is confusing:

--stats               (DEV ONLY) write stats to description lines of
                      r2 reads.

Reading this, it's not clear to me that --stats will give me metrics regarding the numbers/percentages of reads subjected to trimming, quality processing, etc.

In the program, I see utilities for benchmarking the tool with simulated reads, but I need metrics for what the tool is doing to my real data.

Thanks,
Kris

cihga39871 commented 1 year ago

Thank you for your interest in Atria.

The *.log.json file has the counts of "good-read-pairs" and "total-read-pairs".
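As a rough illustration of using those two counts, the sketch below computes the percentage of read pairs retained. Only the key names "good-read-pairs" and "total-read-pairs" come from the comment above. Where they sit inside the JSON is not stated here, so the (hypothetical) helper `find_key` simply searches the whole document for them.

```python
import json


def find_key(obj, key):
    """Depth-first search for the first occurrence of `key` in nested JSON data.

    Assumption: the nesting of the log file is unknown, so we search everywhere.
    """
    if isinstance(obj, dict):
        if key in obj:
            return obj[key]
        children = list(obj.values())
    elif isinstance(obj, list):
        children = obj
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def pair_retention(log):
    """Return (good_pairs, total_pairs, percent_good) from a parsed log."""
    good = find_key(log, "good-read-pairs")
    total = find_key(log, "total-read-pairs")
    if good is None or total is None:
        raise KeyError("expected read-pair counts were not found in the log")
    good, total = int(good), int(total)
    percent = 100.0 * good / total if total else 0.0
    return good, total, percent


def pair_retention_from_file(path):
    """Convenience wrapper: load a *.log.json file and summarize it."""
    with open(path) as handle:
        return pair_retention(json.load(handle))
```

Calling `pair_retention_from_file` on a `*.log.json` path returns the two counts and the percentage of pairs kept. This covers pairs only and says nothing about how many bases were trimmed.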

Atria does not currently output a detailed summary of trimming statistics. However, --stats writes some per-read metrics to the description lines of the Read 2 output (the third line of each FASTQ record, which starts with +):

The fields are tab-delimited ('\t'):

Res
$r12_trim   # is adapter trimmed (true/false)
$(length(r1.seq))   # length of r1 after adapter trimming. If the length is different from the output, quality trimming is performed.
$(length(r2.seq))   # length of r2 after adapter trimming. If the length is different from the output, quality trimming is performed.
|R1          # the following stats are for development only.
$r1_insert_size
$r1_adapter_score
$r1_insert_size_pe
$r1_pe_score
|R2
$r2_insert_size
$r2_adapter_score
$r2_insert_size_pe
$r2_pe_score
|prob
$r1_adapter_prob
$r2_adapter_prob
$r1_pe_prob
$r2_pe_prob
$r1_head_prob
$r2_head_prob

A small script can be used to process these fields.
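One possible shape for such a script is sketched below. It assumes the layout described above: tab-delimited fields on the + line of each Read 2 record, with a true/false adapter flag followed directly by the R1 and R2 lengths after adapter trimming. Because the exact leading fields are not certain, the parser scans for the first true/false token instead of relying on a fixed column. The function names are hypothetical, and only R2 quality trimming is inferred, since only the R2 output is read.

```python
import gzip
from collections import Counter


def open_text(path):
    """Open a FASTQ file as text, transparently handling .gz files."""
    path = str(path)
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def parse_stats(plus_line):
    """Return (adapter_trimmed, r1_len, r2_len), or None if not recognizable.

    Assumption: the first "true"/"false" field is the adapter flag and the
    next two fields are the R1 and R2 lengths after adapter trimming.
    """
    fields = plus_line.rstrip("\r\n").lstrip("+").split("\t")
    for i, field in enumerate(fields):
        flag = field.strip()
        if flag in ("true", "false"):
            try:
                return flag == "true", int(fields[i + 1]), int(fields[i + 2])
            except (IndexError, ValueError):
                return None
    return None


def summarize_r2(lines):
    """Count adapter-trimmed pairs and quality-trimmed R2 reads.

    `lines` is any iterable of FASTQ lines (four lines per record).
    """
    counts = Counter()
    for _header, seq, plus, _qual in zip(*[iter(lines)] * 4):
        counts["pairs"] += 1
        stats = parse_stats(plus)
        if stats is None:
            counts["unparsed"] += 1
            continue
        adapter_trimmed, _r1_len, r2_len = stats
        counts["adapter_trimmed"] += adapter_trimmed
        # A length after adapter trimming that differs from the final
        # sequence length indicates that quality trimming was performed.
        counts["r2_quality_trimmed"] += r2_len != len(seq.rstrip("\r\n"))
    return counts
```

For a real file, `summarize_r2(open_text(path))` can be called on the Read 2 output written with --stats, and each count divided by `counts["pairs"]` to get a percentage. A non-zero `unparsed` count suggests the assumed field layout does not match the actual output.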

Another option is to run FastQC and MultiQC on both the raw and the trimmed FASTQ files and compare the two reports.
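The same before/after idea can be sketched without either tool, as a much cruder comparison that counts only reads and bases in the raw and trimmed files. This is not a substitute for a FastQC or MultiQC report, and the function names are hypothetical.

```python
def fastq_summary(lines):
    """Return (read_count, base_count) for an iterable of FASTQ lines.

    Assumes standard four-line records; the sequence is the second line.
    """
    reads = bases = 0
    for i, line in enumerate(lines):
        if i % 4 == 1:
            reads += 1
            bases += len(line.rstrip("\r\n"))
    return reads, bases


def compare(raw_lines, trimmed_lines):
    """Summarize how many reads and bases were removed by trimming."""
    raw_reads, raw_bases = fastq_summary(raw_lines)
    trimmed_reads, trimmed_bases = fastq_summary(trimmed_lines)
    removed = raw_bases - trimmed_bases
    return {
        "reads_removed": raw_reads - trimmed_reads,
        "bases_removed": removed,
        "pct_bases_removed": 100.0 * removed / raw_bases if raw_bases else 0.0,
    }
```

Both arguments can be open file handles, for example from `open(path)` or, for compressed files, `gzip.open(path, "rt")`.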

kalavattam commented 1 year ago

Thank you for the quick response. Following your advice and suggestions, I can take some measurements of the adapter and quality processing. Thanks for making and maintaining this great tool. Will close the issue now.