marcelm commented 6 years ago

We need these numbers (some of them may already exist, need to check):

marcelm commented 6 years ago

A couple of notes.

Files created during preprocessing

The raw reads are in reads.[12].fastq.gz
The raw reads are limited to the number of sequences specified by limit: in the configuration, and the result is stored in reads/1-limited.[12].fastq.gz
Paired-end reads are merged and written to reads/2-merged.fastq.gz. Statistics of how many read pairs went into the merging and how many could be merged are written to stats/reads.json
Forward primers are trimmed and resulting sequences are written to reads/3-forward-primer-trimmed
Reverse primers are trimmed and resulting sequences are written to reads/4-trimmed. Stats about trimming are written to stats/trimmed.json
a) If barcode grouping and consensus taking is enabled, sequences are grouped by barcode/CDR3 (this removes duplicates implicitly), too short sequences are removed, and the result is written to reads/sequences.fasta.gz (this is done by igdiscover group)
b) Alternatively, if consensus taking is not enabled, the barcode sequences are simply removed, duplicate sequences are collapsed, too short sequences are removed, and the result is written to reads/sequences.fasta.gz (this is done by igdiscover dereplicate)
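
To make step b) concrete, here is a minimal sketch of what "remove barcodes, collapse duplicates, drop too short sequences" amounts to. It is not the code of igdiscover dereplicate: the function name, the assumption that the barcode is a fixed-length prefix at the 5' end, and both parameters are hypothetical and only serve as illustration.

```python
from collections import Counter


def dereplicate(sequences, barcode_length=0, minimum_length=0):
    """Collapse duplicate sequences after stripping a barcode prefix.

    Assumptions (not taken from igdiscover): the barcode is a fixed-length
    prefix, and "too short" means shorter than minimum_length after the
    barcode has been removed.

    Returns a list of (sequence, count) tuples in order of first appearance.
    """
    counts = Counter()
    for sequence in sequences:
        trimmed = sequence[barcode_length:]  # remove the barcode prefix
        if len(trimmed) < minimum_length:
            continue  # drop sequences that are too short
        counts[trimmed] += 1  # identical sequences collapse into one entry
    return list(counts.items())
```

The counts make it possible to see how many input sequences each collapsed record represents, and entries with a count of 1 are the singletons.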

Statistics that we already have

Number of raw reads is in stats/reads.json
Number of merged reads is in stats/reads.json
Number of sequences after grouping by barcode+CDR3 is in reads/sequences.fasta.gz.log
The number of singletons is in the same file

The other numbers are iteration-specific.

The number of sequences for which V/D/J assignments could be made is the same as the total number of sequences, because IgBLAST always produces some kind of assignment (assigned.tab.gz always has as many rows as there are records in reads/sequences.fasta.gz). This number is the total value in final/stats/assigned.json.
The number of filtered sequences (in filtered.tab.gz) isn’t logged anywhere at the moment
The number of filtered sequences for which a CDR3 could be detected is also not logged.
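
Until these numbers are logged, they can be counted by hand. The sketch below counts the records in a gzipped FASTA file and the rows in a gzipped tab-separated table, including the rows that have a non-empty CDR3. It assumes that the table has a single header line and that the CDR3 column is named CDR3_nt; the column name is an assumption and should be adjusted to whatever the header actually says. The function names are made up for this example.

```python
import csv
import gzip


def count_fasta_records(path):
    """Count records in a gzipped FASTA file (one '>' header line per record)."""
    with gzip.open(path, "rt") as f:
        return sum(1 for line in f if line.startswith(">"))


def count_table_rows(path, cdr3_column="CDR3_nt"):
    """Return (number of rows, number of rows with a non-empty CDR3).

    Assumes a gzipped, tab-separated table whose first line is a header.
    The default column name is an assumption, not taken from the thread.
    """
    total = with_cdr3 = 0
    with gzip.open(path, "rt", newline="") as f:
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            total += 1
            if row.get(cdr3_column):  # empty or missing values count as "no CDR3"
                with_cdr3 += 1
    return total, with_cdr3
```

Calling count_fasta_records on reads/sequences.fasta.gz and count_table_rows on an assigned.tab.gz should give the same total if IgBLAST indeed produced an assignment for every sequence, and running count_table_rows on a filtered.tab.gz gives the two numbers that are not logged yet.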

marcelm commented 6 years ago

Implemented now. The file stats/stats.json contains all the statistics.
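
For a quick look at the contents, something like the following sketch can be used. The key names in stats/stats.json are not listed in this thread, so the code makes no assumption about them: it loads whatever is there and flattens nested objects into dotted names. Both function names are hypothetical helpers, not part of IgDiscover.

```python
import json


def flatten(obj, prefix=""):
    """Flatten nested dicts into a single dict with dotted key names."""
    items = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            items.update(flatten(value, prefix=name + "."))
        else:
            items[name] = value
    return items


def load_stats(path="stats/stats.json"):
    """Load a JSON statistics file and return its values under flattened keys."""
    with open(path) as f:
        return flatten(json.load(f))


def print_stats(path="stats/stats.json"):
    """Print one 'name<TAB>value' line per statistic."""
    for name, value in load_stats(path).items():
        print(name, value, sep="\t")
```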