NBISweden / IgDiscover-legacy

Analyze antibody repertoires and discover new V genes from high-throughput sequencing reads
https://www.igdiscover.se
MIT License

Collect more statistics #84

Closed by marcelm 6 years ago

marcelm commented 6 years ago

We need these numbers (some of them may already exist, need to check):

marcelm commented 6 years ago

A couple of notes.

Files created during preprocessing

  1. The raw reads are in reads.[12].fastq.gz
  2. The raw reads are truncated to the number of sequences specified by limit: in the configuration, and the result is stored in reads/1-limited.[12].fastq.gz
  3. Paired-end reads are merged and written to reads/2-merged.fastq.gz. Statistics of how many read pairs went into the merging and how many could be merged are written to stats/reads.json
  4. Forward primers are trimmed and resulting sequences are written to reads/3-forward-primer-trimmed
  5. Reverse primers are trimmed and resulting sequences are written to reads/4-trimmed. Stats about trimming are written to stats/trimmed.json
  6. a) If barcode grouping and consensus taking are enabled, sequences are grouped by barcode/CDR3 (this implicitly removes duplicates), sequences that are too short are removed, and the result is written to reads/sequences.fasta.gz (this is done by igdiscover group)
     b) Alternatively, if consensus taking is not enabled, the barcode sequences are merely removed, duplicate sequences are collapsed, sequences that are too short are removed, and the result is written to reads/sequences.fasta.gz (this is done by igdiscover dereplicate)
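
For illustration, here is a rough sketch of how the number of sequences left after some of these steps could be counted directly from the files listed above. It is not how IgDiscover itself computes its statistics. The function names and stage labels are made up for this example, only the first read file of each pair is counted, and the two primer-trimmed files are left out because their full file names are not spelled out in this thread.

```python
import gzip
from pathlib import Path


def count_fastq_records(path):
    """Count records in a gzipped FASTQ file, assuming four lines per record."""
    with gzip.open(path, "rt") as f:
        n_lines = sum(1 for _ in f)
    return n_lines // 4


def count_fasta_records(path):
    """Count records in a gzipped FASTA file by counting header lines."""
    with gzip.open(path, "rt") as f:
        return sum(1 for line in f if line.startswith(">"))


def collect_read_counts(analysis_dir):
    """Return {stage label: record count} for every stage file that exists."""
    base = Path(analysis_dir)
    # File paths come from the list above; the stage labels are hypothetical.
    stages = [
        ("raw", base / "reads.1.fastq.gz", count_fastq_records),
        ("limited", base / "reads" / "1-limited.1.fastq.gz", count_fastq_records),
        ("merged", base / "reads" / "2-merged.fastq.gz", count_fastq_records),
        ("final", base / "reads" / "sequences.fasta.gz", count_fasta_records),
    ]
    return {label: counter(path) for label, path, counter in stages if path.exists()}
```

Calling `collect_read_counts(".")` from within an analysis directory would return a dictionary with one entry per file that is present, which makes it easy to see at which step sequences are lost.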

Statistics that we already have

The other numbers are iteration specific.

marcelm commented 6 years ago

Implemented now. The file stats/stats.json contains all the statistics.
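
A minimal sketch for inspecting that file follows. Since the key names inside stats/stats.json are not listed in this thread, the code assumes nothing about them: it just flattens whatever nested dictionaries it finds into dotted key paths. The helper names `flatten` and `load_stats` are made up for this example.

```python
import json


def flatten(obj, prefix=""):
    """Yield (dotted key path, value) pairs from arbitrarily nested dicts."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from flatten(value, f"{prefix}{key}.")
    else:
        # Drop the trailing dot that was added for the last key.
        yield prefix[:-1], obj


def load_stats(path="stats/stats.json"):
    """Load a JSON statistics file and return it as a flat dictionary."""
    with open(path) as f:
        return dict(flatten(json.load(f)))
```

Printing the result with `for key, value in load_stats().items(): print(key, value)` gives one statistic per line regardless of how the file is nested.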