bioinfo-chru-strasbourg / howard

Highly Open Workflow for Annotation & Ranking toward genomic variant Discovery
GNU Affero General Public License v3.0

Manage sample columns in input file #263

Closed antonylebechec closed 2 months ago

antonylebechec commented 2 months ago

To manage sample columns in the input file, i.e. to check whether columns are well-formed according to the 'FORMAT' column, or to force the export of a given list of columns/samples (even if they are not well-formed), a parameter could be added to param.json. These parameters would apply only to VCF output files. Other output formats can include extra columns that are not in VCF format.

A well-formed genotype value corresponds to:

  1. the value matches '^[0-9.]([/|][0-9.])*' (it starts with a GT field, with any number of alleles)
  2. the value contains the genotype annotations:
    • the number of annotations in the FORMAT column equals the number of annotations in the value (e.g. FORMAT GT:AD:DP:GQ with values 0/1:525,204:729:99, ./.:525,204:729:99, .:525,204:729:99, 0|1:525,204:729:99 or 0/1/2:525,204:729:99)
    • OR the value matches ^[.]([/|][.])*$ (no genotype)
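
The rules above can be sketched in Python as follows. This is only an illustration, not howard's actual implementation: the function name `is_well_formed` is hypothetical, and it assumes that rule 1 and rule 2 must both hold, with the two bullets of rule 2 being alternatives. The two regular expressions are copied verbatim from the list.

```python
import re

# Rule 1: the value starts with a GT field (pattern taken verbatim from the issue).
GT_PREFIX = re.compile(r"^[0-9.]([/|][0-9.])*")
# Rule 2, second bullet: the value is a "no genotype" placeholder such as ".", "./." or ".|.".
NO_GENOTYPE = re.compile(r"^[.]([/|][.])*$")


def is_well_formed(format_value: str, sample_value: str) -> bool:
    """Hypothetical helper: apply the two rules to one sample value."""
    # Rule 1 must hold in every case.
    if not GT_PREFIX.match(sample_value):
        return False
    # Rule 2, second bullet: a bare "no genotype" value is accepted as-is.
    if NO_GENOTYPE.match(sample_value):
        return True
    # Rule 2, first bullet: same number of ':'-separated annotations as FORMAT.
    return len(sample_value.split(":")) == len(format_value.split(":"))


if __name__ == "__main__":
    for value in ["0/1:525,204:729:99", "./.", "0/1", "not_a_genotype"]:
        print(value, is_well_formed("GT:AD:DP:GQ", value))
```

Under these assumptions, a value such as `0/1` alone is rejected for FORMAT `GT:AD:DP:GQ`, because it is neither a "no genotype" placeholder nor a value with four annotations.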
antonylebechec commented 2 months ago

Parameters to filter samples in a VCF file.

Parameters file 'param.samples_filter.json':

{
  "samples": {
    "list": ["sample1", "sample2"]
  }
}

Command to export/convert into VCF file:

howard convert --input='tests/data/example.vcf.gz' --output='/tmp/example.filtered.vcf' --param='config/param.samples_filter.json'
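
As a rough illustration of what such a filter implies for the output header, here is a minimal Python sketch. It is not howard's code: `select_columns` is a hypothetical helper, and it assumes that the nine fixed VCF columns are always kept and that only samples named in `samples.list` are exported, in header order.

```python
import json

# The nine fixed columns of a VCF body that carries genotype data.
FIXED_VCF_COLUMNS = [
    "#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT",
]


def select_columns(header, param):
    """Hypothetical helper: return the header columns to export."""
    fixed = [col for col in header if col in FIXED_VCF_COLUMNS]
    samples = [col for col in header if col not in FIXED_VCF_COLUMNS]
    wanted = param.get("samples", {}).get("list")
    if wanted is None:
        # Assumption: without a sample list, every sample column is kept.
        return fixed + samples
    return fixed + [sample for sample in samples if sample in wanted]


# Same content as the parameters file shown above.
param = json.loads('{"samples": {"list": ["sample1", "sample2"]}}')
header = (
    "#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT "
    "sample1 sample2 sample3 sample4"
).split()
kept = select_columns(header, param)
print(kept)
```

With this parameter, the sketch keeps the fixed columns plus `sample1` and `sample2`, and drops `sample3` and `sample4`.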
antonylebechec commented 2 months ago

As an example, the following VCF file contains only allowed genotype formats (see example.with_allowed_genotypes.vcf):

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  sample1 sample2 sample3 sample4
chr1    28736   .   A   C   100 PASS    CLNSIG=pathogenic   GT:AD:DP:GQ 0/1:525,204:729:99  0/1:12659,4994:17664:99 1:2:3:4 0/1|2:401,175:576:99
chr1    35144   .   A   C   100 PASS    CLNSIG=non-pathogenic   GT:AD:DP:GQ ./. 0/1:12659,4994:17664:99 0:1:2:3 0/1:401,175:576:99
chr1    69101   .   A   G   100 PASS    DP=50   GT:AD:DP:GQ 0/1:525,204:729:99  ./.:.:.:.   .|. 0/1:401,175:576:99
chr1    768251  .   A   G   100 PASS    .   GT:AD:DP:GQ 0/1:525,204:729:99  ./.:.:.:.   .:1:2:3 0/1:401,175:576:99
chr1    768252  .   A   G   100 PASS    .   GT:AD:DP:GQ 0/1:525,204:729:99  ./.:.:.:.   ././.   0/1:401,175:576:99
chr1    768253  .   A   G   100 PASS    .   GT:AD:DP:GQ 0/1:525,204:729:99  ./. .   0/1:401,175:576:99
chr7    55249063    rs1050171   G   A   5777    PASS    DP=125  GT:AD:DP:GQ 0/1:525,204:729:99  0/1:12659,4994:17664:99 .|.:.:.:.   0/1:401,175:576:99
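
To see why every sample column in this example counts as well-formed, the check can be applied row by row. The sketch below is illustrative only and rests on assumptions: the helper names are hypothetical, the two rules from the first comment are combined with AND, and a sample column is treated as well-formed when all of its values are. Three rows are copied from the example, with whitespace-separated fields for readability.

```python
import re

GT_PREFIX = re.compile(r"^[0-9.]([/|][0-9.])*")   # value starts with a GT field
NO_GENOTYPE = re.compile(r"^[.]([/|][.])*$")      # ".", "./.", ".|.", "././." ...


def is_well_formed(format_value, sample_value):
    """Hypothetical check of one sample value against its FORMAT."""
    if not GT_PREFIX.match(sample_value):
        return False
    if NO_GENOTYPE.match(sample_value):
        return True
    return len(sample_value.split(":")) == len(format_value.split(":"))


HEADER = (
    "#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT "
    "sample1 sample2 sample3 sample4"
).split()
ROWS = [
    "chr1 28736 . A C 100 PASS CLNSIG=pathogenic GT:AD:DP:GQ "
    "0/1:525,204:729:99 0/1:12659,4994:17664:99 1:2:3:4 0/1|2:401,175:576:99",
    "chr1 35144 . A C 100 PASS CLNSIG=non-pathogenic GT:AD:DP:GQ "
    "./. 0/1:12659,4994:17664:99 0:1:2:3 0/1:401,175:576:99",
    "chr1 768253 . A G 100 PASS . GT:AD:DP:GQ "
    "0/1:525,204:729:99 ./. . 0/1:401,175:576:99",
]


def well_formed_samples(header, rows):
    """Map each sample name to True if all of its values pass the check."""
    first_sample = header.index("FORMAT") + 1
    status = {name: True for name in header[first_sample:]}
    for row in rows:
        fields = row.split()
        format_value = fields[first_sample - 1]
        for name, value in zip(header[first_sample:], fields[first_sample:]):
            if not is_well_formed(format_value, value):
                status[name] = False
    return status


result = well_formed_samples(HEADER, ROWS)
print(result)
```

Values such as `./.` or `.` pass even though they carry fewer annotations than FORMAT, because they fall under the "no genotype" alternative.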