Add option for TSV output

whaleyr commented 2 years ago

Add TSV output as an option for Phenotyper and Reporter (perhaps NamedAlleleMatcher too?).

What do we want in the TSV format?
Does this need to anticipate multi-sample runs?
How will this work with warnings/messages/caveats?

This was brought up in group discussion and in issue #85.

katrinsangkuhl commented 2 years ago

Update on the batch mode discussion (meeting notes from the Biobank data analyses call, 12/13/21). The discussion was about the matcher and phenotype output.

- CSV/TSV output format combining the results from all the separate PharmCAT runs
- Proposed format: index by sample, with the different genotypes by gene in one file and the phenotypes in a second file; an additional log file would include warnings (needs further discussion: output after each of the matcher and phenotyper is possible)
- Could be provided as additional template scripts on the PharmCAT GitHub
  - GitHub wiki to document uses of the template scripts
  - We should not be responsible for maintaining those scripts
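To make the proposed layout concrete, here is a minimal sketch of the "index by sample, genotypes by gene" file. It assumes the per-sample calls have already been collected into a dictionary; the function name `genotypes_by_sample` and the input shape are hypothetical and are not part of PharmCAT.

```python
import csv
import io


def genotypes_by_sample(calls):
    """Build one TSV with a row per sample and a column per gene.

    `calls` is a hypothetical mapping: sample ID -> {gene: diplotype}.
    Genes that were not called for a sample are left as empty cells.
    """
    genes = sorted({gene for per_gene in calls.values() for gene in per_gene})
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["Sample ID"] + genes)
    for sample in sorted(calls):
        writer.writerow([sample] + [calls[sample].get(gene, "") for gene in genes])
    return out.getvalue()


example_calls = {
    "S1": {"CYP2C19": "*1/*2"},
    "S2": {"CYP2C19": "*1/*1", "CYP2C9": "*1/*3"},
}
combined = genotypes_by_sample(example_calls)
```

The phenotype file could follow the same shape with phenotypes in place of diplotypes. Warnings do not fit this layout, which is why they would go to a separate log file.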

whaleyr commented 2 years ago

After internal discussion we decided to close this issue about TSV output from the reporter.

The data that comes out of the reporter is quite large and complicated. Showing only the small portion that appears in the first table of the report glosses over a lot of the complexity and documentation that people should know when interpreting the results. We feel it would be a disservice to the user to have an option that discards all that information.

BinglanLi commented 2 years ago

I am reopening this issue after the discussion of reporting a TSV to assist large-scale data analysis. The aim is not to generate a TSV across all samples of interest, as we previously discussed, but to extract the PGx inferences of a single sample.

The purpose is to help calculate PGx frequencies. I think there should be a warning that this TSV output should not be used as a substitute for the report when interpreting a person's PGx testing results or prescribing recommendations.

There should be different tables for calculating different frequencies (genotypes vs. phenotypes). I think we can use the base file name for the Sample ID below. In addition, information on present and missing variation in the VCF is not listed here, because it is helpful for quality checks but not so much for PGx frequency estimation.

For genotype frequencies, I am thinking about the following content:

Sample ID	Diplotype Index	Diplotype	Haplotype Index	Haplotype	Function
S1	Diplotype 1	*2/*3	Haplotype 1	*2	Poor Function	Multiple Diplotypes
S1	Diplotype 1	*2/*3	Haplotype 2	*3	Poor Function	Multiple Diplotypes
S1	Diplotype 2	*4/*5	Haplotype 1	*4	Normal Function	Multiple Diplotypes
S1	Diplotype 2	*4/*5	Haplotype 2	*5	No Function	Multiple Diplotypes
S1	Diplotype 3	*6/*7	Haplotype 1	*6	Normal Function	Multiple Diplotypes
S1	Diplotype 3	*6/*7	Haplotype 2	*7	No Function	Multiple Diplotypes

Note:

For DPYD, the haplotypes mean the DPYD alleles a person carries.
CYP2C9 rs12777823
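A minimal sketch of how rows in this genotype layout could be produced, one row per haplotype, with the Sample ID taken from the base file name. Everything here is an assumption for illustration: the helper names, the list-of-pairs input, and the suffix handling are hypothetical, and the trailing unlabeled value from the example rows is left out.

```python
import csv
import io
from pathlib import Path

HEADER = ["Sample ID", "Diplotype Index", "Diplotype",
          "Haplotype Index", "Haplotype", "Function"]


def sample_id_from_path(path):
    """Use the base file name as the Sample ID (assumed VCF suffixes are stripped)."""
    name = Path(path).name
    for suffix in (".vcf.gz", ".vcf.bgz", ".vcf"):
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return Path(name).stem


def genotype_rows(sample_id, diplotypes):
    """Expand each diplotype into one row per haplotype.

    `diplotypes` is a hypothetical list of diplotypes, each given as a list
    of (haplotype name, function) pairs.
    """
    rows = []
    for d_idx, haplotypes in enumerate(diplotypes, start=1):
        diplotype = "/".join(name for name, _ in haplotypes)
        for h_idx, (name, function) in enumerate(haplotypes, start=1):
            rows.append([sample_id, f"Diplotype {d_idx}", diplotype,
                         f"Haplotype {h_idx}", name, function])
    return rows


def to_tsv(rows):
    """Serialize the header and rows as tab-separated text."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)
    return out.getvalue()


sample = sample_id_from_path("/data/S1.vcf.gz")
rows = genotype_rows(sample, [
    [("*2", "Poor Function"), ("*3", "Poor Function")],
    [("*4", "Normal Function"), ("*5", "No Function")],
])
tsv_text = to_tsv(rows)
```

Keeping one haplotype per row means allele frequencies can be counted directly from the Haplotype column, while the Diplotype Index groups the rows that belong to the same call.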

For phenotype frequencies, I am thinking about the following content:

Sample ID	Phenotype Index	Phenotype	Diplotype Index	Diplotype	Function
S1	Phenotype 1	Poor Metabolizer	Diplotype 1	*2/*3	Poor Function/Poor Function	Discrepant Phenotypes
S1	Phenotype 2	Intermediate Metabolizer	Diplotype 1	*4/*5	Normal Function/No Function	Discrepant Phenotypes
S1	Phenotype 2	Intermediate Metabolizer	Diplotype 2	*6/*7	Normal Function/No Function	Discrepant Phenotypes

Note:

For DPYD, only report the diplotypes that are used to infer the phenotype
CYP2C9 rs12777823
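As a rough illustration of the intended downstream use, the sketch below estimates phenotype frequencies from rows in this layout after several single-sample files have been concatenated. It assumes a header row with `Sample ID` and `Phenotype` columns, and it makes one arbitrary choice that would need discussion: each phenotype is counted once per sample, so a sample with discrepant phenotypes contributes to every phenotype it lists. The function name is hypothetical.

```python
import csv
import io
from collections import Counter


def phenotype_frequencies(tsv_text):
    """Return, for each phenotype, the fraction of samples that list it."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # De-duplicate: a phenotype repeats when several diplotypes support it.
    seen = {(row["Sample ID"], row["Phenotype"]) for row in reader}
    n_samples = len({sample for sample, _ in seen})
    counts = Counter(phenotype for _, phenotype in seen)
    return {phenotype: count / n_samples for phenotype, count in counts.items()}


example = (
    "Sample ID\tPhenotype Index\tPhenotype\tDiplotype Index\tDiplotype\tFunction\n"
    "S1\tPhenotype 1\tPoor Metabolizer\tDiplotype 1\t*2/*3\tPoor Function/Poor Function\n"
    "S1\tPhenotype 2\tIntermediate Metabolizer\tDiplotype 1\t*4/*5\tNormal Function/No Function\n"
    "S1\tPhenotype 2\tIntermediate Metabolizer\tDiplotype 2\t*6/*7\tNormal Function/No Function\n"
    "S2\tPhenotype 1\tIntermediate Metabolizer\tDiplotype 1\t*4/*5\tNormal Function/No Function\n"
)
freqs = phenotype_frequencies(example)
```

Because of the once-per-sample counting, the fractions can sum to more than 1 when samples have discrepant phenotypes. Excluding such samples instead would be an equally defensible choice.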

PharmGKB / PharmCAT

Add option for TSV output #86