Open f-huber opened 2 years ago
Hi Florian, here is a detailed description of the data format of the JSON files. For each sample we provide a JSON file with the following data:
Patient's sample format:
id: str, sample name
patient: str, name of patient
cohort: str, cohort of the patient
OS: float, (optional) survival time
PFS: float, (optional) progression-free survival time
status: int, (optional) dead/alive
HLA_genes: list of str, (optional) HLA alleles of the patient
mutations: list
all mutations observed across all samples of the patient, for each mutation report:
id: str
format <chrom>_<position>_<ref_nucleotide>_<alt_nucleotide>
gene: str
gene name
missense: int
1 if missense else 0
e.g.
{
"id": "1_12172228_G_A",
"gene": "TNFRSF8",
"missense": 0
}
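As a rough illustration of the mutation `id` format above, the identifier can be split back into its parts. This is only a sketch: `MutationKey` and `parse_mutation_id` are hypothetical names, not part of the tool, and splitting from the right assumes the reference and alternate alleles never contain an underscore.

```python
from typing import NamedTuple


class MutationKey(NamedTuple):
    """Hypothetical container for the four parts of a mutation id."""
    chrom: str
    position: int
    ref: str
    alt: str


def parse_mutation_id(mutation_id: str) -> MutationKey:
    # Split from the right so that a chromosome name containing "_"
    # would stay in one piece (assumption, not confirmed by the authors).
    chrom, position, ref, alt = mutation_id.rsplit("_", 3)
    return MutationKey(chrom, int(position), ref, alt)


# The example record from the description above.
mutation = {"id": "1_12172228_G_A", "gene": "TNFRSF8", "missense": 0}
key = parse_mutation_id(mutation["id"])
print(key.chrom, key.position, key.ref, key.alt)
```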
neoantigens: list
all neoantigens observed across all samples of the patient, for each neoantigen report:
id: str
format <chrom>_<position>_<ref_nucleotide>_<alt_nucleotide>_<mutated_position>_<peptide_length>_<HLA_allele>
mutation_id: str
HLA_gene_id: int
sequence: str
WT_sequence: str
mutated_position: int
Kd: float
KdWT: float
e.g.
{
"id": "19_44352078_G_A_5_9_C0303",
"mutation_id": "19_44352078_G_A",
"HLA_gene_id": "HLA-C03:03",
"sequence": "KAFSHGYHL",
"WT_sequence": "KAFSRGYHL",
"mutated_position": 5,
"Kd": 29.0,
"KdWT": 30.0
}
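The neoantigen `id` embeds the mutation id, the mutated position, the peptide length, and an HLA token, so a record can be checked for internal consistency. The sketch below uses hypothetical helper names and assumes the last three fields of the `id` never contain an underscore. In the example above, `mutated_position` appears to be 1-based, which is what the comparison with `WT_sequence` assumes.

```python
def split_neoantigen_id(neo_id):
    """Split a neoantigen id into (mutation_id, mutated_position, peptide_length, hla)."""
    # Split from the right: the mutation id itself contains underscores.
    mutation_id, mutated_position, peptide_length, hla = neo_id.rsplit("_", 3)
    return mutation_id, int(mutated_position), int(peptide_length), hla


def differing_positions(sequence, wt_sequence):
    """Return 1-based positions where the mutant and wild-type peptides differ."""
    return [
        i
        for i, (mut, wt) in enumerate(zip(sequence, wt_sequence), start=1)
        if mut != wt
    ]


# The example record from the description above.
neo = {
    "id": "19_44352078_G_A_5_9_C0303",
    "mutation_id": "19_44352078_G_A",
    "HLA_gene_id": "HLA-C03:03",
    "sequence": "KAFSHGYHL",
    "WT_sequence": "KAFSRGYHL",
    "mutated_position": 5,
    "Kd": 29.0,
    "KdWT": 30.0,
}

mutation_id, position, length, hla = split_neoantigen_id(neo["id"])
diffs = differing_positions(neo["sequence"], neo["WT_sequence"])
print(mutation_id, position, length, hla, diffs)
```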
sample_trees: list of trees, tree format described below
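For anyone building these files for new samples, a small check of the top-level fields may help. This is a sketch under the assumption that every field not marked "(optional)" above is required; `missing_required` and `load_patient` are hypothetical helpers, and the record at the bottom contains placeholder values only.

```python
import json

# Assumption: fields not marked "(optional)" in the description are required.
REQUIRED = ("id", "patient", "cohort", "mutations", "neoantigens", "sample_trees")
OPTIONAL = ("OS", "PFS", "status", "HLA_genes")


def missing_required(record):
    """Return the required top-level keys that are absent from a record."""
    return [key for key in REQUIRED if key not in record]


def load_patient(path):
    """Load one sample JSON file and fail early if a required key is missing."""
    with open(path) as handle:
        record = json.load(handle)
    missing = missing_required(record)
    if missing:
        raise ValueError(f"{path}: missing required keys {missing}")
    return record


# Placeholder record (made-up values, for illustration only).
record = json.loads(
    '{"id": "sample_1", "patient": "patient_1", "cohort": "demo",'
    ' "mutations": [], "neoantigens": [], "sample_trees": []}'
)
print(missing_required(record))
print({key: record.get(key) for key in OPTIONAL})
```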
Tree format:
topology: Node
root clone node of the tree, Node format described below
score: float, log-likelihood score (from PhyloWGS)
Node format:
clone_id: int
clone_mutations: list
list of mutation identifiers that originate in that clone, e.g.
["20_16360370_G_C", "2_89869798_C_A", ...]
children: list of children nodes, in Node format
X: float
cellular cancer fraction, CCF
x: float, exclusive frequency (see eq. 8)
new_x: float, frequency if this is a new clone (optional)
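Since each Node nests its children, a tree is easiest to read with a short recursive walk. The sketch below uses a toy tree with made-up `X`, `x`, and `score` values (only the mutation identifiers are taken from the example above); `iter_nodes` is a hypothetical helper, not a function from the tool.

```python
def iter_nodes(node):
    """Yield a node and all of its descendants, parents before children."""
    yield node
    for child in node.get("children", []):
        yield from iter_nodes(child)


# Toy tree in the format described above (numeric values are made up).
tree = {
    "score": -100.0,
    "topology": {
        "clone_id": 0,
        "clone_mutations": ["20_16360370_G_C"],
        "X": 1.0,
        "x": 0.4,
        "children": [
            {
                "clone_id": 1,
                "clone_mutations": ["2_89869798_C_A"],
                "X": 0.6,
                "x": 0.6,
                "children": [],
            }
        ],
    },
}

# Map every mutation to the clone in which it originates.
mutation_to_clone = {
    mutation: node["clone_id"]
    for node in iter_nodes(tree["topology"])
    for mutation in node["clone_mutations"]
}

for node in iter_nodes(tree["topology"]):
    print(node["clone_id"], node["X"], node["x"], node.get("new_x"))
```

Using `node.get("new_x")` keeps the walk working when the optional `new_x` field is absent.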
Hello,
Thank you very much for sharing this code!
I would be interested in testing your code on samples other than the ones published in the paper, and I wonder how to create the patients' clone trees. From my understanding, this requires running PhyloWGS, is that correct? If so, do you have code to share for converting PhyloWGS output to the format required by your tool? Or, alternatively, could you explain how to convert PhyloWGS output?
Could you also please describe what information is stored in the .json files (especially "x", "X" and "new_x" and information included within the annotated version of the files)?
Thank you very much in advance.
With best regards,
Florian