LukszaLab / NeoantigenEditing


Creation of patient tree clones #1

Open f-huber opened 1 year ago

f-huber commented 1 year ago

Hello,

Thank you very much for sharing this code!

I would be interested in testing your code on samples other than the ones published in the paper, and I wonder how to create the patient tree clones. From my understanding, this requires running PhyloWGS, is that correct? If so, do you have code to share for converting PhyloWGS output to the format required by your tool? Or, alternatively, could you explain how to convert PhyloWGS output?

Could you also please describe what information is stored in the .json files (especially "x", "X" and "new_x", as well as the information included in the annotated version of the files)?

Thank you very much in advance.

With best regards,

Florian

mluksza commented 1 year ago

Hi Florian, here is a detailed description of the data format of the JSON files. For each sample, we provide a JSON file with the following data:

Patient's sample format:

id: str, sample name

patient: str, name of patient

cohort: str, cohort of the patient

OS: float, survival time (optional)

PFS: float, progression-free survival time (optional)

status: int, dead/alive (optional)

HLA_genes: list of str, HLA alleles of the patient (optional)

mutations: list

     all mutations observed across all samples of the patient; for each mutation, report:

       id: str
           format <chrom>_<position>_<ref_nucleotide>_<alt_nucleotide>

       gene: str
           gene name

       missense: int
           1 if missense else 0

       e.g.
       {
        "id": "1_12172228_G_A",
        "gene": "TNFRSF8",
        "missense": 0
       }
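
Since the mutation id packs four fields into one string, it can be useful to split it back into its parts. This is a minimal sketch, assuming the separator is always an underscore and that only the chromosome name could itself contain one. `MutationId` and `parse_mutation_id` are hypothetical helper names, not part of the repository:

```python
from typing import NamedTuple


class MutationId(NamedTuple):
    chrom: str
    position: int
    ref: str
    alt: str


def parse_mutation_id(mutation_id: str) -> MutationId:
    # Split from the right so that a contig name containing "_"
    # (an assumption; the example ids use plain chromosome names) stays intact.
    chrom, position, ref, alt = mutation_id.rsplit("_", 3)
    return MutationId(chrom, int(position), ref, alt)


parsed = parse_mutation_id("1_12172228_G_A")
print(parsed.chrom, parsed.position, parsed.ref, parsed.alt)
```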

neoantigens: list

     all neoantigens observed across all samples of the patient; for each neoantigen, report:

       id: str
           format <chrom>_<position>_<ref_nucleotide>_<alt_nucleotide>_<mutated_position>_<peptide_length>_<HLA_allele>
       mutation_id: str
       HLA_gene_id: int
       sequence: str
       WT_sequence: str
       mutated_position: int
       Kd: float
       KdWT: float

       e.g.
       {
        "id": "19_44352078_G_A_5_9_C0303",
        "mutation_id": "19_44352078_G_A",
        "HLA_gene_id": "HLA-C03:03",
        "sequence": "KAFSHGYHL",
        "WT_sequence": "KAFSRGYHL",
        "mutated_position": 5,
        "Kd": 29.0,
        "KdWT": 30.0
       }
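
When building these records for new samples, a quick consistency check between the `id` string and the other fields may catch formatting mistakes early. The sketch below is not code from the repository. `check_neoantigen` is a hypothetical helper, and it assumes a single amino-acid substitution with a 1-based `mutated_position`, which is what the example record above suggests:

```python
from typing import List


def check_neoantigen(neo: dict) -> List[str]:
    """Return a list of inconsistencies found in one neoantigen record."""
    problems = []
    # id = <mutation_id>_<mutated_position>_<peptide_length>_<HLA_allele>
    mutation_id, mut_pos, pep_len, _hla = neo["id"].rsplit("_", 3)
    if mutation_id != neo["mutation_id"]:
        problems.append("mutation_id does not match the id prefix")
    if int(mut_pos) != neo["mutated_position"]:
        problems.append("mutated_position does not match the id")
    if int(pep_len) != len(neo["sequence"]):
        problems.append("peptide length in the id does not match sequence")
    # Assumption: exactly one substitution, reported as a 1-based position.
    diffs = [
        i + 1
        for i, (mut, wt) in enumerate(zip(neo["sequence"], neo["WT_sequence"]))
        if mut != wt
    ]
    if diffs != [neo["mutated_position"]]:
        problems.append("sequence and WT_sequence do not differ only at mutated_position")
    return problems


example = {
    "id": "19_44352078_G_A_5_9_C0303",
    "mutation_id": "19_44352078_G_A",
    "HLA_gene_id": "HLA-C03:03",
    "sequence": "KAFSHGYHL",
    "WT_sequence": "KAFSRGYHL",
    "mutated_position": 5,
    "Kd": 29.0,
    "KdWT": 30.0,
}
print(check_neoantigen(example))  # prints []
```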

sample_trees: list of trees, in Tree format (described below)

Tree format:

 topology: Node
         root clone node of the tree, Node format described below

 score: float, log-likelihood score (from PhyloWGS)
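
As an illustration of the Tree format, the snippet below picks the tree with the highest log-likelihood score from a sample. It is only a sketch: the JSON content is a made-up, heavily trimmed stand-in for a real file, `best_tree` is a hypothetical helper, and whether a downstream analysis should use a single tree or several is not specified in this description.

```python
import json

# Hypothetical, trimmed sample; real files carry full Node objects.
sample_json = """
{
  "id": "sample_1",
  "sample_trees": [
    {"topology": {"clone_id": 0, "children": []}, "score": -1250.5},
    {"topology": {"clone_id": 0, "children": []}, "score": -1198.2}
  ]
}
"""


def best_tree(sample: dict) -> dict:
    # A higher (less negative) log-likelihood indicates a better fit.
    return max(sample["sample_trees"], key=lambda tree: tree["score"])


sample = json.loads(sample_json)
print(best_tree(sample)["score"])  # prints -1198.2
```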

Node format:

 clone_id: int

  clone_mutations: list
      list of mutation identifiers that originate in that clone, e.g.
      ["20_16360370_G_C", "2_89869798_C_A", ...]

  children: list of children nodes, in Node format

  X: float, cancer cell fraction (CCF)

  x: float, exclusive frequency (see eq. 8)

  new_x: float, frequency if this is a new clone (optional)
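
Because each Node nests its children, the tree can be walked recursively. The sketch below is an illustration rather than code from the repository: `iter_nodes` and `mutations_per_clone` are hypothetical helpers, the toy tree and its `X` values are invented, and it assumes that a clone carries the mutations that originate in it plus those of all its ancestors.

```python
from typing import Dict, Iterator, Set, Tuple


def iter_nodes(node: dict, depth: int = 0) -> Iterator[Tuple[dict, int]]:
    """Yield every node with its depth, parents before children."""
    yield node, depth
    for child in node.get("children", []):
        yield from iter_nodes(child, depth + 1)


def mutations_per_clone(node: dict, inherited: frozenset = frozenset()) -> Dict[int, Set[str]]:
    """Map clone_id to all mutations carried by that clone.

    Assumption: mutations that originate in an ancestor clone are
    inherited by every descendant clone.
    """
    carried = set(inherited) | set(node.get("clone_mutations", []))
    result = {node["clone_id"]: carried}
    for child in node.get("children", []):
        result.update(mutations_per_clone(child, frozenset(carried)))
    return result


# Invented toy tree in Node format (values are for illustration only).
toy_root = {
    "clone_id": 0,
    "clone_mutations": [],
    "X": 1.0,
    "children": [
        {
            "clone_id": 1,
            "clone_mutations": ["20_16360370_G_C"],
            "X": 0.6,
            "children": [
                {
                    "clone_id": 2,
                    "clone_mutations": ["2_89869798_C_A"],
                    "X": 0.25,
                    "children": [],
                }
            ],
        }
    ],
}

for node, depth in iter_nodes(toy_root):
    print("  " * depth + f"clone {node['clone_id']}: X={node['X']}")
print(mutations_per_clone(toy_root))
```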