clemsgrs / hipt

Re-implementation of HIPT

About label #10

Closed AlexNmSED closed 1 year ago

AlexNmSED commented 1 year ago

Thanks for sharing, this is very kind of you. The information I found on TCGA somewhat conflicts with what the original HIPT repository mentions. Could you provide the label file used for training?

clemsgrs commented 1 year ago

hi, no problem 👍 since I cannot attach files to GitHub messages, here are 3 simple steps describing how I came up with the labels used for TCGA binary classification:

  1. I loaded original HIPT labels tcga_brca_subset.csv (here)
  2. that file has 937 entries, of which only 875 are used for training/tuning/testing (based on one of the splits files, this one for example); the rest are dropped because they are neither IDC nor ILC
  3. I generated binary labels from the oncotree_code column with the following function:
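def map_otc_to_int(oncotree_code: str, missing_label: int = -1) -> int:
    # Map an OncoTree code to a binary label; any other code gets missing_label.
    if oncotree_code == 'IDC':
        return 0  # invasive ductal carcinoma
    elif oncotree_code == 'ILC':
        return 1  # invasive lobular carcinoma
    else:
        return missing_label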

doing so, you should end up with 837 case_id mapped to 875 slide_id and the following label counts:

let me know if this helps
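
For anyone reproducing the three steps above, here is a minimal sketch of how they could fit together using only the Python standard library. It assumes the CSV exposes `case_id`, `slide_id` and `oncotree_code` columns, as mentioned in this thread. The helper name `build_binary_labels` and the sample rows are made up for illustration and are not taken from the repository, so the printed numbers will not match the real 837/875 figures.

```python
import csv
import io
from collections import Counter


def map_otc_to_int(oncotree_code: str, missing_label: int = -1) -> int:
    """Map an OncoTree code to a binary label (IDC -> 0, ILC -> 1)."""
    return {"IDC": 0, "ILC": 1}.get(oncotree_code, missing_label)


def build_binary_labels(csv_file):
    """Hypothetical helper: keep IDC/ILC slides and count labels.

    `csv_file` is any open text file (or file-like object) whose header
    contains the case_id, slide_id and oncotree_code columns.
    """
    rows = []
    for row in csv.DictReader(csv_file):
        label = map_otc_to_int(row["oncotree_code"])
        if label == -1:
            continue  # drop slides that are neither IDC nor ILC
        rows.append(
            {"case_id": row["case_id"], "slide_id": row["slide_id"], "label": label}
        )
    counts = Counter(r["label"] for r in rows)
    return rows, counts


# Made-up rows standing in for tcga_brca_subset.csv; real IDs and counts differ.
SAMPLE_CSV = """case_id,slide_id,oncotree_code
case_A,slide_A1,IDC
case_A,slide_A2,IDC
case_B,slide_B1,ILC
case_C,slide_C1,MDLC
"""

rows, counts = build_binary_labels(io.StringIO(SAMPLE_CSV))
n_cases = len({r["case_id"] for r in rows})
print(f"{n_cases} case_id mapped to {len(rows)} slide_id")
print(dict(counts))
```

To run it on the actual file, pass an opened handle instead of the in-memory sample, e.g. `build_binary_labels(open("tcga_brca_subset.csv", newline=""))`, assuming the file sits in the working directory.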

AlexNmSED commented 1 year ago

Thank you for your help. The code is very readable and has inspired me a lot.