GavinHaLab / CRPCSubtypingPaper

Analysis code used in ctDNA CRPC phenotype manuscript
Other
2 stars 1 forks source link

Questions regarding the execution of XGBClassifier.py for testing #2

Closed shufanzhang closed 8 months ago

shufanzhang commented 8 months ago

Hello @GavinHaLab and @denniepatton,

Thank you very much for sharing the CRPC subtyping paper analysis pipeline. With reference to your reply to alanasweinstein, we were able to successfully construct the corresponding pickle file using the files provided in the CRPCSubtypingPaper/Data/directory.

#train data
LuCaP_df = pd.read_table("GriffinFeatureMatrix_LuCaP.tsv", sep='\t', index_col=0)
Healthy_df = pd.read_table("GriffinFeatureMatrix_HealthyDonor.tsv", sep='\t', index_col=0)
Subtype_df = pd.read_table("RefSubtypes.txt", sep='\t', header=None, index_col=0)
Subtype_df.columns = ['Subtype']

#test data
test_df = pd.read_table("GriffinFeatureMatrix_ULP.tsv", sep='\t', index_col=0)
TFX_df = pd.read_table("ULP_TF_hg19.txt", sep='\t', header=None, index_col=0)
TFX_df.columns = ['TFX']

#add Subtype to train data
train_df = pd.concat([LuCaP_df, Healthy_df], join='inner')
train_df = train_df.join(Subtype_df)
LuCaP_df = LuCaP_df.join(Subtype_df)
Healthy_df = Healthy_df.join(Subtype_df)

#add TFX to test data
test_df = test_df.join(TFX_df)

#save DataFrame as pickle file format
train_df.to_pickle('train_df.pkl')
LuCaP_df.to_pickle('LuCaP_df.pkl')
Healthy_df.to_pickle('Healthy_df.pkl')
test_df.to_pickle('test_df.pkl')

We utilized LuCaP_df.pkl (with the Subtype column) as the test data, and randomly generated the TFX column for it. Referring to XGBClassifier.py, we performed classification testing on the ATAC features and obtained a result of Mean ROC (AUC=0.97).

#import function from XGBClassifier.py
import XGBClassifier as XGB
test_name = 'LuCaP'
model_type = 'XGBoost'
if test_name == 'LuCaP':
    results_direct = '/mnt/XGBoost/LuCaP/'
    if not os.path.exists(results_direct): os.makedirs(results_direct)
    comparisons = {'Subtype': ['ARPC', 'NEPC']}
    pickl = 'LuCaP_df.pkl'
    print("Loading " + pickl)
    df = pd.read_pickle(pickl)

#Add a column of TFX, randomly generated from the range 0.1-0.3
df = df[df['Subtype'].notna()]
df['TFX'] = pd.Series(np.random.uniform(low=0.1, high=0.3, size=len(df)), index=df.index)

#remove AMPC & ARlow Subtype
for comparison, test_list in comparisons.items():
    if comparison == 'Subtype':
        df_sub = df_sub.rename(columns={comparison: 'Subtype'})
        df_sub = df_sub[df_sub['Subtype'] != 'AMPC']
        df_sub = df_sub[df_sub['Subtype'] != 'ARlow']

# categorical tests
sub_groups = ['-ATAC']
for sub_group in sub_groups:
    df_sub_group = pd.concat([df_sub[set(df_sub.columns).intersection({'Subtype', 'TFX'})], df_sub.filter(regex=sub_group)], axis=1)
    XGB.run_experiment(df_sub_group, comparison, results_direct, model_type, False, 'ALL' + sub_group, test_name)

The results from XGBClassifier.py seem to correspond to Figure 4E in the paper. Given the subtypes of known samples and their corresponding cfDNA TFX, different feature columns were selected to evaluate the classification performance of each feature set using XGBoost. This was done to select the optimal features for the ctdPhone classifier.

We have the following questions and would greatly appreciate your response:

  1. Referring to Figure 1, if LuCaP_FM.pkl in XGBClassifier.py is ctDNA, are all values in its TFX column filled with 1?
  2. In lines 566-568 of XGBClassifier.py, there are three lists: pheno46, tf404, and ar10. Based on subset_data(), it is inferred that these are gene lists. How were these genes selected?
  3. Can you elaborate on the selection strategy for the chromatin regions, promoter regions, gene body regions, and TFBS regions shown in Figure 4E? Specifically, how were the bed files for these regions generated, such as using DiffBind?
denniepatton commented 8 months ago

Hi Shufan,

  1. Yes that's right, in any process involving the LuCaPs (and their data) TFX is assumed to be equal to 1.0
  2. These are indeed gene lists to perform subsetting: pheno46 (47 at time of publishing) refers to phenotype-defining genes, see Phenotype Lineage-Specific Gene Marker Selection in the paper's methods; tf404 refers to 404 deferentially expressed transcription factors, see Differential mRNA Expression Analysis in the paper's methods; ar10 simply refers to a subset of the pheno46 which are ARPC-specific and this list was not used in any final results.
  3. The following sections from the methods should point you in the right direction: Differential Histone PTM Analysis ATAC-seq Analysis TFBS Selection from GTRD Gene Body and Promoter Region Selection

Ref: https://aacrjournals.org/cancerdiscovery/article/13/3/632/716743/Nucleosome-Patterns-in-Circulating-Tumor-DNA

shufanzhang commented 8 months ago

Thank you for your helpful response to my question on GitHub. I appreciate the time and effort you put into addressing my concerns. Your guidance had completely solved my problem.