RyanWangZf / Trial2Vec

Findings of EMNLP'22 | Trial2Vec: Zero-Shot Clinical Trial Document Similarity Search using Self-Supervision
MIT License

[Feature Request] - How to transform the API response to valid format for Trial2Vec? #5

Open JPonsa opened 6 months ago

JPonsa commented 6 months ago

Hi,

I would like to know the best way to reshape the output of https://clinicaltrials.gov/api/v2 so that it can be used with Trial2Vec. I wrote a quick and dirty function (see code below) that produces a result similar to the demo data, but I am not sure the logic is 100% correct.

import requests
import pandas as pd

def getClinicalTrialStudy(nct_id: str) -> dict:
    """Fetch a single study from the ClinicalTrials.gov API v2 as a dict."""
    url = f"https://clinicaltrials.gov/api/v2/studies/{nct_id}"
    # The response is parsed as JSON, so request JSON explicitly.
    headers = {"accept": "application/json"}
    response = requests.get(url, headers=headers, timeout=30)

    # Raise on HTTP errors so callers never receive None by accident.
    response.raise_for_status()
    return response.json()

def ct_dict2pd(study: dict) -> pd.Series:
    """Reformat the output of the ClinicalTrials.gov API into the format expected by Trial2Vec.

    Parameters
    ----------
    study : dict
        Clinical trial record obtained from https://clinicaltrials.gov/api/v2/

    Returns
    -------
    pd.Series
        Study fields laid out like the columns of the Trial2Vec demo data
    """
    protocol = study['protocolSection']
    nct_id = protocol['identificationModule']['nctId']
    description = protocol['descriptionModule']['briefSummary']
    title = protocol['identificationModule']['officialTitle']
    # sorted() keeps the joined strings in a stable order across runs; .get() covers
    # parts a study may omit (assumed optional: arms, keywords, outcomes, references).
    intervention_name = ', '.join(sorted(set(
        j for i in protocol.get('armsInterventionsModule', {}).get('armGroups', [])
        for j in i.get('interventionNames', []))))
    disease = ', '.join(sorted(protocol['conditionsModule']['conditions']))
    keyword = ', '.join(sorted(protocol['conditionsModule'].get('keywords', [])))
    outcome_measure = ', '.join(sorted(set(
        i['measure'] for i in protocol.get('outcomesModule', {}).get('primaryOutcomes', []))))
    criteria = (protocol['eligibilityModule']['eligibilityCriteria']
                .replace("\n* ", "~").replace("\n", "~").replace("~~", "~"))
    # Heuristic: keep the text between the first and second period of each citation.
    reference = ', '.join(sorted(set(
        i['citation'].split(".")[1].lstrip(" ")
        for i in protocol.get('referencesModule', {}).get('references', [])
        if i.get('citation', '').count(".") >= 1)))
    overall_status = protocol['statusModule']['overallStatus']

    return pd.Series({
        'nct_id':nct_id,
        'description':description,
        'title':title,
        'intervention_name':intervention_name,
        'disease':disease,
        'keyword':keyword,
        'outcome_measure':outcome_measure,
        'criteria':criteria,
        'reference':reference,
        'overall_status':overall_status
        })

study_dict = getClinicalTrialStudy('NCT03760770')
study_pd = ct_dict2pd(study_dict).to_frame().transpose()
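
One thing I am unsure about is how to handle studies where part of the response is missing. As far as I can tell, some modules (for example `referencesModule` or the `keywords` list) are not present for every study, so direct indexing can raise a `KeyError`. Below is a minimal sketch of a nested lookup with defaults. The helper names `dig` and `join_unique` are hypothetical (not part of Trial2Vec or the API), and the toy record only mimics the nesting of the response; it is not real trial data.

```python
from typing import Any, Iterable


def dig(data: dict, *keys: str, default: Any = None) -> Any:
    """Walk nested dicts and return `default` as soon as a key is missing."""
    current: Any = data
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def join_unique(values: Iterable[str]) -> str:
    """Join strings with ', ' after de-duplicating and sorting for a stable order."""
    return ", ".join(sorted(set(values)))


# Toy record (assumption): same nesting as the API response, made-up values.
study = {
    "protocolSection": {
        "identificationModule": {"nctId": "NCT00000000"},
        "conditionsModule": {"conditions": ["Asthma", "Allergy", "Asthma"]},
    }
}

nct_id = dig(study, "protocolSection", "identificationModule", "nctId", default="")
disease = join_unique(
    dig(study, "protocolSection", "conditionsModule", "conditions", default=[])
)
# Neither "keywords" nor "referencesModule" exists in the toy record.
keyword = join_unique(
    dig(study, "protocolSection", "conditionsModule", "keywords", default=[])
)
reference_count = len(
    dig(study, "protocolSection", "referencesModule", "references", default=[])
)
```

With this approach a missing field becomes an empty string instead of an exception. Whether Trial2Vec expects an empty string or something else for missing values is part of what I would like to confirm.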