PNNL-CompBio / coderdata

Automation scripts and benchmark dataset package for cancer drug prediction deep learning models.

HCMI sample id is not unique #37

Closed sgosline closed 9 months ago

sgosline commented 9 months ago

When I parse the data by improve sample ID, I can't distinguish the model type. (Screenshot 2023-10-06 at 4 39 58 PM)

We definitely need to keep track of which samples come from the same patient, but each sample must be distinct.

sgosline commented 9 months ago

This might be my fault from when I fixed the cancer type; still investigating.

sgosline commented 9 months ago

Duplications are indeed happening here. I think the other_id is the only unique one. (Screenshot 2023-10-09 at 9 10 49 AM)

jjacobson95 commented 9 months ago

Thanks for investigating this. Should I reassign improve_sample_id by other_id then?

sgosline commented 9 months ago

Yes, I think that will be best. As a sanity check, you can check the gene-level data to make sure there are no duplications on a gene-by-sample basis. There are generally two sources of these duplications: gene mappings (in which case we should average) or sample duplications (which should be eliminated).
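The gene-by-sample sanity check could be sketched roughly as follows. This is only an illustration using the standard library: the `tpm` column name and the example rows are hypothetical, while `entrez_id` and `improve_sample_id` are the identifiers discussed in this thread.

```python
from collections import Counter


def find_duplicate_pairs(rows, gene_key="entrez_id", sample_key="improve_sample_id"):
    """Return {(gene, sample): count} for every pair that occurs more than once."""
    counts = Counter((row[gene_key], row[sample_key]) for row in rows)
    return {pair: n for pair, n in counts.items() if n > 1}


# Hypothetical gene-level rows; the "tpm" column name is an assumption.
rows = [
    {"entrez_id": 7157, "improve_sample_id": 1, "tpm": 10.0},
    {"entrez_id": 7157, "improve_sample_id": 1, "tpm": 12.0},
    {"entrez_id": 672, "improve_sample_id": 1, "tpm": 3.5},
]
duplicates = find_duplicate_pairs(rows)
```

An empty result means every gene appears at most once per sample. Any remaining pair points to either a gene-mapping or a sample-level duplication.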

jjacobson95 commented 9 months ago

Okay, after looking into this, it appears that these were assigned by other_id after all, and other_id was mapped to sample_id in the HCMI data. I'm looking into how to map these to the aliquot ID level by calling the API, and I'll assign those to other_id instead. Then I'll reassign the improve ID based on that.

jjacobson95 commented 9 months ago

Also looking at gene-by-sample, I am getting a few duplicates. Here is a small subset with 2 examples. They have different entrez_id but the same gene_name. How would you like me to handle this - should I still average these by gene_name?

Screenshot 2023-10-09 at 2 02 07 PM
sgosline commented 9 months ago

OK, I'd do a filter first on protein_coding genes - we do not need nucleolar RNA or pseudogenes.

jjacobson95 commented 9 months ago

This removed quite a few, but still about 25 duplicates remaining. Here is a representative subset.

Screenshot 2023-10-09 at 2 33 57 PM
sgosline commented 9 months ago

Got it - you can just sum the TPM together for multiple isoforms of the same gene. Just make sure you are doing it by entrez_id and not gene_name. In this example it seems the entrez_ids are distinct.

jjacobson95 commented 9 months ago

Okay got it. So only sum if the entrez_id is identical (which would be none in the example above)?

sgosline commented 9 months ago

That's correct!
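As a rough sketch of the rule agreed on above, TPM values are summed only when both the sample and the entrez_id match, so rows that share a gene_name but have distinct entrez_ids stay separate. The `tpm` column name, the example values, and the gene name are hypothetical.

```python
from collections import defaultdict


def sum_tpm_by_entrez(rows):
    """Sum TPM per (improve_sample_id, entrez_id); gene_name is ignored on purpose."""
    totals = defaultdict(float)
    for row in rows:
        totals[(row["improve_sample_id"], row["entrez_id"])] += row["tpm"]
    return dict(totals)


# Hypothetical rows: two isoforms of entrez_id 100, plus a second entrez_id
# that happens to share the same gene_name.
rows = [
    {"improve_sample_id": 1, "entrez_id": 100, "gene_name": "GENE_A", "tpm": 1.5},
    {"improve_sample_id": 1, "entrez_id": 100, "gene_name": "GENE_A", "tpm": 2.5},
    {"improve_sample_id": 1, "entrez_id": 200, "gene_name": "GENE_A", "tpm": 7.0},
]
summed = sum_tpm_by_entrez(rows)
```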

jjacobson95 commented 9 months ago

Great. Still working on handling the duplication issue... I've dug into this further and it's pretty tricky to unravel. For example, there is a patient I've found with 1 case_id, 2 primary diagnoses (primary and metastatic), 3 sample_ids, and 5 aliquot_ids. The value that is populated in the manifest is the file ID, and each file contains multiple samples from the same patient. From the file ID, I can call the API to retrieve the case ID, sample ID, etc.

But this will map a single sample or aliquot to multiple diagnoses. I've looked through the API fields and I'm not sure how to get around this yet.

This is the code I'm using to call the API:


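import requests

def fetch_metadata_for_samples(uuids):
    """Fetch case, sample, aliquot, and diagnosis metadata for GDC file UUIDs."""
    endpoint = "https://api.gdc.cancer.gov/files"
    payload = {
        # Match any file whose file_id is in the supplied list.
        "filters": {
            "op": "in",
            "content": {
                "field": "files.file_id",
                "value": uuids
            }
        },
        "fields": "cases.sample_ids,cases.case_id,cases.samples.sample_id,cases.samples.portions.analytes.aliquots.aliquot_id,cases.samples.sample_type,cases.diagnoses.tissue_or_organ_of_origin,cases.diagnoses.primary_diagnosis",
        "format": "JSON",
        # Request a single page large enough to hold every requested file.
        "size": str(len(uuids))
    }
    response = requests.post(endpoint, json=payload)
    # Fail loudly on an HTTP error instead of trying to parse an error body.
    response.raise_for_status()
    return response.json()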
sgosline commented 9 months ago

Ok, this is really fascinating. I think for every patient there will likely be blood and tumor (in organoid or other model) and maybe tumor tissue. This patient also has a metastasis. I would filter for cases.tissue_type is 'Primary tumor' and cases.tumor_descriptor is 'Primary'. Does that reduce the duplicates? I think there will still be different samples.sample_type but hopefully these have distinct measurements...
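One way to try this filter is on the client side, against records that have already been fetched. The sketch below is only an illustration: it relies on `tumor_descriptor` sitting under each sample, as it does in the API result posted further down this thread, and it leaves out `tissue_type` because that field does not appear in that result. The function name and the example IDs are hypothetical.

```python
def keep_samples_by_descriptor(case, descriptor="Primary"):
    """Return the samples of one case whose tumor_descriptor equals `descriptor`."""
    return [
        sample
        for sample in case.get("samples", [])
        if sample.get("tumor_descriptor") == descriptor
    ]


# Hypothetical case record shaped like one entry of hit["cases"].
case = {
    "case_id": "case-1",
    "samples": [
        {"sample_id": "sample-a", "tumor_descriptor": "Primary"},
        {"sample_id": "sample-b", "tumor_descriptor": "Metastatic"},
        {"sample_id": "sample-c", "tumor_descriptor": "Not Applicable"},
    ],
}
primary_only = keep_samples_by_descriptor(case)
```

Note that a strict match on 'Primary' also drops samples marked 'Not Applicable', which may or may not be wanted.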

jjacobson95 commented 9 months ago

Still ending up with quite a few cases like this where the aliquot id ends up being the same between metastatic and normal. So two diagnoses for one aliquot/sample.

See bottom two rows:

Screenshot 2023-10-10 at 4 04 15 PM
sgosline commented 9 months ago

Can you send me a link to this case id? I wonder if the metastatic diagnosis occurred later, so we might want to take the first diagnosis....

jjacobson95 commented 9 months ago

Looks like 3 diagnoses for this one. First metastasis diagnosis was about 100 days later than primary tumor. I think for most that I've seen, the metastasis has been diagnosed at a later date.

https://portal.gdc.cancer.gov/cases/feaef50b-0e38-4a04-b632-81ae538e1c22?bioId=b96fe84a-9723-4c2e-8145-f4512e76c428

I could try filtering out any cases where there are duplicate aliquot ids and "metastatic" is in primary diagnosis?

sgosline commented 9 months ago

A sample from a metastatic tumor is different from a sample from an actual metastasis, so I don't want to remove the former. However, in the case where there are different diagnoses, can we just take the earliest one?
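Taking the earliest diagnosis could look roughly like the sketch below. It assumes each diagnosis record carries a numeric ordering value, called `days_to_diagnosis` here. That field is not among the fields requested earlier in this thread, so both its name and its availability are assumptions to verify against the API.

```python
def earliest_diagnosis(diagnoses, key="days_to_diagnosis"):
    """Return the diagnosis with the smallest `key` value, or None if none is dated."""
    dated = [d for d in diagnoses if d.get(key) is not None]
    if not dated:
        return None
    return min(dated, key=lambda d: d[key])


# Hypothetical diagnoses; the day offsets are illustrative only.
diagnoses = [
    {"primary_diagnosis": "Adenocarcinoma, metastatic, NOS", "days_to_diagnosis": 100},
    {"primary_diagnosis": "Adenocarcinoma, NOS", "days_to_diagnosis": 0},
    {"primary_diagnosis": "Adenocarcinoma, metastatic, NOS", "days_to_diagnosis": None},
]
first = earliest_diagnosis(diagnoses)
```

Records without a usable date are skipped rather than treated as earliest, so a missing value never wins by accident.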

jjacobson95 commented 9 months ago

I think part of this is due to how I've been generating tables from the API results. Currently reworking the processing loop below to regenerate the tables. Getting closer, but not there yet. I attached the API result and the current table generation function if you are interested in seeing them.

Short example result from API from manifest with 5 input files : {'data': {'hits': [{'id': '4e9f9377-151e-431e-aba4-a8ea20f6d781', 'cases': [{'case_id': 'df69a9b0-63a6-47f4-b3a7-beb37edf100d', 'diagnoses': [{'tissue_or_organ_of_origin': 'Colon, NOS', 'primary_diagnosis': 'Adenocarcinoma, metastatic, NOS'}, {'tissue_or_organ_of_origin': 'Colon, NOS', 'primary_diagnosis': 'Adenocarcinoma, NOS'}], 'samples': [{'tumor_descriptor': 'Not Applicable', 'sample_id': '689341dd-4b48-46e6-bba7-765b5832eae9', 'sample_type': 'Blood Derived Normal', 'portions': [{'analytes': [{'aliquots': [{'aliquot_id': 'e0ed9fca-55a4-491c-993c-7a6ae36af3f3'}]}]}]}, {'tumor_descriptor': 'Primary', 'sample_id': '84cb8ace-99bf-43f8-90e6-928714974497', 'sample_type': 'Next Generation Cancer Model', 'portions': [{'analytes': [{'aliquots': [{'aliquot_id': '871b7cac-32a6-4620-b1e0-431671232b72'}]}]}]}]}]}, {'id': 'fc739216-5669-4852-ac56-52ad4312848b', 'cases': [{'case_id': 'df69a9b0-63a6-47f4-b3a7-beb37edf100d', 'diagnoses': [{'tissue_or_organ_of_origin': 'Colon, NOS', 'primary_diagnosis': 'Adenocarcinoma, metastatic, NOS'}, {'tissue_or_organ_of_origin': 'Colon, NOS', 'primary_diagnosis': 'Adenocarcinoma, NOS'}], 'samples': [{'tumor_descriptor': 'Metastatic', 'sample_id': '3893bc3e-c85d-4fa1-bb4e-4706c4881c7e', 'sample_type': 'Next Generation Cancer Model', 'portions': [{'analytes': [{'aliquots': [{'aliquot_id': '072df89e-9311-45da-aee6-4dbacefbbebc'}]}]}]}, {'tumor_descriptor': 'Not Applicable', 'sample_id': '689341dd-4b48-46e6-bba7-765b5832eae9', 'sample_type': 'Blood Derived Normal', 'portions': [{'analytes': [{'aliquots': [{'aliquot_id': 'e0ed9fca-55a4-491c-993c-7a6ae36af3f3'}]}]}]}]}]}, {'id': '311d3a32-1d93-43a6-870f-6568e8e7b01f', 'cases': [{'case_id': 'df69a9b0-63a6-47f4-b3a7-beb37edf100d', 'diagnoses': [{'tissue_or_organ_of_origin': 'Colon, NOS', 'primary_diagnosis': 'Adenocarcinoma, metastatic, NOS'}, {'tissue_or_organ_of_origin': 'Colon, NOS', 'primary_diagnosis': 'Adenocarcinoma, NOS'}], 
'samples': [{'tumor_descriptor': 'Not Applicable', 'sample_id': '689341dd-4b48-46e6-bba7-765b5832eae9', 'sample_type': 'Blood Derived Normal', 'portions': [{'analytes': [{'aliquots': [{'aliquot_id': 'e0ed9fca-55a4-491c-993c-7a6ae36af3f3'}]}]}]}, {'tumor_descriptor': 'Primary', 'sample_id': '84cb8ace-99bf-43f8-90e6-928714974497', 'sample_type': 'Next Generation Cancer Model', 'portions': [{'analytes': [{'aliquots': [{'aliquot_id': '871b7cac-32a6-4620-b1e0-431671232b72'}]}]}]}]}]}, {'id': '5090e9ef-5ace-4d27-8374-c4743aceca0c', 'cases': [{'case_id': 'df69a9b0-63a6-47f4-b3a7-beb37edf100d', 'diagnoses': [{'tissue_or_organ_of_origin': 'Colon, NOS', 'primary_diagnosis': 'Adenocarcinoma, metastatic, NOS'}, {'tissue_or_organ_of_origin': 'Colon, NOS', 'primary_diagnosis': 'Adenocarcinoma, NOS'}], 'samples': [{'tumor_descriptor': 'Metastatic', 'sample_id': '3893bc3e-c85d-4fa1-bb4e-4706c4881c7e', 'sample_type': 'Next Generation Cancer Model', 'portions': [{'analytes': [{'aliquots': [{'aliquot_id': '072df89e-9311-45da-aee6-4dbacefbbebc'}]}]}]}, {'tumor_descriptor': 'Not Applicable', 'sample_id': '689341dd-4b48-46e6-bba7-765b5832eae9', 'sample_type': 'Blood Derived Normal', 'portions': [{'analytes': [{'aliquots': [{'aliquot_id': 'e0ed9fca-55a4-491c-993c-7a6ae36af3f3'}]}]}]}]}]}, {'id': '63c3d8b3-47a6-494a-ba9a-bfa21b6bb579', 'cases': [{'case_id': 'feaef50b-0e38-4a04-b632-81ae538e1c22', 'diagnoses': [{'tissue_or_organ_of_origin': 'Stomach, NOS', 'primary_diagnosis': 'Adenocarcinoma, NOS'}, {'tissue_or_organ_of_origin': 'Stomach, NOS', 'primary_diagnosis': 'Adenocarcinoma, metastatic, NOS'}, {'tissue_or_organ_of_origin': 'Stomach, NOS', 'primary_diagnosis': 'Adenocarcinoma, metastatic, NOS'}], 'samples': [{'tumor_descriptor': 'Primary', 'sample_id': '9457021f-1b50-4f63-bac3-e1f09bf30738', 'sample_type': 'Primary Tumor', 'portions': [{'analytes': [{'aliquots': [{'aliquot_id': '9a7a775d-c8b1-46f8-9077-453a47896133'}]}]}]}, {'tumor_descriptor': 'Not Applicable', 'sample_id': 
'e1982beb-0907-4cb1-a261-7fca6211cc49', 'sample_type': 'Blood Derived Normal', 'portions': [{'analytes': [{'aliquots': [{'aliquot_id': 'c3a69a3d-c597-4dd6-a0c1-80f5d544f18a'}]}]}]}]}]}], 'pagination': {'count': 5, 'total': 5, 'size': 5, 'from': 0, 'sort': '', 'page': 1, 'pages': 1}}, 'warnings': {}}

Current code to turn this into a table:

def extract_data(data):
    """Flatten the GDC response into one row per aliquot."""
    extracted = []
    for hit in data['data']['hits']:
        for case in hit['cases']:
            for idx, sample in enumerate(case['samples']):
                for portion in sample['portions']:
                    for analyte in portion['analytes']:
                        for aliquot in analyte['aliquots']:
                            # Pairs the sample at position idx with the diagnosis at the
                            # same position; samples beyond the last diagnosis are skipped.
                            if idx < len(case['diagnoses']):
                                diagnosis = case['diagnoses'][idx]
                                extracted.append({
                                    'id': hit['id'],
                                    'case_id': case['case_id'],
                                    'tissue_or_organ_of_origin': diagnosis['tissue_or_organ_of_origin'],
                                    'primary_diagnosis': diagnosis['primary_diagnosis'],
                                    'sample_id': sample['sample_id'],
                                    'sample_type': sample['sample_type'],
                                    'tumor_descriptor': sample.get('tumor_descriptor', None),
                                    'aliquot_id': aliquot['aliquot_id']
                                })
    return extracted
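The loop above pairs the sample at index idx with the diagnosis at the same index. In the API result above, the two lists are not aligned: in the first hit, the 'Blood Derived Normal' sample lands on 'Adenocarcinoma, metastatic, NOS'. A possible alternative, sketched here rather than taken from the repository, keeps one row per aliquot_id and attaches all case-level diagnoses without choosing one. The function name and the output column names `primary_diagnoses` and `file_ids` are hypothetical.

```python
def extract_aliquots(data):
    """Return one row per aliquot_id, keeping every case-level diagnosis."""
    rows = {}
    for hit in data["data"]["hits"]:
        for case in hit["cases"]:
            # Diagnoses belong to the case, not to a sample, so keep them all.
            diagnoses = sorted(
                {d["primary_diagnosis"] for d in case.get("diagnoses", [])}
            )
            for sample in case.get("samples", []):
                for portion in sample.get("portions", []):
                    for analyte in portion.get("analytes", []):
                        for aliquot in analyte.get("aliquots", []):
                            # The same aliquot can appear under several files;
                            # reuse its row and only record the extra file ID.
                            row = rows.setdefault(aliquot["aliquot_id"], {
                                "aliquot_id": aliquot["aliquot_id"],
                                "sample_id": sample["sample_id"],
                                "case_id": case["case_id"],
                                "sample_type": sample["sample_type"],
                                "tumor_descriptor": sample.get("tumor_descriptor"),
                                "primary_diagnoses": diagnoses,
                                "file_ids": [],
                            })
                            row["file_ids"].append(hit["id"])
    return list(rows.values())
```

Picking a single diagnosis per aliquot then becomes a separate, explicit step instead of a side effect of list order.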
jjacobson95 commented 9 months ago

So it's quite close to working. Transcriptomics is working well, but I noticed that it's filtering out about 50% of the results from the mutations and copy_number outputs now.

I dug in a bit deeper and found that this results from a filter on the samples.csv file. Without this filter, aliquot IDs are not unique, so I'm working on different assignment logic for this case: df.tumor_descriptor == "Not Applicable".

High level, this basically comes down to some samples having multiple aliquot IDs.
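To see which samples are affected, a small helper along these lines could list every sample_id that maps to more than one aliquot_id. It is a sketch over plain dictionaries, with a hypothetical function name and placeholder IDs; `sample_id` and `aliquot_id` are the column names used in the table code above.

```python
from collections import defaultdict


def samples_with_multiple_aliquots(rows):
    """Return {sample_id: sorted aliquot_ids} for samples with more than one aliquot."""
    aliquots = defaultdict(set)
    for row in rows:
        aliquots[row["sample_id"]].add(row["aliquot_id"])
    return {sid: sorted(ids) for sid, ids in aliquots.items() if len(ids) > 1}


# Hypothetical rows: the repeated (s1, a1) row counts as a single aliquot.
rows = [
    {"sample_id": "s1", "aliquot_id": "a1"},
    {"sample_id": "s1", "aliquot_id": "a2"},
    {"sample_id": "s1", "aliquot_id": "a1"},
    {"sample_id": "s2", "aliquot_id": "a3"},
]
multi = samples_with_multiple_aliquots(rows)
```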