Closed sgosline closed 9 months ago
This might be my fault from when I fixed the cancer type; still investigating.
Duplications are indeed happening here. I think the other_id is the only unique one.
Thanks for investigating this, should I reassign improve_sample_id by other_id then?
Yes, I think that will be best. As a sanity check, you can check the gene-level data to make sure there are no duplications on a gene-by-sample basis. There are generally two sources of these duplications: gene mappings (in which case we should average) or sample duplications (which should be eliminated).
Okay, after looking into this, it appears that these were assigned by other_id after all, and other_id was mapped to sample_id in the HCMI data. I'm looking into how to get these to map to the aliquot ID level by calling the API, and I'll assign those to other_id instead. Then I'll reassign the improve ID based on that.
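The gene-by-sample sanity check described above could be sketched roughly as follows. This is only an illustration: the row layout and the `tpm` column are assumptions, and `find_duplicate_pairs` is a hypothetical helper, not code from the pipeline.

```python
from collections import Counter


def find_duplicate_pairs(rows, sample_key="improve_sample_id", gene_key="entrez_id"):
    """Return {(sample, gene): count} for every pair that occurs more than once."""
    counts = Counter((row[sample_key], row[gene_key]) for row in rows)
    return {pair: n for pair, n in counts.items() if n > 1}


# Made-up rows for illustration only (not real measurements).
rows = [
    {"improve_sample_id": 1, "entrez_id": 7157, "tpm": 3.2},
    {"improve_sample_id": 1, "entrez_id": 7157, "tpm": 1.1},
    {"improve_sample_id": 2, "entrez_id": 7157, "tpm": 4.0},
]
duplicates = find_duplicate_pairs(rows)
print(duplicates)
```

An empty result would suggest there are no gene-by-sample duplications left; any remaining pairs still need to be traced to either a gene mapping or a sample duplication.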
Also looking at gene-by-sample, I am getting a few duplicates. Here is a small subset with 2 examples. They have different entrez_id but the same gene_name. How would you like me to handle this - should I still average these by gene_name?
OK, I'd do a filter first on protein_coding genes - we do not need nucleolar RNA or pseudogenes.
This removed quite a few, but still about 25 duplicates remaining. Here is a representative subset.
Got it - you can just sum the TPM together for multiple isoforms of the same gene. Just make sure you are doing it by entrez_id and not gene_name. In this example it seems the entrez_ids are distinct.
Okay got it. So only sum if the entrez_id is identical (which would be none in the example above)?
That's correct!
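A minimal sketch of the rule agreed above: keep only protein_coding genes, then sum TPM for rows that share both sample and entrez_id. The `gene_type` and `tpm` column names and the `sum_tpm_by_entrez` helper are assumptions for illustration; the actual table columns may differ.

```python
from collections import defaultdict


def sum_tpm_by_entrez(rows):
    """Sum TPM per (improve_sample_id, entrez_id), keeping protein_coding rows only."""
    totals = defaultdict(float)
    for row in rows:
        if row["gene_type"] != "protein_coding":
            continue  # drop pseudogenes, nucleolar RNA, etc.
        totals[(row["improve_sample_id"], row["entrez_id"])] += row["tpm"]
    return dict(totals)


# Made-up rows: two share an entrez_id, one shares only the gene_name.
rows = [
    {"improve_sample_id": 1, "entrez_id": 100, "gene_name": "GENE_A", "gene_type": "protein_coding", "tpm": 2.0},
    {"improve_sample_id": 1, "entrez_id": 100, "gene_name": "GENE_A", "gene_type": "protein_coding", "tpm": 3.0},
    {"improve_sample_id": 1, "entrez_id": 200, "gene_name": "GENE_A", "gene_type": "protein_coding", "tpm": 1.5},
    {"improve_sample_id": 1, "entrez_id": 300, "gene_name": "GENE_B", "gene_type": "pseudogene", "tpm": 9.0},
]
summed = sum_tpm_by_entrez(rows)
print(summed)
```

Rows with the same gene_name but different entrez_id stay separate here, which matches the case discussed above where nothing would be summed.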
Great. Still working on handling the duplication issue... I've dug into this further and it's pretty tricky to unravel. For example, there is a patient I've found with 1 case_id, 2 primary diagnoses (primary and metastatic), 3 sample_ids, and 5 aliquot_ids. The value that is populated in the manifest is the file ID, and each file contains multiple samples from the same patient. From the file ID, I can call the API to retrieve the case ID, sample ID, etc.
But this will map a single sample or aliquot to multiple diagnoses. I've looked through the API fields and I'm not sure how to get around this yet.
This is the code I'm using to call the API:
import requests

def fetch_metadata_for_samples(uuids):
    """Fetch metadata for given UUIDs."""
    endpoint = "https://api.gdc.cancer.gov/files"
    payload = {
        # Match every file whose ID is in the manifest
        "filters": {
            "op": "in",
            "content": {
                "field": "files.file_id",
                "value": uuids
            }
        },
        "fields": "cases.sample_ids,cases.case_id,cases.samples.sample_id,cases.samples.portions.analytes.aliquots.aliquot_id,cases.samples.sample_type,cases.diagnoses.tissue_or_organ_of_origin,cases.diagnoses.primary_diagnosis",
        "format": "JSON",
        # Request one hit per input UUID
        "size": str(len(uuids))
    }
    response = requests.post(endpoint, json=payload)
    response.raise_for_status()  # fail on HTTP errors instead of parsing an error body
    return response.json()
Ok, this is really fascinating. I think for every patient there will likely be blood and tumor (in organoid or other model) and maybe tumor tissue. This patient also has a metastasis. I would filter for cases.tissue_type is 'Primary tumor' and cases.tumor_descriptor is 'Primary'. Does that reduce the duplicates? I think there will still be different samples.sample_type but hopefully these have distinct measurements...
Still ending up with quite a few cases like this where the aliquot id ends up being the same between metastatic and normal. So two diagnoses for one aliquot/sample.
See bottom two rows:
Can you send me a link to this case id? I wonder if the metastatic diagnosis occurred later, so we might want to take the first diagnosis....
Looks like 3 diagnoses for this one. First metastasis diagnosis was about 100 days later than primary tumor. I think for most that I've seen, the metastasis has been diagnosed at a later date.
I could try filtering out any cases where there are duplicate aliquot ids and "metastatic" is in primary diagnosis?
A sample from a metastatic tumor is different from a sample from an actual metastasis, so I don't want to remove the former. However, in the case where there are different diagnoses, can we just take the earliest one?
I think part of this is due to how I've been generating tables from the API results. Currently reworking the processing loop below to regenerate the tables. Getting closer, but not there yet. I attached the API result and current table generation function if you are interested in seeing them.
Short example result from the API for a manifest with 5 input files:
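One way the "take the earliest diagnosis" idea might look, as a rough sketch. It assumes each diagnosis record carries a numeric timing value; `age_at_diagnosis` is used here as an assumed field name (it is not among the fields requested in the API call above), and `earliest_diagnosis` is a hypothetical helper. The numbers are invented.

```python
def earliest_diagnosis(diagnoses, date_key="age_at_diagnosis"):
    """Return the diagnosis with the smallest date_key; records missing it sort last."""
    if not diagnoses:
        return None
    return min(
        diagnoses,
        key=lambda d: (d.get(date_key) is None, d.get(date_key) or 0),
    )


# Illustrative values only: the metastatic diagnosis is 100 days after the primary.
diagnoses = [
    {"primary_diagnosis": "Adenocarcinoma, metastatic, NOS", "age_at_diagnosis": 20100},
    {"primary_diagnosis": "Adenocarcinoma, NOS", "age_at_diagnosis": 20000},
]
earliest = earliest_diagnosis(diagnoses)
print(earliest["primary_diagnosis"])
```

This keeps exactly one diagnosis per case, so a sample from a metastatic tumor is still retained; only the extra diagnosis rows are collapsed.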
{'data': {'hits': [{'id': '4e9f9377-151e-431e-aba4-a8ea20f6d781', 'cases': [{'case_id': 'df69a9b0-63a6-47f4-b3a7-beb37edf100d', 'diagnoses': [{'tissue_or_organ_of_origin': 'Colon, NOS', 'primary_diagnosis': 'Adenocarcinoma, metastatic, NOS'}, {'tissue_or_organ_of_origin': 'Colon, NOS', 'primary_diagnosis': 'Adenocarcinoma, NOS'}], 'samples': [{'tumor_descriptor': 'Not Applicable', 'sample_id': '689341dd-4b48-46e6-bba7-765b5832eae9', 'sample_type': 'Blood Derived Normal', 'portions': [{'analytes': [{'aliquots': [{'aliquot_id': 'e0ed9fca-55a4-491c-993c-7a6ae36af3f3'}]}]}]}, {'tumor_descriptor': 'Primary', 'sample_id': '84cb8ace-99bf-43f8-90e6-928714974497', 'sample_type': 'Next Generation Cancer Model', 'portions': [{'analytes': [{'aliquots': [{'aliquot_id': '871b7cac-32a6-4620-b1e0-431671232b72'}]}]}]}]}]}, {'id': 'fc739216-5669-4852-ac56-52ad4312848b', 'cases': [{'case_id': 'df69a9b0-63a6-47f4-b3a7-beb37edf100d', 'diagnoses': [{'tissue_or_organ_of_origin': 'Colon, NOS', 'primary_diagnosis': 'Adenocarcinoma, metastatic, NOS'}, {'tissue_or_organ_of_origin': 'Colon, NOS', 'primary_diagnosis': 'Adenocarcinoma, NOS'}], 'samples': [{'tumor_descriptor': 'Metastatic', 'sample_id': '3893bc3e-c85d-4fa1-bb4e-4706c4881c7e', 'sample_type': 'Next Generation Cancer Model', 'portions': [{'analytes': [{'aliquots': [{'aliquot_id': '072df89e-9311-45da-aee6-4dbacefbbebc'}]}]}]}, {'tumor_descriptor': 'Not Applicable', 'sample_id': '689341dd-4b48-46e6-bba7-765b5832eae9', 'sample_type': 'Blood Derived Normal', 'portions': [{'analytes': [{'aliquots': [{'aliquot_id': 'e0ed9fca-55a4-491c-993c-7a6ae36af3f3'}]}]}]}]}]}, {'id': '311d3a32-1d93-43a6-870f-6568e8e7b01f', 'cases': [{'case_id': 'df69a9b0-63a6-47f4-b3a7-beb37edf100d', 'diagnoses': [{'tissue_or_organ_of_origin': 'Colon, NOS', 'primary_diagnosis': 'Adenocarcinoma, metastatic, NOS'}, {'tissue_or_organ_of_origin': 'Colon, NOS', 'primary_diagnosis': 'Adenocarcinoma, NOS'}], 'samples': [{'tumor_descriptor': 'Not Applicable', 'sample_id': 
'689341dd-4b48-46e6-bba7-765b5832eae9', 'sample_type': 'Blood Derived Normal', 'portions': [{'analytes': [{'aliquots': [{'aliquot_id': 'e0ed9fca-55a4-491c-993c-7a6ae36af3f3'}]}]}]}, {'tumor_descriptor': 'Primary', 'sample_id': '84cb8ace-99bf-43f8-90e6-928714974497', 'sample_type': 'Next Generation Cancer Model', 'portions': [{'analytes': [{'aliquots': [{'aliquot_id': '871b7cac-32a6-4620-b1e0-431671232b72'}]}]}]}]}]}, {'id': '5090e9ef-5ace-4d27-8374-c4743aceca0c', 'cases': [{'case_id': 'df69a9b0-63a6-47f4-b3a7-beb37edf100d', 'diagnoses': [{'tissue_or_organ_of_origin': 'Colon, NOS', 'primary_diagnosis': 'Adenocarcinoma, metastatic, NOS'}, {'tissue_or_organ_of_origin': 'Colon, NOS', 'primary_diagnosis': 'Adenocarcinoma, NOS'}], 'samples': [{'tumor_descriptor': 'Metastatic', 'sample_id': '3893bc3e-c85d-4fa1-bb4e-4706c4881c7e', 'sample_type': 'Next Generation Cancer Model', 'portions': [{'analytes': [{'aliquots': [{'aliquot_id': '072df89e-9311-45da-aee6-4dbacefbbebc'}]}]}]}, {'tumor_descriptor': 'Not Applicable', 'sample_id': '689341dd-4b48-46e6-bba7-765b5832eae9', 'sample_type': 'Blood Derived Normal', 'portions': [{'analytes': [{'aliquots': [{'aliquot_id': 'e0ed9fca-55a4-491c-993c-7a6ae36af3f3'}]}]}]}]}]}, {'id': '63c3d8b3-47a6-494a-ba9a-bfa21b6bb579', 'cases': [{'case_id': 'feaef50b-0e38-4a04-b632-81ae538e1c22', 'diagnoses': [{'tissue_or_organ_of_origin': 'Stomach, NOS', 'primary_diagnosis': 'Adenocarcinoma, NOS'}, {'tissue_or_organ_of_origin': 'Stomach, NOS', 'primary_diagnosis': 'Adenocarcinoma, metastatic, NOS'}, {'tissue_or_organ_of_origin': 'Stomach, NOS', 'primary_diagnosis': 'Adenocarcinoma, metastatic, NOS'}], 'samples': [{'tumor_descriptor': 'Primary', 'sample_id': '9457021f-1b50-4f63-bac3-e1f09bf30738', 'sample_type': 'Primary Tumor', 'portions': [{'analytes': [{'aliquots': [{'aliquot_id': '9a7a775d-c8b1-46f8-9077-453a47896133'}]}]}]}, {'tumor_descriptor': 'Not Applicable', 'sample_id': 'e1982beb-0907-4cb1-a261-7fca6211cc49', 'sample_type': 'Blood Derived Normal', 'portions': [{'analytes': [{'aliquots': [{'aliquot_id': 'c3a69a3d-c597-4dd6-a0c1-80f5d544f18a'}]}]}]}]}]}], 'pagination': {'count': 5, 'total': 5, 'size': 5, 'from': 0, 'sort': '', 'page': 1, 'pages': 1}}, 'warnings': {}}
Current code to turn this into a table:
def extract_data(data):
    extracted = []
    for hit in data['data']['hits']:
        for case in hit['cases']:
            for idx, sample in enumerate(case['samples']):
                for portion in sample['portions']:
                    for analyte in portion['analytes']:
                        for aliquot in analyte['aliquots']:
                            # Samples and diagnoses are paired by list position (idx);
                            # a sample with no diagnosis at its index is skipped.
                            if idx < len(case['diagnoses']):
                                diagnosis = case['diagnoses'][idx]
                                print(analyte['aliquots'], idx)  # debug output
                                extracted.append({
                                    'id': hit['id'],
                                    'case_id': case['case_id'],
                                    'tissue_or_organ_of_origin': diagnosis['tissue_or_organ_of_origin'],
                                    'primary_diagnosis': diagnosis['primary_diagnosis'],
                                    'sample_id': sample['sample_id'],
                                    'sample_type': sample['sample_type'],
                                    'tumor_descriptor': sample.get('tumor_descriptor', None),
                                    'aliquot_id': aliquot['aliquot_id']
                                })
    return extracted
So it's quite close to working. Transcriptomics is working well, but I noticed that it's now filtering out about 50% of the results from the mutations and copy_number outputs.
I dug in a bit deeper and found that this results from a filter on the samples.csv file. Without this filter, aliquot IDs are not unique, so I'm working on different assignment logic for this case: df.tumor_descriptor == "Not Applicable".
High level, this basically comes down to some samples having multiple aliquot IDs.
When I parse out data by improve sample ID, I can't distinguish the model type.
![Screenshot 2023-10-06 at 4 39 58 PM](https://github.com/PNNL-CompBio/candleDataProcessing/assets/1847866/45b731f8-5478-42ec-b8e3-fe654308f26d)
We need to definitely keep track of which samples come from the same patient, but we need each sample to be distinct.
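A rough sketch of one way to satisfy both requirements: one row per distinct aliquot, with case_id kept on every row so samples from the same patient can still be grouped. The `unique_aliquot_rows` helper is hypothetical, and the example input is a trimmed copy of two hits from the API result above (diagnoses omitted).

```python
def unique_aliquot_rows(data):
    """Collect one row per aliquot_id, keeping case_id to link samples to a patient."""
    rows = {}
    for hit in data["data"]["hits"]:
        for case in hit["cases"]:
            for sample in case["samples"]:
                for portion in sample.get("portions", []):
                    for analyte in portion.get("analytes", []):
                        for aliquot in analyte.get("aliquots", []):
                            # setdefault keeps the first occurrence; the same aliquot
                            # seen again through another file is ignored.
                            rows.setdefault(aliquot["aliquot_id"], {
                                "case_id": case["case_id"],
                                "sample_id": sample["sample_id"],
                                "sample_type": sample["sample_type"],
                                "tumor_descriptor": sample.get("tumor_descriptor"),
                                "aliquot_id": aliquot["aliquot_id"],
                            })
    return list(rows.values())


def _sample(descriptor, sample_id, sample_type, aliquot_id):
    """Build a sample record shaped like the API result (helper for this example)."""
    return {"tumor_descriptor": descriptor, "sample_id": sample_id, "sample_type": sample_type,
            "portions": [{"analytes": [{"aliquots": [{"aliquot_id": aliquot_id}]}]}]}


blood = _sample("Not Applicable", "689341dd-4b48-46e6-bba7-765b5832eae9",
                "Blood Derived Normal", "e0ed9fca-55a4-491c-993c-7a6ae36af3f3")
primary = _sample("Primary", "84cb8ace-99bf-43f8-90e6-928714974497",
                  "Next Generation Cancer Model", "871b7cac-32a6-4620-b1e0-431671232b72")
metastatic = _sample("Metastatic", "3893bc3e-c85d-4fa1-bb4e-4706c4881c7e",
                     "Next Generation Cancer Model", "072df89e-9311-45da-aee6-4dbacefbbebc")
case_id = "df69a9b0-63a6-47f4-b3a7-beb37edf100d"
data = {"data": {"hits": [
    {"id": "4e9f9377-151e-431e-aba4-a8ea20f6d781", "cases": [{"case_id": case_id, "samples": [blood, primary]}]},
    {"id": "fc739216-5669-4852-ac56-52ad4312848b", "cases": [{"case_id": case_id, "samples": [metastatic, blood]}]},
]}}
rows = unique_aliquot_rows(data)
for row in rows:
    print(row["case_id"], row["tumor_descriptor"], row["sample_type"], row["aliquot_id"])
```

Because sample_type and tumor_descriptor travel with each row, the model type stays distinguishable even after a new sample ID is assigned per aliquot.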