PNNL-CompBio / decomprolute

A suite of scientific workflows to assess metrics to compare efficacy of protein-based tumor deconvolution algorithms.
MIT License

Update getAllDatasets.py with additional cancer sets #48

Closed: annapamma closed this issue 3 years ago

annapamma commented 3 years ago

It appears there's a discrepancy between the datasets that have been loaded onto the Docker image and what can be saved to file.

I noticed that protDataSetsCLI.py can accept the following values for cancerType:

brca ccrcc colon ovarian endometrial gbm hnscc lscc luad

However, getAllDatasets.py only installs the data for ['brca', 'ccrcc', 'endometrial', 'colon', 'ovarian', 'luad'].

It looks like this list should be updated to: ['brca', 'ccrcc', 'endometrial', 'colon', 'ovarian', 'luad', 'gbm', 'hnscc', 'lscc']
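A quick way to see the gap is to diff the two lists. This is only a minimal sketch with the values copied from this issue; the variable names are hypothetical and are not taken from either script:

```python
# Values that protDataSetsCLI.py accepts for cancerType (copied from this issue).
cli_cancer_types = ["brca", "ccrcc", "colon", "ovarian", "endometrial",
                    "gbm", "hnscc", "lscc", "luad"]

# Values that getAllDatasets.py currently installs (copied from this issue).
installed_cancer_types = ["brca", "ccrcc", "endometrial", "colon", "ovarian", "luad"]

# Cancer types the CLI accepts but that are never downloaded onto the image.
missing = sorted(set(cli_cancer_types) - set(installed_cancer_types))
print(missing)  # ['gbm', 'hnscc', 'lscc']
```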

sgosline commented 3 years ago

Assigned to myself. Some of these cancers require a password to download, which I do not have, but I think some should be public. I will fix.

annapamma commented 3 years ago

At the time of this comment, the CPTAC dataset availability is as follows:

| Dataset name | Description | Data reuse status | Publication link |
|---|---|---|---|
| Brca | breast cancer | no restrictions | https://pubmed.ncbi.nlm.nih.gov/33212010/ |
| Ccrcc | clear cell renal cell carcinoma (kidney) | no restrictions | https://pubmed.ncbi.nlm.nih.gov/31675502/ |
| Colon | colorectal cancer | no restrictions | https://pubmed.ncbi.nlm.nih.gov/31031003/ |
| Endometrial | endometrial carcinoma (uterine) | no restrictions | https://pubmed.ncbi.nlm.nih.gov/32059776/ |
| **Gbm** | **glioblastoma** | **password access only** | **unpublished** |
| Hnscc | head and neck squamous cell carcinoma | no restrictions | https://pubmed.ncbi.nlm.nih.gov/33417831/ |
| **Lscc** | **lung squamous cell carcinoma** | **password access only** | **unpublished** |
| Luad | lung adenocarcinoma | no restrictions | https://pubmed.ncbi.nlm.nih.gov/32649874/ |
| Ovarian | high grade serous ovarian cancer | no restrictions | https://pubmed.ncbi.nlm.nih.gov/27372738/ |
| **Pdac** | **pancreatic ductal adenocarcinoma** | **password access only** | **unpublished** |

As such, the datasets have been updated to the following (added hnscc): ['brca', 'ccrcc', 'endometrial', 'colon', 'ovarian', 'hnscc', 'luad']
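The updated list is simply the availability listing above filtered on reuse status. A small sketch of that filter, with the statuses transcribed into a hypothetical dict (this is an illustration, not how getAllDatasets.py stores them):

```python
# Reuse status per CPTAC dataset, transcribed from the availability listing above.
# The dict itself is a hypothetical structure used only for this illustration.
reuse_status = {
    "brca": "no restrictions",
    "ccrcc": "no restrictions",
    "colon": "no restrictions",
    "endometrial": "no restrictions",
    "gbm": "password access only",
    "hnscc": "no restrictions",
    "lscc": "password access only",
    "luad": "no restrictions",
    "ovarian": "no restrictions",
    "pdac": "password access only",
}

# Keep only the datasets that can be downloaded without a password.
public_datasets = sorted(
    name for name, status in reuse_status.items() if status == "no restrictions"
)
print(public_datasets)
```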

sgosline commented 3 years ago

Awesome, I cut and pasted this into the READMEs (main, mRNAdata, protData). Still testing them all with all the algorithms.

annapamma commented 3 years ago

Reopening because HNSCC is failing on Circle (although it works locally with CWL). I think this is an issue with permissions in the virtual environment.

Going to troubleshoot.

sgosline commented 3 years ago

Interesting, I think this works for me as well. I wonder if it's a Docker image/build issue?

sgosline commented 3 years ago

I think I found it: hnscc wasn't added to the getAllDatasets.py in the mrna folder, just the one in the protein folder. Testing now in PR #105. If it wasn't that, it was an indexing issue (the metadata selected patients that had no transcriptomics). That has also been fixed.
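Both failure modes can be checked with a few lines. This is a rough sketch, not the code in the PR: the mrna list is an assumed pre-fix state, the patient IDs are made up, and `unmatched_patients` is a hypothetical helper.

```python
# Dataset list from the protein folder (as updated earlier in this thread).
protein_datasets = ["brca", "ccrcc", "endometrial", "colon", "ovarian", "hnscc", "luad"]
# ASSUMED pre-fix contents of the list in the mrna folder (hnscc not yet added).
mrna_datasets = ["brca", "ccrcc", "endometrial", "colon", "ovarian", "luad"]

# Check 1: datasets present in one folder's list but not the other.
only_in_protein = sorted(set(protein_datasets) - set(mrna_datasets))
print(only_in_protein)  # ['hnscc']


def unmatched_patients(metadata_ids, transcriptomics_ids):
    """Return patient IDs in the metadata that have no transcriptomics data."""
    available = set(transcriptomics_ids)
    return [pid for pid in metadata_ids if pid not in available]


# Check 2: patients selected from metadata must also exist in the mRNA data.
# These IDs are invented purely for illustration.
metadata_ids = ["P001", "P002", "P003"]
transcriptomics_ids = ["P001", "P003"]

missing_ids = unmatched_patients(metadata_ids, transcriptomics_ids)
usable_ids = [pid for pid in metadata_ids if pid not in missing_ids]
print(missing_ids)  # ['P002']
print(usable_ids)   # ['P001', 'P003']
```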