bioshed / bioshed_atlas

GNU General Public License v3.0

bioshed search tcga #11

Closed bioshed closed 1 year ago

bioshed commented 1 year ago

Registry of Open Data on AWS: https://registry.opendata.aws/

TCGA and GDC are also on AWS!

- https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga
- https://github.com/NCI-GDC/gdc-docs
- https://gdc.cancer.gov/developers/gdc-data-model
- https://gdc.cancer.gov/access-data/gdc-data-transfer-tool
- https://docs.gdc.cancer.gov/Data_Transfer_Tool/Users_Guide/Getting_Started/
- https://docs.gdc.cancer.gov/Data_Transfer_Tool/Users_Guide/Data_Download_and_Upload/#resuming-a-failed-download

bioshed commented 1 year ago

(base) jerry@Jerrys-MacBook-Air tcga % cut -f 5 gdc_manifest.2022-11-23.tongue-base.full.txt | sort -u
Biospecimen
Clinical
Copy Number Variation
DNA Methylation
Proteome Profiling
Simple Nucleotide Variation
Transcriptome Profiling
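
The `cut -f 5 ... | sort -u` step above could also be done from Python, which may be handier once the manifest is processed programmatically. This is a minimal sketch: the function name `unique_column_values` is hypothetical, and it only assumes what the command shows, namely a tab-separated file whose fifth column holds the data category.

```python
def unique_column_values(lines, column=5, delimiter="\t"):
    """Return the sorted distinct values of a 1-based column.

    Roughly equivalent to `cut -f <column> | sort -u`. Lines that have
    fewer than `column` fields are skipped (an assumption of this sketch).
    """
    values = set()
    for line in lines:
        fields = line.rstrip("\r\n").split(delimiter)
        if len(fields) >= column:
            values.add(fields[column - 1])
    return sorted(values)


def unique_column_values_in_file(path, column=5):
    """Convenience wrapper that reads the manifest from disk."""
    with open(path, encoding="utf-8") as handle:
        return unique_column_values(handle, column=column)
```

For example, `unique_column_values_in_file("gdc_manifest.2022-11-23.tongue-base.full.txt")` would list the distinct data categories in that manifest.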

bioshed commented 1 year ago
- Download annotations to a single JSON file per tissue:

https://portal.gdc.cancer.gov/repository?facetTab=cases&filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22cases.primary_site%22%2C%22value%22%3A%5B%22breast%22%5D%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22files.access%22%2C%22value%22%3A%5B%22open%22%5D%7D%7D%5D%7D
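
The `filters` parameter in the URL above is a URL-encoded JSON object: an `and` of two `in` clauses, one on `cases.primary_site` (`breast`) and one on `files.access` (`open`). A minimal sketch of building the same kind of URL for any tissue follows. The helper names `build_filters` and `repository_url` are hypothetical, and the sketch only reproduces the structure visible in the URL above.

```python
import json
from urllib.parse import quote

PORTAL_REPOSITORY = "https://portal.gdc.cancer.gov/repository"


def build_filters(primary_site, access="open"):
    """Build the filter object seen in the portal URL (decoded form)."""
    return {
        "op": "and",
        "content": [
            {"op": "in",
             "content": {"field": "cases.primary_site", "value": [primary_site]}},
            {"op": "in",
             "content": {"field": "files.access", "value": [access]}},
        ],
    }


def repository_url(primary_site, access="open"):
    """Return a portal repository URL with the filters URL-encoded."""
    # Compact separators keep the JSON free of spaces, as in the URL above.
    filters = json.dumps(build_filters(primary_site, access), separators=(",", ":"))
    return f"{PORTAL_REPOSITORY}?facetTab=cases&filters={quote(filters, safe='')}"
```

Calling `repository_url("breast")` yields a URL of the same form as the one above; swapping in another primary site gives the per-tissue link.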

bioshed commented 1 year ago

TCGA repository on GDC https://portal.gdc.cancer.gov/repository

https://portal.gdc.cancer.gov/repository?facetTab=cases

bioshed commented 1 year ago

Allow search by:
1) site of primary tumor (primary site)
2) disease type
3) data category (SNV, sequencing reads, CNV, structural variation, transcriptome profiling, etc.)
4) data type (single cell, DGE, methylation, etc.)
5) assay (RNA-Seq, WES/WXS, etc.)
6) data format (bam/sam, vcf, ...)
7) platform (Illumina, etc.)

Create annotation files for each.

bioshed commented 1 year ago

[TODO] Download manifest and JSON files for each. Create full manifest (annotation) files for each.
[TODO] Write a function that takes a row of a full manifest file and downloads the file.
[TODO] Write a function that takes search terms and looks them up in the full manifest files. Think of a way to do this with the API, or perhaps use a “translator” file. E.g., bioshed search tcga breast cancer rna-seq => will search for “breast rnaseq” (cancer or tumor or tumour is implied, and the dash is removed) in all file names, then “breast”, then “rnaseq”… and then take the intersection.
[TODO] Can search by specific terms; this will be more accurate.
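
The free-text search idea above can be sketched roughly as follows. This is not the bioshed implementation: `normalize_terms` and `search_file_names` are hypothetical names, and the sketch simplifies the plan by skipping the initial full-phrase pass and taking only the intersection of the per-term matches. It assumes the dash is removed from file names as well as from the query.

```python
# Words that are implied by the search and therefore dropped from the query.
IMPLIED_TERMS = {"cancer", "tumor", "tumour"}


def normalize_terms(query):
    """Lowercase the query, remove dashes, and drop implied words."""
    words = query.lower().replace("-", "").split()
    return [word for word in words if word not in IMPLIED_TERMS]


def search_file_names(query, file_names):
    """Return the file names that match every normalized search term."""
    terms = normalize_terms(query)
    if not terms:
        return []
    matches = None
    for term in terms:
        # Compare against a dash-free, lowercase copy of each file name.
        hits = {name for name in file_names
                if term in name.lower().replace("-", "")}
        matches = hits if matches is None else matches & hits
    return sorted(matches)
```

With this sketch, the query `breast cancer rna-seq` is reduced to the terms `breast` and `rnaseq`, and only file names containing both are returned.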

bioshed commented 1 year ago

Valid search categories are: --tissue / --assay / --assaytarget / --celltype / --disease / --genome / --filetype / --platform / --species
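
One way to accept these category flags is with Python's standard `argparse` module. The sketch below is an assumption about how such parsing could look, not the actual bioshed argument handling; `build_parser` and `parse_search_args` are hypothetical names, and only the flag names come from the list above.

```python
import argparse

# Flag names taken from the list of valid search categories.
SEARCH_CATEGORIES = [
    "tissue", "assay", "assaytarget", "celltype", "disease",
    "genome", "filetype", "platform", "species",
]


def build_parser():
    """Create a parser with one optional flag per search category."""
    # allow_abbrev=False avoids ambiguity between e.g. --assay and --assaytarget.
    parser = argparse.ArgumentParser(prog="bioshed search tcga", allow_abbrev=False)
    for category in SEARCH_CATEGORIES:
        parser.add_argument(f"--{category}", help=f"filter by {category}")
    return parser


def parse_search_args(argv):
    """Return only the categories the user actually supplied."""
    args = vars(build_parser().parse_args(argv))
    return {name: value for name, value in args.items() if value is not None}
```

For instance, `parse_search_args(["--tissue", "breast", "--assay", "rna-seq"])` returns a dictionary with just the `tissue` and `assay` entries, which could then be matched against the full manifest files.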