GW-HIVE / biomuta-old

Documentation: https://biomuta.readthedocs.io/en/latest/

TCGA Google Big Data Query Evaluation #4

Open jeet-vora opened 2 months ago

jeet-vora commented 2 months ago

For TCGA Google Big Data Query check

mariacuria commented 2 months ago

@seankim658

mariacuria commented 1 month ago

There are 2 parts to obtaining data from TCGA:

  1. Primary TCGA data
  2. TCGA controlled-access data
    • Hosted at dbGaP. @rajamazumder is going to give me access.

Big Query

mariacuria commented 1 month ago

Google BigQuery Pricing

BigQuery pricing has two main components:

  1. Compute pricing is the cost to process queries, including SQL queries, user-defined functions, scripts, and certain data manipulation language (DML) and data definition language (DDL) statements.
    1.1. On-demand pricing (per TiB). With this pricing model, you are charged for the number of bytes processed by each query ($6.25 per TiB). The first 1 TiB of query data processed per month is free.
    1.2. Capacity pricing (per slot-hour). With this pricing model, you are charged for compute capacity used to run queries, measured in slots (virtual CPUs) over time. This model takes advantage of BigQuery editions. You can use the BigQuery autoscaler or purchase slot commitments, which are dedicated capacity that is always available for your workloads, at a lower price.
    • Standard: $0.04 / slot hour. No commitment. Billed per second with a 1 minute minimum
    • Enterprise: $0.06 / slot hour. Billed per second with a 1 minute minimum
    • Enterprise 1 year: $0.048 / slot hour. Billed for 1 year
    • Enterprise 3 years: $0.036 / slot hour. Billed for 3 years
    • Enterprise Plus: $0.1 / slot hour. Billed per second with a 1 minute minimum
    • Enterprise Plus 1 year: $0.08 / slot hour. Billed for 1 year
    • Enterprise Plus 3 years: $0.06 / slot hour. Billed for 3 years
  2. Storage pricing is the cost to store data that you load into BigQuery. You pay for active storage and long-term storage.
    2.1. Active storage includes any table or table partition that has been modified in the last 90 days. The first 10 GiB is free each month.
    2.2. Long-term storage includes any table or table partition that has not been modified for 90 consecutive days. The price of storage for that table automatically drops by approximately 50%. There is no difference in performance, durability, or availability between active and long-term storage. The first 10 GiB is free each month.
    • Active logical storage: $0.02 per GiB per month.
    • Long-term logical storage: $0.01 per GiB per month.
    • Active physical storage: $0.04 per GiB per month.
    • Long-term physical storage: $0.02 per GiB per month.

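Loading, copying, exporting, deleting and metadata operations up to certain limits are free. BigQuery also has a free usage tier, which includes the first 1 TiB of query data processed and the first 10 GiB of storage each month, as noted above.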

To estimate the cost of running a query, calculate the bytes processed by various queries, and get a monthly cost estimate based on your projected usage, see: https://cloud.google.com/bigquery/docs/best-practices-costs
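One way to get the bytes processed before paying for a query is a dry run. The sketch below assumes the `google-cloud-bigquery` Python client is installed and that credentials are configured. The helper names are hypothetical, the SQL is a placeholder rather than a real TCGA table, and the conversion uses the $6.25 per TiB on-demand rate quoted earlier without subtracting the free 1 TiB.

```python
def dry_run_bytes(client, sql, job_config):
    """Return the number of bytes a query would process.

    `client` is expected to behave like google.cloud.bigquery.Client and
    `job_config` like a QueryJobConfig created with dry_run=True, so the
    query is validated and sized but never executed or billed.
    """
    job = client.query(sql, job_config=job_config)
    return job.total_bytes_processed


def bytes_to_usd(num_bytes, usd_per_tib=6.25):
    """Convert bytes processed to an on-demand cost (free tier ignored)."""
    return num_bytes / 2**40 * usd_per_tib


def estimate_with_bigquery(sql):
    """Hypothetical wrapper: needs google-cloud-bigquery and credentials."""
    from google.cloud import bigquery  # imported lazily on purpose

    client = bigquery.Client()
    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    num_bytes = dry_run_bytes(client, sql, config)
    return num_bytes, bytes_to_usd(num_bytes)
```

Calling `estimate_with_bigquery("SELECT ...")` with a real query against the TCGA tables we end up using should return the scanned bytes and a rough dollar figure without running the query.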

mariacuria commented 1 month ago

TCGA License

Data from TCGA projects are organized into two tiers: Open Access and Controlled Access.

The Open Access data tier contains data that cannot be attributed to an individual research participant. The Open Access data tier does not require user certification. Data in the Open Access tier are available in the TCGA Data Portal.

The Controlled Access data tier contains individual-level genotype data that are unique to an individual. Access to data in the Controlled Access data tier requires user certification through the dbGaP Authorized Access process mentioned above, and is subject to the 2023 Data Use Certification Agreement. Here's the summary:

  1. Introduction and Statement of Policy

    • NIH repositories store and share controlled-access human data securely.
    • Data sharing must respect participants’ informed consent and privacy.
  2. Terms of Access

    2.1. Research Use

    • Approved Users can only use data for the specified research project.
    • Cloud computing use requires specific permissions.

      2.2. Requester and Approved User Responsibilities

    • Users must follow NIH Security Best Practices and relevant laws.
    • Annual progress updates and project renewals are required.

      2.3. Public Posting of Approved Users’ Research Use Statement

    • The PI agrees to publicly post information about themselves, their approved research use, and related details on the dbGaP website, including project specifics and citations of resulting publications.

      2.4. Non-Identification

    • Users must not identify or contact individual participants.
    • Identifiable information can only be used with specific IRB approval.

      2.5. Certificate of Confidentiality

    • This certificate protects sensitive information in NIH databases from being disclosed in legal proceedings or to unauthorized individuals. Disclosure is only permitted under specific conditions, such as with the individual’s consent or for medical treatment.

      2.6. Non-Transferability

    • NIH controlled-access datasets and their derivatives must be retained by the approved users and cannot be distributed to unauthorized entities or individuals, ensuring data security and compliance with NIH policies.

      2.7. Data Security and Unauthorized Data Release

    • The Requester and Approved Users are responsible for managing and protecting controlled-access datasets according to NIH security practices, and for promptly reporting any unauthorized data sharing or breaches.

      2.8. Policy Compliance Violations

    • NIH may terminate data access if the requester violates the NIH GDS Policy, Data Use Certification Agreement, or Genomic Data User Code of Conduct, and requires prompt notification and remediation of any unauthorized data sharing or breaches.

      2.9. Intellectual Property

    • The Requester and Approved Users agree that everyone with access to the data will follow the applicable intellectual property principles.

      2.10. Dissemination of Research Findings and Acknowledgement of Controlled-Access Datasets Subject to the NIH GDS Policy

    • Approved Users are encouraged to widely disseminate research findings from NIH-controlled datasets through publications and presentations, and must acknowledge the original data contributors and funding sources in all disclosures.

      2.11. Research Use Reporting

    • The PI must provide annual progress updates, including data usage, publications, future research plans, and any policy violations, as part of the project renewal or close-out process.

      2.12. Non-Endorsement, Indemnification

    • The NIH and data contributors do not guarantee the accuracy or reliability of the data and are not liable for any loss or damage resulting from its use.

      2.13. Termination and Data Destruction

    • Upon project completion, all copies and derivatives of the dataset must be destroyed, except for data retained to comply with institutional policies, laws, or scientific transparency, which must still be managed according to NIH security practices.

mariacuria commented 1 month ago

Clinical patient information

According to the NIH Bioinformatics Training and Education Program, TCGA comprises genomic, epigenomic, transcriptomic, and proteomic data combined with rich clinical information and related metadata from over 11,000 patients representing 33 cancer types.