SamGa3 / microbiome_reconstruction

GNU General Public License v3.0

Inquiry about the Sources of COAD_technical_metadata.txt and Other Text Files in Metadata #4

Closed HaozhongMa closed 1 month ago

HaozhongMa commented 11 months ago

Hello Gaia,

I hope this message finds you well. I am writing to inquire about the sources of the text files included in the metadata, specifically COAD_technical_metadata.txt, along with other similar text files in the repository.

As a user of your repository, I am keen to understand the origin of these text files and the methodology used to generate them. I tried to find them on the TCGA website but could not.

Could you please provide some insight into the following aspects:

- The origin of COAD_technical_metadata.txt and other text files in the metadata.
- The process or methodology used to compile or generate these files.
- Any relevant documentation or references that could aid in understanding these files better.

Your assistance in this matter is greatly appreciated as it will significantly enhance my understanding and usage of the data provided in your repository.

Thank you for your time and consideration. I look forward to your response.

Best regards, Haozhong Ma

SamGa3 commented 10 months ago

Dear Haozhong Ma,

As you have observed, the TCGA metadata I used for this project is not directly available on TCGA or GDC. Instead, it is a summary of the information available through the GDC API. You can find an overview of the available fields here: https://docs.gdc.cancer.gov/API/Users_Guide/Appendix_A_Available_Fields/ . Detailed information about using the API can be found here: https://docs.gdc.cancer.gov/API/Users_Guide/Search_and_Retrieval/ .
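For readers who want to try the API route, here is a minimal Python sketch of a query against the GDC `files` endpoint. It is only an illustration: the field selection in `EXAMPLE_FIELDS` and the helper names `build_files_query` and `fetch_file_metadata` are assumptions made for this example, not the exact query used to build the metadata files in the repository.

```python
import json
import urllib.request

GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_files_query(project_id, fields, size=100):
    """Build the JSON payload for a POST request to the GDC files endpoint."""
    filters = {
        "op": "in",
        "content": {"field": "cases.project.project_id", "value": [project_id]},
    }
    return {
        "filters": filters,
        "fields": ",".join(fields),  # GDC expects a comma-separated string
        "format": "JSON",
        "size": str(size),
    }


def fetch_file_metadata(payload):
    """Send the query to GDC and return the list of matching file records."""
    request = urllib.request.Request(
        GDC_FILES_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["data"]["hits"]


# Illustrative field selection (an assumption, not the repository's actual list).
EXAMPLE_FIELDS = [
    "file_id",
    "file_name",
    "experimental_strategy",
    "cases.submitter_id",
    "cases.samples.submitter_id",
]

payload = build_files_query("TCGA-COAD", EXAMPLE_FIELDS, size=10)
# hits = fetch_file_metadata(payload)  # uncomment to query the live API
```

Swapping `"TCGA-COAD"` for another project id such as `"TCGA-PAAD"` gives the equivalent query for that cohort. The network call is left commented out so the sketch can be inspected without contacting GDC.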

To address the potential redundancy in some TCGA barcodes (resulting from samples being analyzed multiple times), I opted to identify each file using the GDC file id. Please note that GDC recently updated the files using a newer protocol, resulting in different file ids. To match the current files to the ones I used, you can use TCGA barcodes or download the manifest from this link https://github.com/NCI-GDC/gdc-docs/blob/develop/docs/Data/Release_Notes/GCv36_Manifests/ . The old files are no longer available.
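The barcode-based matching described above could be sketched roughly as follows in Python. The record layout (dicts with `file_id` and `barcode` keys), the function name `match_file_ids`, and the sample values are all hypothetical and chosen for illustration; the real metadata files and manifests use their own column names.

```python
from collections import defaultdict


def match_file_ids(old_records, new_records):
    """Map each old file id to the current file ids that share its barcode.

    Both inputs are lists of dicts with hypothetical keys "file_id" and "barcode".
    """
    new_by_barcode = defaultdict(list)
    for record in new_records:
        new_by_barcode[record["barcode"]].append(record["file_id"])

    # A barcode can map to several files, so every candidate is kept.
    return {
        record["file_id"]: list(new_by_barcode.get(record["barcode"], []))
        for record in old_records
    }


# Made-up placeholder ids and barcodes, for illustration only.
old = [
    {"file_id": "old-1", "barcode": "TCGA-XX-0001-01A"},
    {"file_id": "old-2", "barcode": "TCGA-XX-0002-01A"},
]
new = [
    {"file_id": "new-1", "barcode": "TCGA-XX-0001-01A"},
    {"file_id": "new-2", "barcode": "TCGA-XX-0001-01A"},
]

matches = match_file_ids(old, new)
```

Because a barcode may correspond to more than one file, each old id maps to a list. An empty list means no current file shares that barcode, and a list with several entries needs a manual decision.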

The remaining metadata were sourced from previously published studies. If you have specific files in mind, could you please specify which ones you are interested in?

Best regards, Gaia

P.S. I noticed that you closed your previous question. Although I couldn't reproduce the problem you encountered, I did run into errors when Humann attempted to download its repositories. Unfortunately, I have limited control over this aspect. I hope you were able to find a solution; in any case, this step was non-trivial for reproducing the results of the paper.

HaozhongMa commented 10 months ago

Dear Gaia,

I plan to organize the metadata for PAAD. I previously attempted to extract this information from the "biospecimen" and "clinical" files downloaded from TCGA, but there are still some columns that I couldn't retrieve from these sources. I also couldn't find these columns in the API. How should I go about obtaining them?

PAAD_clinical_metadata

PAAD_technical_metadata

As for Humann, I previously encountered some issues, but when I retried running it, the problem seemed to be resolved. I am not sure what happened.

Thank you for your attention and readiness to help.

Best regards, Haozhong Ma

SamGa3 commented 10 months ago

Hi Haozhong Ma,

Certain columns were retrieved from other papers, e.g. MSI status was retrieved from this paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5972025/ . I briefly listed these cases in the Methods section of my paper. Note that not all the columns available for COAD are available for other datasets (e.g. history_colon_polyps).

I guess there was a connection problem when downloading the Humann references, which has now been resolved.

Best regards, Gaia