evolgeniusteam / GMrepoProgrammableAccess

programmable access to GM repo
GNU General Public License v3.0

mapping microbiome->phenotype labels by run/sample id #5

Open raqueldias opened 2 years ago

raqueldias commented 2 years ago

Hi, I have downloaded the microbial abundances and phenotype information. But I can't find a data dictionary or a way to map the run/sample ids to the microbiome+phenotype data (e.g., I want to be able to map the microbial abundances to the phenotypes of each run/sample so I can have labeled data for testing a machine learning classifier). How can I do that by using the programmable access tool?

evolgeniusteam commented 2 years ago

My apologies for the long delay. I just found the notification in my junk mailbox.

You will find lots of useful information at the download page: https://evolgeniusteam.github.io/gmrepodocumentation/usage/downloaddatafromgmrepo/.

For example, the run-to-phenotype information can be found in the following downloads: "Runs to phenotypes" or "All runs".

Let me know if you need further assistance.

Weihua

raqueldias commented 2 years ago

Thanks! I have downloaded all those files. The only problem is that the relative abundance file has no run information; it contains only "loaded uid" in the first column as an identifier (this is the file I'm talking about: https://gmrepo.humangut.info/Downloads/SQLDumps/species_abundance.txt.gz). So my question now is whether there is a way to map the uid numbers in the first column of that file to the phenotype table, which has no uid column.

evolgeniusteam commented 2 years ago

I see. There is actually a file that maps uid to accession_id/run_id available for download: https://gmrepo.humangut.info/Downloads/SQLDumps/samples_loaded.txt.gz. You will still need the other files I mentioned previously to map the run_ids to the metadata, though. Cheers
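
The two-step mapping described above (uid to run_id, then run_id to phenotype) could be sketched with pandas roughly as follows. This is only a minimal sketch: the column names `loaded_uid`, `uid`, `run_id`, `taxon`, `relative_abundance` and `phenotype` are assumptions, not the confirmed headers of the GMrepo downloads, and the small tables are made-up toy data standing in for the real files. Check the actual column names in each download and pass them in through the function parameters.

```python
import pandas as pd


def label_abundances(abundance, samples, phenotypes,
                     uid_col="loaded_uid",   # assumed uid column in the abundance table
                     sample_uid_col="uid",   # assumed uid column in samples_loaded
                     run_col="run_id"):      # assumed run column shared with the phenotype table
    """Attach phenotype labels to abundance rows via the uid -> run_id table."""
    # Keep only the two columns needed for the lookup, one row per uid/run pair.
    uid_to_run = samples[[sample_uid_col, run_col]].drop_duplicates()
    # Step 1: abundance uid -> run_id.
    with_runs = abundance.merge(
        uid_to_run, left_on=uid_col, right_on=sample_uid_col, how="inner"
    )
    # Step 2: run_id -> phenotype.
    return with_runs.merge(phenotypes, on=run_col, how="inner")


# Toy stand-ins for the three downloads (hypothetical values, not real GMrepo data).
abundance = pd.DataFrame({
    "loaded_uid": [1, 1, 2],
    "taxon": ["A", "B", "A"],
    "relative_abundance": [60.0, 40.0, 100.0],
})
samples = pd.DataFrame({"uid": [1, 2], "run_id": ["RUN_A", "RUN_B"]})
phenotypes = pd.DataFrame({
    "run_id": ["RUN_A", "RUN_B"],
    "phenotype": ["Health", "Disease_X"],
})

labeled = label_abundances(abundance, samples, phenotypes)
print(labeled[["run_id", "taxon", "relative_abundance", "phenotype"]])
```

The real downloads would presumably be read with `pd.read_csv` (which decompresses `.gz` files automatically); the delimiter is not stated in this thread, so inspect the first few lines of each file before choosing `sep`. Note that if a run carries more than one phenotype, the second merge will repeat its abundance rows once per phenotype.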