AstrobioMike / GToTree

A user-friendly workflow for phylogenomics
GNU General Public License v3.0

HTTP error with downloading and parsing archaeal and bacteria metadata tables from GTDB #72

Closed oduwoleiyanu closed 1 year ago

oduwoleiyanu commented 1 year ago

Hi, I noticed this error while running gtt-test.sh. I think the HTTP address of the GTDB database might have changed. How do I resolve this? Here is the error:

File "/home/ioduwole/miniconda3/envs/gtotree/lib/python3.9/urllib/request.py", line 641, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

AstrobioMike commented 1 year ago

Thank you so much for writing in about this, @oduwoleiyanu!

Yes, there was indeed a change in GTDB here (https://data.gtdb.ecogenomic.org/releases/latest/): their version file used to be called just VERSION and is now VERSION.txt.

Thanks to a previous issue (https://github.com/AstrobioMike/GToTree/issues/71), i fixed this in a separate program (gtt-get-accessions-from-GTDB), but i apparently didn't catch it and fix it in a helper program. And if the other program was run first, this problem didn't occur, which must be why i missed it when i set up my latest version installation 🤦‍♂️
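One defensive way to cope with a rename like this is to try the new file name first and fall back to the old one on a 404. The following is only a minimal Python sketch of that idea, not the fix that went into GToTree. The function name `fetch_first_available` and its `opener` parameter are made up for illustration, and the only thing assumed about GTDB is the releases directory linked in this thread.

```python
import urllib.error
import urllib.request

# Releases directory linked in this thread; the file names tried below are
# the two mentioned in the discussion (VERSION and VERSION.txt).
GTDB_LATEST = "https://data.gtdb.ecogenomic.org/releases/latest/"


def fetch_first_available(base_url, candidate_names, opener=urllib.request.urlopen):
    """Return (name, text) for the first candidate file that exists.

    A 404 moves on to the next candidate; any other HTTP error is re-raised
    so that real outages are not silently hidden.
    """
    for name in candidate_names:
        try:
            with opener(base_url + name) as response:
                return name, response.read().decode("utf-8")
        except urllib.error.HTTPError as err:
            if err.code != 404:
                raise
    raise FileNotFoundError(
        f"none of {list(candidate_names)} found under {base_url}"
    )


# Example call (needs network access), trying the new name first:
# name, text = fetch_first_available(GTDB_LATEST, ["VERSION.txt", "VERSION"])
```

Passing the opener in as a parameter is just a convenience so the fallback logic can be exercised without touching the network.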

I'm embarrassed I don't have auto-testing set up here yet...

This is fixed as of v1.7.07, which is up now in my conda channel, and will be up in bioconda soon. This command will install it from my channel:

mamba create -y -n gtotree -c astrobiomike -c conda-forge -c bioconda -c defaults gtotree=1.7.07
conda activate gtotree

Side note on mamba if that's new:

If you aren't familiar with mamba yet, it's a drop-in replacement for conda that we install on top of conda, and it makes virtually all installs much faster and smoother. It's worth putting everywhere you use conda. It can be installed like so:

conda install -n base -c conda-forge mamba

Then all installs or env creations should be done with mamba instead of conda, as in the create command above, but activating and deactivating environments should still be done with conda.

Thanks again!

oduwoleiyanu commented 1 year ago

You are welcome. Okay, I can try installing it with mamba. Also, let me know when it is up on bioconda.

Best regards, Iyanu

AstrobioMike commented 1 year ago

It's updated in bioconda too now 👍

oduwoleiyanu commented 1 year ago

Thanks, Mike. I realized that GToTree did not compute the 17418 representative genomes I wanted to use. Please, how would I modify the cat command to take 50 representatives from each phylum I want to use? For example, with cat .*txt > all-file, how would I choose just 50 reps out of the GTDB representatives from each phylum?

Thanks.

Best regards, Iyanu

AstrobioMike commented 1 year ago

heya, Iyanu,

That can't be done with cat, unfortunately.

Right now i have a helper program packaged with GToTree that will take one random member per specified rank. When i'm making a tree that I want to span the known diversity of bacteria/archaea, i use it like this:

First, getting all GTDB representative genome accessions:

gtt-get-accessions-from-GTDB -t all --GTDB-representatives-only

With the current v207, 8-Apr-2022 GTDB database, that's 65,703, which is for sure also way too many to include.

But then i typically subset that down to hold 1 random member per Order, like so (this takes the table produced from the previous command as input):

gtt-subset-GTDB-accessions --get-only-individuals-for-the-rank order -i GTDB-arc-and-bac-refseq-rep-metadata.tsv

#  65,703 initial entries were subset down to 1,593
# 
#  Subset accessions file for GToTree written to:
#     subset-accessions.txt
# 
#   A subset GTDB taxonomy table for these accessions written to:
#     subset-accessions-taxonomy.tsv

That is down to 1,593, which is much more reasonable to tree and be able to explore/visualize. Doing it at the Class level, instead of Order, cuts it down to 481, but again i typically use Order just fine. So we'd take that "subset-accessions.txt" file and give it to our GToTree run along with our own new genomes we're putting in.

There are 189 phyla in GTDB currently, and i'm sure not all of them have 50 unique representative genomes, but it would still probably be a pretty high number of genomes, that if tree'd successfully, would probably be difficult to visualize or explore.
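To put a rough number on that: taking up to 50 genomes from each of 189 phyla gives at most 189 × 50 = 9,450 genomes, and the real total is the sum of min(count, 50) over the phyla. Here is a small Python sketch of that arithmetic. The function name `capped_total` and the per-phylum counts in the example are invented purely for illustration; they are not GTDB numbers.

```python
def capped_total(counts_per_taxon, cap):
    """Total genomes kept when each taxon contributes at most `cap`."""
    return sum(min(count, cap) for count in counts_per_taxon.values())


# Upper bound from the numbers in this thread: 189 phyla, 50 genomes each.
upper_bound = 189 * 50
print(upper_bound)  # 9450

# Hypothetical counts, made up for illustration only.
example_counts = {"p__PhylumA": 1200, "p__PhylumB": 50, "p__PhylumC": 7}
print(capped_total(example_counts, cap=50))  # 50 + 50 + 7 = 107
```

Even if many phyla fall short of 50 representatives, the total can still land in the thousands, which is well above the 1,593 from the one-per-Order approach.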

Do you think the above approach will work for you?

The gtt-subset-GTDB-accessions program can have any rank specified, but right now it is written to always return just 1 random genome for each taxon of the specified rank. If the above won't work for your purposes here, I could probably add an option to specify how many genomes per rank we'd want to randomly get back. In that case you could then specify phylum and 50, but again i don't think that particular combo will work out, and if the goal is having a view across the known diversity of bacteria/archaea, i think the Order way above with 1 each is a good way to go.

But let me know if you want me to build in that option. Unless i hit a weird problem trying to do it, I think I could get it in sometime this weekend. If you're trying to do something else entirely, then I'll need some more details to be able to try to help out.
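For anyone who wants to experiment with the "N random genomes per rank" idea in the meantime, here is a minimal Python sketch. It is not how gtt-subset-GTDB-accessions is implemented and it does not read GToTree's tables directly: it assumes you have already pulled out (accession, taxonomy) pairs, with taxonomy strings in the GTDB style of semicolon-separated, prefixed ranks (d__, p__, c__, o__, f__, g__, s__). The function name `subset_per_rank`, its parameters, and the accessions and taxon names in the example are all hypothetical.

```python
import random
from collections import defaultdict

# GTDB taxonomy strings label each rank with a one-letter prefix.
RANK_PREFIX = {
    "domain": "d__", "phylum": "p__", "class": "c__", "order": "o__",
    "family": "f__", "genus": "g__", "species": "s__",
}


def subset_per_rank(rows, rank, per_taxon=1, seed=None):
    """Pick up to `per_taxon` random accessions for each taxon at `rank`.

    `rows` is an iterable of (accession, taxonomy) pairs, where taxonomy is
    a string such as "d__Bacteria;p__PhylumA;c__ClassA;o__OrderA".
    Returns a dict mapping each taxon to its list of picked accessions.
    """
    prefix = RANK_PREFIX[rank]
    groups = defaultdict(list)
    for accession, taxonomy in rows:
        for field in taxonomy.split(";"):
            field = field.strip()
            if field.startswith(prefix):
                groups[field].append(accession)
                break

    rng = random.Random(seed)  # pass a seed for a reproducible subset
    picked = {}
    for taxon, accessions in sorted(groups.items()):
        # Taxa with fewer than `per_taxon` members contribute all of them.
        k = min(per_taxon, len(accessions))
        picked[taxon] = rng.sample(accessions, k)
    return picked


# Made-up rows for illustration only.
example_rows = [
    ("acc1", "d__Bacteria;p__PhylumA;c__ClassA;o__OrderA"),
    ("acc2", "d__Bacteria;p__PhylumA;c__ClassA;o__OrderB"),
    ("acc3", "d__Bacteria;p__PhylumA;c__ClassB;o__OrderC"),
    ("acc4", "d__Archaea;p__PhylumB;c__ClassC;o__OrderD"),
]
subset = subset_per_rank(example_rows, "phylum", per_taxon=2, seed=1)
for taxon, accessions in subset.items():
    print(taxon, len(accessions))
```

With `per_taxon=1` this mirrors the current one-per-taxon behavior described above; a larger value simply caps each taxon at that many genomes.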

cheers :)

oduwoleiyanu commented 1 year ago

Thanks. This works for me. I also want to say how much I appreciate how easy and effective your tools are for beginners. Your GitHub pages for bioinfo tools and GToTree are well arranged and easy to follow. If there is an award for that, I am definitely voting for you. Please, do not hesitate to put out more resources. Also, I would love to join your mentoring group if you have one. I have learnt so much from your tools.

Thanks.

Best regards, Iyanu

AstrobioMike commented 1 year ago

Great! And thanks so much for the kind words, Iyanu ❤️