ENH: initial draft of `get-gtdb`

mikerobeson commented 1 year ago

This PR addresses #47, to download SSU data from GTDB.

The user provides the version of SSU data they wish to download from GTDB. Currently only versions 202 and 207 are allowed, with 207 being the default.

The code will download the sequence (.tar.gz) and taxonomy (.gz) files for the Archaea and Bacteria, then merge them together producing two output files of the types:

FeatureData[Taxonomy]
FeatureData[Sequence]

nbokulich commented 1 year ago

awesome, thanks @mikerobeson ! Let me know when this is ready to review. Just a few early comments and questions pre-review, none of which are very important:

instead of automatically downloading and merging both bacteria and archaea, do you think that it would be useful to expose an option? E.g., for bacteria, archaea, or both? (both can be default)
I suppose right now it is just SSU. Do you think that an option (or separate action) to download the GTDB genomes (instead of SSU) would be useful? @misialq any thoughts?
GTDB also provides reference phylognies. Do you think that these would be a useful companion file to download? We discussed this previously re: get_silva_data. One issue I see is that GTDB only supplies separate trees for bacteria and archaea.

mikerobeson commented 1 year ago

Hi @nbokulich,

I was considering setting up an option for selecting Archaea, Bacteria, Both. But I was concerned how many users might simply download only bacteria, for classifying their bacterial data without any form of 'outgroup' taxa, e.g. the Archaea. I did not want to contribute to the issue we see often see with fungal ITS data. That is, when classifying fungi w/o non-fungi outgroup taxa, many non-fungal taxa can still be classified as fungi. I could offer this option though, if there are other use cases I'm not considering.
I can go either way with this. If needed, I can rename this action as get-gtdb-ssu, and then we can set up another get-gtdb-genomes action for the short-term. I've not looked into setting up new data types for genome data, unless they already exist, or if they're needed. I am also not sure of what anyone else has done with genome data within QIIME 2.
Yeah, the issue is the separate phylogenies, I am not sure how useful they'd be, or what they'd be used for... Perhaps helping construct a separate GTDB based sepp reference for archaea and bacteria?

nbokulich commented 1 year ago

Hi @mikerobeson ,

yeah good point. I suppose I can't really think of a good use case, unless if the data are being used in some other way (e.g., for building a phylogeny but not for taxonomic classification). My gut is still telling me to give the user options, but my brain agrees that we don't want to make something that can be abused. Making both the default would limit accidental abuse.
Relevant types exist for genome sequence data in q2-types-genomics. @misialq do you think that this should be a separate action or exposed as an option (e.g., to grab SSU or genome seqs)? Either way, this can be a follow-up PR but we should decide now for optimal naming.
Yeah agreed, it's unfortunate. 😞

misialq commented 1 year ago

Hey both, just a quick comment on the SSU vs. genomes: I'm not sure what the genome data fetched from GTDB looks exactly like but I think it would probably make sense and would not be too complicated to just have it in the same action as most of the code would be shared between those. And yes, q2-types-genomics does have some GenomeData types, although those are more suited for storing genomic features (loci, genes, proteins) rather than full genomes (if it's just genomic nt sequences that are being fetched, aren't those anyway more like a FeatureData[Sequence] type?)

nbokulich commented 1 year ago

(if it's just genomic nt sequences that are being fetched, aren't those anyway more like a FeatureData[Sequence] type?)

Yeah as far as I recall, these are just FASTAs of the whole genomes, so FeatureData[Sequence] should work.

nbokulich commented 1 year ago

Hey @mikerobeson are your latest changes ready for review or are you still working on it?

I did some user testing to check the outputs etc, and all works 👍 . However, I notice that there are many more records in the taxonomy (317543) than there are sequences (40660) (these counts are from the "Both" outputs). I confirmed that this matches the contents of the files from GTDB, so it looks like everything is operating as intended. But I wonder: do you know why there is this disparity?

mikerobeson commented 1 year ago

Hi @nbokulich,

I noticed the disparity between the taxonomy and sequence counts too. I think the reason is, this is the taxonomy for all of GTDB. That is, not all GTDB genomes have an associated rRNA sequence, and/or these where removed from the SSU files due to quality issues. I was considering adding a call to rescript filter-taxa to remove the extra taxonomy labels, as the extra labels might cause user confusion.

nbokulich commented 1 year ago

hey @mikerobeson , This also LGTM, thanks! Are you done making changes or is this ready to merge?

I was considering adding a call to rescript filter-taxa to remove the extra taxonomy labels, as the extra labels might cause user confusion.

No, I do not think that we should do this, leave it to the user to decide.

I think GTDB provides a file of md5 hashes for the downloads, so if you want you could just add an md5 check after the download. Then in case anyone is asking why the files don't match, we can just say (with absolute certainty!) that the action is just downloading exactly what GTDB provides.

mikerobeson commented 1 year ago

@nbokulich ahh darn! I just added the "excess taxa filtering" and an associated test. I'll remove them and re-commit. Then it'll be ready to merge.

mikerobeson commented 1 year ago

Okay @nbokulich, I removed the excess taxa filtering code. This is ready to merge!

bokulich-lab / RESCRIPt

ENH: initial draft of `get-gtdb` #154