Closed mikerobeson closed 1 year ago
awesome, thanks @mikerobeson ! Let me know when this is ready to review. Just a few early comments and questions pre-review, none of which are very important:
get_silva_data
. One issue I see is that GTDB only supplies separate trees for bacteria and archaea.
Hi @nbokulich,
Archaea, Bacteria, Both. But I was concerned how many users might simply download only bacteria, for classifying their bacterial data without any form of 'outgroup' taxa, e.g. the Archaea. I did not want to contribute to the issue we often see with fungal ITS data. That is, when classifying fungi w/o non-fungi outgroup taxa, many non-fungal taxa can still be classified as fungi. I could offer this option though, if there are other use cases I'm not considering.
get-gtdb-ssu, and then we can set up another get-gtdb-genomes action for the short-term. I've not looked into setting up new data types for genome data, unless they already exist, or if they're needed. I am also not sure of what anyone else has done with genome data within QIIME 2.
Hi @mikerobeson,
both the default would limit accidental abuse.
q2-types-genomics. @misialq do you think that this should be a separate action or exposed as an option (e.g., to grab SSU or genome seqs)? Either way, this can be a follow-up PR but we should decide now for optimal naming.
Hey both,
just a quick comment on the SSU vs. genomes: I'm not sure exactly what the genome data fetched from GTDB looks like, but I think it would probably make sense, and would not be too complicated, to just have it in the same action, as most of the code would be shared between those. And yes, q2-types-genomics does have some GenomeData types, although those are more suited for storing genomic features (loci, genes, proteins) rather than full genomes (if it's just genomic nt sequences that are being fetched, aren't those anyway more like a FeatureData[Sequence] type?)
(if it's just genomic nt sequences that are being fetched, aren't those anyway more like a FeatureData[Sequence] type?)
Yeah, as far as I recall, these are just FASTAs of the whole genomes, so FeatureData[Sequence] should work.
Hey @mikerobeson are your latest changes ready for review or are you still working on it?
I did some user testing to check the outputs etc, and all works 👍 . However, I notice that there are many more records in the taxonomy (317543) than there are sequences (40660) (these counts are from the "Both" outputs). I confirmed that this matches the contents of the files from GTDB, so it looks like everything is operating as intended. But I wonder: do you know why there is this disparity?
Hi @nbokulich,
I noticed the disparity between the taxonomy and sequence counts too. I think the reason is that this is the taxonomy for all of GTDB. That is, not all GTDB genomes have an associated rRNA sequence, and/or these were removed from the SSU files due to quality issues. I was considering adding a call to rescript filter-taxa to remove the extra taxonomy labels, as the extra labels might cause user confusion.
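To make the idea concrete, here is a rough sketch in plain Python of dropping taxonomy labels that have no matching sequence. This is not how rescript filter-taxa is implemented; the function name and the feature IDs below are made up purely for illustration.

```python
def drop_taxa_without_sequences(taxonomy, sequence_ids):
    """Keep only taxonomy entries whose feature ID also has a sequence.

    taxonomy     -- dict mapping feature ID -> taxonomy string (hypothetical layout)
    sequence_ids -- iterable of feature IDs present in the sequence file
    """
    keep = set(sequence_ids)
    return {fid: tax for fid, tax in taxonomy.items() if fid in keep}


# Made-up IDs: three taxonomy labels, but only two have an SSU sequence.
taxonomy = {
    "genome_A": "d__Bacteria; p__Firmicutes",
    "genome_B": "d__Archaea; p__Halobacteriota",
    "genome_C": "d__Bacteria; p__Proteobacteria",
}
sequence_ids = ["genome_A", "genome_B"]

filtered = drop_taxa_without_sequences(taxonomy, sequence_ids)
print(sorted(filtered))
```

With the counts reported above, a filter like this would shrink the taxonomy to at most the number of sequence records.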
hey @mikerobeson , This also LGTM, thanks! Are you done making changes or is this ready to merge?
I was considering adding a call to rescript filter-taxa to remove the extra taxonomy labels, as the extra labels might cause user confusion.
No, I do not think that we should do this, leave it to the user to decide.
I think GTDB provides a file of md5 hashes for the downloads, so if you want you could just add an md5 check after the download. Then in case anyone is asking why the files don't match, we can just say (with absolute certainty!) that the action is just downloading exactly what GTDB provides.
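A minimal sketch of what such a check could look like, using only the standard library. It assumes the checksum listing uses the common "hash, whitespace, filename" layout produced by md5sum; I have not confirmed the exact format or name of the GTDB checksum file, and the helper names here are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5_listing(text):
    """Parse md5sum-style lines ("<hash>  <filename>") into {basename: hash}.

    Assumption: one entry per line, hash first, file name last.
    """
    expected = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            md5, name = parts[0], parts[-1]
            # md5sum marks binary mode with a leading '*'; strip it if present.
            expected[Path(name.lstrip("*")).name] = md5.lower()
    return expected


def verify_download(path, expected):
    """Raise if the file is not listed or its digest differs; return the digest."""
    name = Path(path).name
    if name not in expected:
        raise KeyError(f"no checksum listed for {name}")
    actual = md5_of_file(path)
    if actual != expected[name]:
        raise ValueError(f"md5 mismatch for {name}: {actual} != {expected[name]}")
    return actual
```

The action could call something like verify_download on each file right after fetching it, and fail early with a clear message if a download is truncated or corrupted.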
@nbokulich ahh darn! I just added the "excess taxa filtering" and an associated test. I'll remove them and re-commit. Then it'll be ready to merge.
Okay @nbokulich, I removed the excess taxa filtering code. This is ready to merge!
This PR addresses #47, to download SSU data from GTDB.
The user provides the version of SSU data they wish to download from GTDB. Currently only versions 202 and 207 are allowed, with 207 being the default.
The code will download the sequence (.tar.gz) and taxonomy (.gz) files for the Archaea and Bacteria, then merge them together producing two output files of the types: