This PR documents how the metadata tables (TSVs and CSVs with genome or transcriptome accession numbers and lineages) were created. The metadata tables will serve as inputs to the Snakefile, defining the wildcards, such as genome accession and genus, that the pipeline needs.
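As a rough illustration of how a metadata table can feed wildcard values to a Snakefile, here is a minimal Python sketch. The column names (`accession`, `genus`), the helper functions, and the accession values are all made up for the example; the real tables and Snakefile may look different.

```python
import csv
import io

# Hypothetical metadata table. Column names and accessions are placeholders,
# not values taken from the real tables.
EXAMPLE_TSV = (
    "accession\tgenus\n"
    "GCA_000000001.1\tAspergillus\n"
    "GCA_000000002.1\tAspergillus\n"
    "GCA_000000003.1\tCandida\n"
)


def read_metadata(handle):
    """Return the table rows as a list of dicts keyed by column name."""
    return list(csv.DictReader(handle, delimiter="\t"))


def wildcard_values(rows, column):
    """Return the unique values of one column, in first-seen order."""
    return list(dict.fromkeys(row[column] for row in rows))


rows = read_metadata(io.StringIO(EXAMPLE_TSV))
ACCESSIONS = wildcard_values(rows, "accession")
GENERA = wildcard_values(rows, "genus")
```

In a Snakefile, lists like `ACCESSIONS` and `GENERA` would typically be passed to `expand()` to build the target file names for each wildcard value.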
The pipeline only operates on NCBI genomes or transcriptomes. The genomes must have `_CDS_from_genomic.fna.gz` files, and the transcriptomes must be available in the Transcriptome Shotgun Assembly (TSA) database and annotated as eukaryotic.
There are two notebooks right now. The first generates metadata tables for all eukaryotic genomes and transcriptomes that meet the above criteria. The second uses these tables to pull out a small set of ~100 fungal genomes and transcriptomes that Emily is interested in, which I plan to use as a biological test case for the pipeline before deploying it on all the things.
Some of the code in one of the notebooks looks like it wasn't run. It was; I just didn't want to re-run those chunks since they query the NCBI API to retrieve lineages. Instead, I restarted the notebook and resumed running things by reading in an RDS object that stored the intermediate files.
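The notebooks do this with an RDS object in R. For readers more at home in Python, the same "query once, then reload from disk" pattern can be sketched roughly as below. The `cached` helper, the `fetch_lineages` stand-in, and the lineage data are all hypothetical and are not part of this repo.

```python
import pickle
import tempfile
from pathlib import Path


def cached(path, compute):
    """Load a pickled result from `path` if present; otherwise compute and save it."""
    path = Path(path)
    if path.exists():
        with path.open("rb") as fh:
            return pickle.load(fh)
    result = compute()
    with path.open("wb") as fh:
        pickle.dump(result, fh)
    return result


calls = []  # Records how many times the slow step actually runs.


def fetch_lineages():
    # Stand-in for the slow NCBI lineage query; returns made-up data.
    calls.append(1)
    return {"GCA_000000001.1": "Eukaryota;Fungi"}


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "lineages.pkl"
    first = cached(cache_file, fetch_lineages)   # Runs the query and writes the cache.
    second = cached(cache_file, fetch_lineages)  # Reads the cache; no second query.
```

The second call returns the stored result without touching the API, which is the same reason the lineage chunks in the notebook appear not to have been run.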
Lastly, the README provides a general overview of what this repo aims to do, to help orient others to what's going on. It's a work in progress, but I think it's good enough for now given the half-baked state of the repo so far!