===========================
Data provided per assembly:
===========================
Sequence and other data files provided per assembly are named according to the
rule:
[assembly accession.version]_[assembly name]_[content type].[optional format]
File formats and content:
assembly_status.txt
A text file reporting the current status of the version of the assembly
for which data is provided. Any assembly anomalies are also reported.
*_assembly_report.txt file
Tab-delimited text file reporting the name, role and sequence
accession.version for objects in the assembly. The file header contains
meta-data for the assembly including: assembly name, assembly
accession.version, scientific name of the organism and its taxonomy ID,
assembly submitter, and sequence release date.
*_assembly_stats.txt file
Tab-delimited text file reporting statistics for the assembly including:
total length, ungapped length, contig & scaffold counts, contig-N50,
scaffold-L50, scaffold-N50, scaffold-N75, and scaffold-N90
*_assembly_regions.txt
Provided for assemblies that include alternate or patch assembly units.
Tab-delimited text file reporting the location of genomic regions and
listing the alt/patch scaffolds placed within those regions.
*_assembly_structure directory
This directory will only be present if the assembly has internal
structure. When present, it will contain AGP files that define how
component sequences are organized into scaffolds and/or chromosomes.
Other files define how scaffolds and chromosomes are organized into
non-nuclear and other assembly-units, and how any alternate or patch
scaffolds are placed relative to the chromosomes. Refer to the README.txt
file in the assembly_structure directory for additional information.
*_cds_from_genomic.fna.gz
FASTA format of the nucleotide sequences corresponding to all CDS
features annotated on the assembly, based on the genome sequence. See
the "Description of files" section below for details of the file format.
*_feature_count.txt.gz
Tab-delimited text file reporting counts of gene, RNA, CDS, and similar
features, based on data reported in the *_feature_table.txt.gz file.
See the "Description of files" section below for details of the file
format.
*_feature_table.txt.gz
Tab-delimited text file reporting locations and attributes for a subset
of annotated features. Included feature types are: gene, CDS, RNA (all
types), operon, C/V/N/S_region, and V/D/J_segment. Replaces the .ptt &
.rnt format files that were provided in the old genomes FTP directories.
See the "Description of files" section below for details of the file
format.
*_gene_expression_counts.txt.gz
Tab-delimited text file with counts of RNA-seq reads mapped to each gene.
See "Description of files" section below for details of the file format.
*_gene_ontology.gaf.gz
Gene Ontology (GO) annotation of the annotated genes in GO Annotation
File (GAF) format. Additional information about the GAF format is
available at
http://geneontology.org/docs/go-annotation-file-gaf-format-2.1/
*_genomic.fna.gz file
FASTA format of the genomic sequence(s) in the assembly. Repetitive
sequences in eukaryotes are masked to lower-case (see below).
The FASTA title is formatted as sequence accession.version plus
description. The genomic.fna.gz file includes all top-level sequences in
the assembly (chromosomes, plasmids, organelles, unlocalized scaffolds,
unplaced scaffolds, and any alternate loci or patch scaffolds). Scaffolds
that are part of the chromosomes are not included because they are
redundant with the chromosome sequences; sequences for these placed
scaffolds are provided under the assembly_structure directory.
*_genomic.gbff.gz file
GenBank flat file format of the genomic sequence(s) in the assembly. This
file includes both the genomic sequence and the CONTIG description (for
CON records), hence, it replaces both the .gbk & .gbs format files that
were provided in the old genomes FTP directories.
*_genomic.gff.gz file
Annotation of the genomic sequence(s) in Generic Feature Format Version 3
(GFF3). Sequence identifiers are provided as accession.version.
Additional information about NCBI's GFF files is available at
https://ftp.ncbi.nlm.nih.gov/genomes/README_GFF3.txt.
*_genomic.gtf.gz file
Annotation of the genomic sequence(s) in Gene Transfer Format Version 2.2
(GTF2.2). Sequence identifiers are provided as accession.version.
*_genomic_gaps.txt.gz
Tab-delimited text file reporting the coordinates of all gaps in the
top-level genomic sequences. The gaps reported include gaps specified in
the AGP files, gaps annotated on the component sequences, and any other
run of 10 or more Ns in the sequences. See the "Description of files"
section below for details of the file format.
*_protein.faa.gz file
FASTA format sequences of the accessioned protein products annotated on
the genome assembly. The FASTA title is formatted as sequence
accession.version plus description.
*_protein.gpff.gz file
GenPept format of the accessioned protein products annotated on the
genome assembly
*_rm.out.gz file
RepeatMasker output;
Provided for Eukaryotes
*_rm.run file
Documentation of the RepeatMasker version, parameters, and library;
Provided for Eukaryotes
*_rna.fna.gz file
FASTA format of accessioned RNA products annotated on the genome
assembly; Provided for RefSeq assemblies as relevant (Note, RNA and mRNA
products are not instantiated as a separate accessioned record in GenBank
but are provided for some RefSeq genomes, most notably the eukaryotes.)
The FASTA title is provided as sequence accession.version plus
description.
*_rna.gbff.gz file
GenBank flat file format of RNA products annotated on the genome
assembly; Provided for RefSeq assemblies as relevant
*_rna_from_genomic.fna.gz
FASTA format of the nucleotide sequences corresponding to all RNA
features annotated on the assembly, based on the genome sequence. See
the "Description of files" section below for details of the file format.
*_rnaseq_alignment_summary.txt
Tab-delimited text file containing counts of alignments that were either
assigned to a gene or skipped for a specific reason. See "Description of
files" section below for details of the file format.
*_rnaseq_runs.txt
Tab-delimited text file containing information about RNA-seq runs used
for gene expression analyses (See *_featurecounts.txt file and *.bw files
within "RNASeq_coverage_graphs" directory).
*_translated_cds.faa.gz
FASTA sequences of individual CDS features annotated on the genomic
records, conceptually translated into protein sequence. The sequence
corresponds to the translation of the nucleotide sequence provided in the
*_cds_from_genomic.fna.gz file.
*_wgsmaster.gbff.gz
GenBank flat file format of the WGS master for the assembly (present only
if a WGS master record exists for the sequences in the assembly).
annotation_hashes.txt
Tab-delimited text file reporting hash values for different aspects
of the annotation data. See the "Description of files" section below
for details of the file format.
md5checksums.txt file
file checksums are provided for all data files in the directory
All of this data is provided by refseq for genomes. We should build a parser for getting all of this data into a JSON formatted nice format.
More than just a genbank parser, we should have a refseq parser.
From https://ftp.ncbi.nlm.nih.gov/genomes/refseq/archaea/Halobacterium_salinarum/latest_assembly_versions/GCF_000006805.1_ASM680v1/README.txt
All of this data is provided by refseq for genomes. We should build a parser for getting all of this data into a JSON formatted nice format.