Open nickdos opened 2 years ago
https://www.ebi.ac.uk/ena/browser/text-search?query=Dromaius%20novaehollandiae
The 8 assembly files appear to contain the same accession numbers from NCBI genome data:
- GCA_020892055.1 emu_male_v1.2 assembly for Dromaius novaehollandiae
- GCA_003342905.1 droNov1 assembly for Dromaius novaehollandiae
- GCA_013396795.1 ASM1339679v1 assembly for Dromaius novaehollandiae
- GCA_020892035.1 emu_female_v2.2_2 assembly for Dromaius novaehollandiae
- GCA_016128335.1 ZJU1.0 assembly for Dromaius novaehollandiae
- GCA_020892015.1 emu_female_v2.2_1 assembly for Dromaius novaehollandiae
- GCA_006938045.1 anoDid_nucDNA_mapDamage assembly for Anomalopteryx didiformis
- GCA_006937325.1 anoDid_nucDNA_orig assembly for Anomalopteryx didiformis
vs NCBI (Demo app):
- GCA_016128335.1 Dromaius novaehollandiae
- GCA_020892015.1 Dromaius novaehollandiae
- GCA_003342905.1 Dromaius novaehollandiae
- GCA_020892035.1 Dromaius novaehollandiae
- GCA_013396795.1 Dromaius novaehollandiae
- GCF_003342905.1 Dromaius novaehollandiae
- GCA_020892055.1 Dromaius novaehollandiae
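To make the comparison easier to repeat, here is a minimal Python sketch that diffs the two accession lists above with plain set operations. The variable names are just illustrative; the accessions are copied from the two listings.

```python
# Accessions returned by the ENA text search (copied from the list above).
ena = {
    "GCA_020892055.1", "GCA_003342905.1", "GCA_013396795.1",
    "GCA_020892035.1", "GCA_016128335.1", "GCA_020892015.1",
    "GCA_006938045.1", "GCA_006937325.1",
}

# Accessions returned by NCBI via the demo app (copied from the list above).
ncbi = {
    "GCA_016128335.1", "GCA_020892015.1", "GCA_003342905.1",
    "GCA_020892035.1", "GCA_013396795.1", "GCF_003342905.1",
    "GCA_020892055.1",
}

only_in_ena = sorted(ena - ncbi)   # in ENA but not in the NCBI list
only_in_ncbi = sorted(ncbi - ena)  # in the NCBI list but not in ENA
shared = sorted(ena & ncbi)

print("ENA only: ", only_in_ena)
print("NCBI only:", only_in_ncbi)
print("Shared:   ", len(shared))
```

With the lists as given, the ENA-only entries are the two Anomalopteryx didiformis assemblies and the NCBI-only entry is the RefSeq accession GCF_003342905.1.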
Missing record from NCBI: https://www.ebi.ac.uk/ena/browser/view/GCA_020892035?show=assembly-stats (XML version: https://www.ebi.ac.uk/ena/browser/api/xml/GCA_020892035.1)
I contacted ENA support about the possibility of there being a bulk service like NCBI and got this reply:
https://www.ebi.ac.uk/ena/portal/api/search?result=assembly&fields=all&limit=10&format=tsv
- change format to json if you prefer that
- fields=all includes all indexed fields; you can request only the ones of interest by providing a comma-separated list
- set limit=0 to get all records
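The options in the reply can be wrapped in a small helper. This is only a sketch using the standard library: `build_search_url` and `fetch` are hypothetical names, and the only parameters used are the ones mentioned in the reply (`result`, `fields`, `limit`, `format`).

```python
from urllib.parse import urlencode
from urllib.request import urlopen

ENA_SEARCH = "https://www.ebi.ac.uk/ena/portal/api/search"


def build_search_url(fields=None, limit=10, fmt="tsv"):
    """Build an assembly search URL from the options described in the reply."""
    params = {
        "result": "assembly",
        # "all" returns every indexed field; otherwise a comma-separated list.
        "fields": ",".join(fields) if fields else "all",
        # limit=0 asks for all records.
        "limit": limit,
        "format": fmt,
    }
    # Keep commas literal so the field list stays readable in the URL.
    return f"{ENA_SEARCH}?{urlencode(params, safe=',')}"


def fetch(url):
    """Download the response body as text (makes a network request)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


print(build_search_url())
print(build_search_url(["accession", "tax_id"], limit=0, fmt="json"))
```

Calling `fetch(build_search_url(limit=0))` would pull the full result set, so it is worth trying a small `limit` first.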
This looks promising.
Edit: the query ran quite quickly and produced a TSV file with 1,362,660 lines of data, which is similar to NCBI genome/GenBank at 1,325,926 (1,589,290 combined with RefSeq).
Example data in JSON:
{
"accession": "GCA_000001215",
"study_accession": "PRJNA13812",
"sample_accession": "SAMN02803731",
"secondary_sample_accession": "",
"assembly_name": "Release 5",
"assembly_title": "Release 5 assembly for Drosophila melanogaster",
"study_name": "Drosophila melanogaster strain:y; cn bw sp",
"study_title": "The Drosophila melanogaster genome was assembled from a combination of whole genome shotgun (WGS) and clone-based sequence data, and is annotated by the FlyBase Consortium.",
"study_description": "The D. melanogaster genome is approximately 180 Mb in size, 120 Mb of which is euchromatic, and was sequenced as an early test of the applicability of whole-genome shotgun (WGS) sequencing technology for large eukaryotic genomes. The genome is organized into two large and one small autosomes, an X chromosome, and a heterochromatic Y chromosome. The heterochromatin is mainly comprised of simple repeats, transposable elements, and tandem arrays of ribosomal RNA genes. A small number of genes have also been found within the heterochromatin. The sequence of the D. melanogaster genome, originally determined in a collaboration between Celera and the Berkeley Drosophila Genome Project, is described in the March 24, 2000 issue of Science. Ongoing efforts at the Berkeley Drosophila Genome Project and Drosophila Heterochromatin Genome Project have corrected and expanded the sequence (Celniker et al., 2002; Hoskins et al., 2002). In August 2007, Release 5.2 was made public, providing a unified assembly of the euchromatin and heterochromatin with only 8 gaps remaining in the euchromatic chromosome arm assemblies. A genome browser containing the release 5 assembly and annotation data is available at FlyBase. Release 5 of the D. melanogaster genomic sequence is available for download from GenBank, and from the Berkeley Drosophila Genome Project website along with the Release 5 notes. The FlyBase Consortium provides a high quality annotation of the D. melanogaster genome (Misra et al., 2002). FlyBase curators manually review available genomic, transcript, and protein sequence data to generate gene models based on traceable support evidence. Annotation includes protein coding genes, pseudogenes, and non-coding RNA genes. The annotation provided by FlyBase is available at the FlyBase web site, in sequence updates submitted to the archival databases (DDBJ/EMBL/GenBank), in the NCBI RefSeq records, and in the NCBI Map Viewer genome browser.",
"tax_id": "7227",
"scientific_name": "Drosophila melanogaster",
"strain": "",
"base_count": "139465864",
"assembly_level": "chromosome",
"genome_representation": "full",
"last_updated": "2016-11-29",
"version": "2",
"assembly_type": "",
"wgs_set": "AABU01",
"run_ref": ""
}
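Note that every value in the record is a string, and that `accession` and `version` come back as separate fields, while the browser listings earlier in this issue use the dotted form (e.g. GCA_020892055.1). A rough sketch of normalising a record follows; joining the two fields with a dot is my assumption, and `versioned_accession` is a hypothetical helper name.

```python
import json

# A trimmed copy of the example record above (long text fields omitted).
record_json = """
{
  "accession": "GCA_000001215",
  "study_accession": "PRJNA13812",
  "sample_accession": "SAMN02803731",
  "tax_id": "7227",
  "base_count": "139465864",
  "version": "2"
}
"""


def versioned_accession(record):
    """Join accession and version, assuming the '<accession>.<version>' form."""
    return f'{record["accession"]}.{record["version"]}'


record = json.loads(record_json)
base_count = int(record["base_count"])  # numeric fields arrive as strings

print(versioned_accession(record), base_count)
```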
NCBI API:
- Biosample: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=biosample&id=2803731&retmode=json
- Bioproject: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=bioproject&id=13812&retmode=json
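In the two URLs above, the `id` values match the numeric part of the ENA record's `sample_accession` (SAMN02803731) and `study_accession` (PRJNA13812). Assuming that pattern holds in general, which I have only inferred from this one example, the lookup URLs could be derived like this (`esummary_url` is a hypothetical helper):

```python
import re
from urllib.parse import urlencode

ESUMMARY = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"


def esummary_url(db, accession):
    """Build an esummary URL, assuming the id is the accession's numeric part."""
    match = re.fullmatch(r"([A-Z]+)(\d+)", accession)
    if match is None:
        raise ValueError(f"Unexpected accession format: {accession!r}")
    uid = int(match.group(2))  # int() drops leading zeros, e.g. 02803731 -> 2803731
    return f"{ESUMMARY}?{urlencode({'db': db, 'id': uid, 'retmode': 'json'})}"


print(esummary_url("biosample", "SAMN02803731"))
print(esummary_url("bioproject", "PRJNA13812"))
```

Accessions issued by other archives (e.g. ones not starting with SAMN or PRJNA) may not follow the same pattern, so this should be checked before relying on it.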
Mapping to NCBI
ENA | NCBI | DwC |
---|---|---|
accession | assembly_accession | |
study_accession | bioproject | |
sample_accession | biosample | |
secondary_sample_accession | | |
assembly_name | asm_name | |
assembly_title | ??? | |
study_name | | |
study_title | | |
study_description | | |
tax_id | | |
scientific_name | | |
strain | | |
base_count | | |
assembly_level | | |
genome_representation | | |
last_updated | | |
version | | |
assembly_type | | |
wgs_set | | |
run_ref | | |
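The rows filled in so far could be applied to a record with a simple lookup. This is a sketch only: it covers just the four ENA to NCBI pairs currently in the table, passes unmapped fields through unchanged, and `to_ncbi_names` is a hypothetical helper name.

```python
# ENA field -> NCBI field, limited to the rows filled in so far.
ENA_TO_NCBI = {
    "accession": "assembly_accession",
    "study_accession": "bioproject",
    "sample_accession": "biosample",
    "assembly_name": "asm_name",
}


def to_ncbi_names(record):
    """Rename mapped ENA keys to NCBI names; leave unmapped keys as they are."""
    return {ENA_TO_NCBI.get(key, key): value for key, value in record.items()}


ena_record = {
    "accession": "GCA_000001215",
    "study_accession": "PRJNA13812",
    "sample_accession": "SAMN02803731",
    "assembly_name": "Release 5",
    "tax_id": "7227",
}

print(to_ncbi_names(ena_record))
```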
- https://www.ebi.ac.uk/ena/browser/home
- https://www.ebi.ac.uk/ena/browser/downloading-data
- https://www.ebi.ac.uk/ena/browser/about/content
- https://ena-docs.readthedocs.io/en/latest/retrieval/file-download/ena-ftp-structure.html
- https://www.ebi.ac.uk/ena/browser/checklists
- https://ftp.ebi.ac.uk/pub/databases/ena/
- https://www.ebi.ac.uk/training/online/courses/ena-quick-tour/searching-and-visualising-data-ena/
- Example search for Emu: https://www.ebi.ac.uk/ena/browser/text-search?query=Dromaius%20novaehollandiae
- Taxonomy REST example: https://www.ebi.ac.uk/ena/taxonomy/rest/scientific-name/Dromaius%20novaehollandiae
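For other species, the two example URLs can be generated by percent-encoding the scientific name. A small sketch, with hypothetical helper names and only the two URL patterns shown in this issue:

```python
from urllib.parse import quote

TEXT_SEARCH = "https://www.ebi.ac.uk/ena/browser/text-search"
TAXONOMY_REST = "https://www.ebi.ac.uk/ena/taxonomy/rest/scientific-name"


def text_search_url(scientific_name):
    """Browser text-search URL for a species name (spaces become %20)."""
    return f"{TEXT_SEARCH}?query={quote(scientific_name)}"


def taxonomy_url(scientific_name):
    """Taxonomy REST URL for a species name (spaces become %20)."""
    return f"{TAXONOMY_REST}/{quote(scientific_name)}"


print(text_search_url("Dromaius novaehollandiae"))
print(taxonomy_url("Dromaius novaehollandiae"))
```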