ARGA-Genomes / arga-data

ARGA
Mozilla Public License 2.0

Data: European Nucleotide Archive (ENA) #4

Open nickdos opened 2 years ago

nickdos commented 2 years ago

https://www.ebi.ac.uk/ena/browser/home
https://www.ebi.ac.uk/ena/browser/downloading-data
https://www.ebi.ac.uk/ena/browser/about/content
https://ena-docs.readthedocs.io/en/latest/retrieval/file-download/ena-ftp-structure.html
https://www.ebi.ac.uk/ena/browser/checklists
https://ftp.ebi.ac.uk/pub/databases/ena/

https://www.ebi.ac.uk/training/online/courses/ena-quick-tour/searching-and-visualising-data-ena/
Example search for Emu: https://www.ebi.ac.uk/ena/browser/text-search?query=Dromaius%20novaehollandiae
Taxonomy REST example: https://www.ebi.ac.uk/ena/taxonomy/rest/scientific-name/Dromaius%20novaehollandiae

nickdos commented 2 years ago

https://www.ebi.ac.uk/ena/browser/text-search?query=Dromaius%20novaehollandiae

The 8 assembly files appear to contain the same accession numbers as the NCBI genome data:

GCA_020892055.1 emu_male_v1.2 assembly for Dromaius novaehollandiae
GCA_003342905.1 droNov1 assembly for Dromaius novaehollandiae
GCA_013396795.1 ASM1339679v1 assembly for Dromaius novaehollandiae
GCA_020892035.1 emu_female_v2.2_2 assembly for Dromaius novaehollandiae
GCA_016128335.1 ZJU1.0 assembly for Dromaius novaehollandiae
GCA_020892015.1 emu_female_v2.2_1 assembly for Dromaius novaehollandiae
GCA_006938045.1 anoDid_nucDNA_mapDamage assembly for Anomalopteryx didiformis
GCA_006937325.1 anoDid_nucDNA_orig assembly for Anomalopteryx didiformis

vs NCBI (Demo app):

GCA_016128335.1 Dromaius novaehollandiae
GCA_020892015.1 Dromaius novaehollandiae
GCA_003342905.1 Dromaius novaehollandiae
GCA_020892035.1 Dromaius novaehollandiae
GCA_013396795.1 Dromaius novaehollandiae
GCF_003342905.1 Dromaius novaehollandiae
GCA_020892055.1 Dromaius novaehollandiae

Missing record from NCBI: https://www.ebi.ac.uk/ena/browser/view/GCA_020892035?show=assembly-stats
XML version: https://www.ebi.ac.uk/ena/browser/api/xml/GCA_020892035.1
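
A quick way to double-check the two lists above is to diff them as sets. This is only a sketch: the accessions are copied by hand from this comment, and the variable names are made up for illustration.

```python
# Accessions copied from the ENA text search results listed above.
ena = {
    "GCA_020892055.1", "GCA_003342905.1", "GCA_013396795.1", "GCA_020892035.1",
    "GCA_016128335.1", "GCA_020892015.1", "GCA_006938045.1", "GCA_006937325.1",
}

# Accessions copied from the NCBI (Demo app) list above.
ncbi = {
    "GCA_016128335.1", "GCA_020892015.1", "GCA_003342905.1", "GCA_020892035.1",
    "GCA_013396795.1", "GCF_003342905.1", "GCA_020892055.1",
}

only_in_ena = sorted(ena - ncbi)   # present in ENA, absent from the NCBI list
only_in_ncbi = sorted(ncbi - ena)  # present in the NCBI list, absent from ENA
shared = sorted(ena & ncbi)        # present in both

print("ENA only: ", only_in_ena)
print("NCBI only:", only_in_ncbi)
print("Shared:   ", len(shared))
```

With the lists as pasted here, the ENA-only accessions are the two Anomalopteryx didiformis assemblies, and the NCBI-only accession is the RefSeq entry GCF_003342905.1.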

nickdos commented 2 years ago

I asked ENA support whether there is a bulk download service like NCBI's and got this reply:

https://www.ebi.ac.uk/ena/portal/api/search?result=assembly&fields=all&limit=10&format=tsv

  1. Change format to json if you prefer that.
  2. fields=all includes all indexed fields; you can request only the ones of interest by providing a comma-separated list.
  3. Set limit=0 to get all records.

This looks promising.

Edit: the query ran quite quickly and produced a TSV file with 1,362,660 lines of data, which is similar to NCBI genome/genbank at 1,325,926 (1,589,290 combined with refseq).
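
For reference, here is a rough sketch of how that query could be built in Python. The endpoint and the result, fields, limit and format parameters come from the ENA reply above; the function names, defaults and the download helper are my own assumptions, not anything ENA provides.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the ENA support reply above.
ENA_SEARCH = "https://www.ebi.ac.uk/ena/portal/api/search"


def build_search_url(result="assembly", fields="all", limit=10, fmt="tsv"):
    """Build a search URL; `fields` is "all" or a list of field names."""
    if not isinstance(fields, str):
        fields = ",".join(fields)  # comma-separated list, as the reply describes
    query = {"result": result, "fields": fields, "limit": limit, "format": fmt}
    # safe="," keeps the commas in the field list readable instead of %2C
    return ENA_SEARCH + "?" + urlencode(query, safe=",")


def download(url, path):
    """Hypothetical helper: stream the response to disk in 1 MiB chunks."""
    with urlopen(url) as response, open(path, "wb") as out:
        for chunk in iter(lambda: response.read(1 << 20), b""):
            out.write(chunk)


# The example from the reply, then a full export limited to a few fields.
print(build_search_url())
print(build_search_url(fields=["accession", "tax_id", "scientific_name"],
                       limit=0, fmt="json"))
```

Calling download(build_search_url(limit=0), "ena_assemblies.tsv") would fetch the full set; the output file name is just a placeholder.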

nickdos commented 2 years ago

Example data in JSON:

{
  "accession": "GCA_000001215",
  "study_accession": "PRJNA13812",
  "sample_accession": "SAMN02803731",
  "secondary_sample_accession": "",
  "assembly_name": "Release 5",
  "assembly_title": "Release 5 assembly for Drosophila melanogaster",
  "study_name": "Drosophila melanogaster strain:y; cn bw sp",
  "study_title": "The Drosophila melanogaster genome was assembled from a combination of whole genome shotgun (WGS) and clone-based sequence data, and is annotated by the FlyBase Consortium.",
  "study_description": "The D. melanogaster genome is approximately 180 Mb in size, 120 Mb of which is euchromatic, and was sequenced as an early test of the applicability of whole-genome shotgun (WGS) sequencing technology for large eukaryotic genomes. The genome is organized into two large and one small autosomes, an X chromosome, and a heterochromatic Y chromosome. The heterochromatin is mainly comprised of simple repeats, transposable elements, and tandem arrays of ribosomal RNA genes. A small number of genes have also been found within the heterochromatin. The sequence of the D. melanogaster genome, originally determined in a collaboration between Celera and the Berkeley Drosophila Genome Project, is described in the March 24, 2000 issue of Science. Ongoing efforts at the Berkeley Drosophila Genome Project and Drosophila Heterochromatin Genome Project have corrected and expanded the sequence (Celniker et al., 2002; Hoskins et al., 2002). In August 2007, Release 5.2 was made public, providing a unified assembly of the euchromatin and heterochromatin with only 8 gaps remaining in the euchromatic chromosome arm assemblies. A genome browser containing the release 5 assembly and annotation data is available at FlyBase. Release 5 of the D. melanogaster genomic sequence is available for download from GenBank, and from the Berkeley Drosophila Genome Project website along with the Release 5 notes. The FlyBase Consortium provides a high quality annotation of the D. melanogaster genome (Misra et al., 2002). FlyBase curators manually review available genomic, transcript, and protein sequence data to generate gene models based on traceable support evidence. Annotation includes protein coding genes, pseudogenes, and non-coding RNA genes. The annotation provided by FlyBase is available at the FlyBase web site, in sequence updates submitted to the archival databases (DDBJ/EMBL/GenBank), in the NCBI RefSeq records, and in the NCBI Map Viewer genome browser.",
  "tax_id": "7227",
  "scientific_name": "Drosophila melanogaster",
  "strain": "",
  "base_count": "139465864",
  "assembly_level": "chromosome",
  "genome_representation": "full",
  "last_updated": "2016-11-29",
  "version": "2",
  "assembly_type": "",
  "wgs_set": "AABU01",
  "run_ref": ""
}

NCBI API
Biosample: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=biosample&id=2803731&retmode=json
Bioproject: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=bioproject&id=13812&retmode=json
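
In the example above, the esummary ids look like the numeric part of the ENA sample_accession (SAMN02803731) and study_accession (PRJNA13812). A minimal sketch of building those URLs from an ENA record follows, assuming that pattern holds in general. That is inferred from this single example only and is not confirmed for other records; the function names are hypothetical.

```python
import re

EUTILS_ESUMMARY = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"


def numeric_id(accession):
    """Return the numeric suffix of an accession, e.g. SAMN02803731 -> 2803731.

    Assumption: the NCBI id is the accession's digits with leading zeros dropped.
    """
    match = re.fullmatch(r"[A-Za-z]+(\d+)", accession)
    if match is None:
        raise ValueError(f"unexpected accession format: {accession!r}")
    return int(match.group(1))


def esummary_url(db, accession):
    """Build an esummary URL in the same shape as the two examples above."""
    return f"{EUTILS_ESUMMARY}?db={db}&id={numeric_id(accession)}&retmode=json"


print(esummary_url("biosample", "SAMN02803731"))
print(esummary_url("bioproject", "PRJNA13812"))
```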

Mapping to NCBI

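| ENA | NCBI | DwC |
| --- | --- | --- |
| accession | assembly_accession | |
| study_accession | bioproject | |
| sample_accession | biosample | |
| secondary_sample_accession | | |
| assembly_name | asm_name | |
| assembly_title | ??? | |
| study_name | | |
| study_title | | |
| study_description | | |
| tax_id | | |
| scientific_name | | |
| strain | | |
| base_count | | |
| assembly_level | | |
| genome_representation | | |
| last_updated | | |
| version | | |
| assembly_type | | |
| wgs_set | | |
| run_ref | | |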
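
The rows that already have an NCBI counterpart could be applied as a simple rename. This is a sketch only: it covers just the four pairs filled in above, leaves the unmapped fields out, and the function name is made up.

```python
# ENA field -> NCBI field, limited to the pairs identified in the mapping above.
ENA_TO_NCBI = {
    "accession": "assembly_accession",
    "study_accession": "bioproject",
    "sample_accession": "biosample",
    "assembly_name": "asm_name",
}


def to_ncbi_names(record):
    """Rename mapped ENA fields to their NCBI names; drop fields with no mapping yet."""
    return {ENA_TO_NCBI[key]: value
            for key, value in record.items()
            if key in ENA_TO_NCBI}


# A few fields from the Drosophila melanogaster example record.
ena_record = {
    "accession": "GCA_000001215",
    "study_accession": "PRJNA13812",
    "sample_accession": "SAMN02803731",
    "assembly_name": "Release 5",
    "tax_id": "7227",
}

print(to_ncbi_names(ena_record))
```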