WormBase / wormbase-pipeline

Wormbase Build Pipeline
http://www.wormbase.org

BLAST metadata file #231

Closed MagdalenaZZ closed 2 years ago

MagdalenaZZ commented 2 years ago

We've created a metadata file for BLAST sequences, according to this specification:

| Field | Description |
| --- | --- |
| URI | Uniform resource identifier pointing to the FTP, HTTP(S), or file location. |
| description | Brief description of the file contents. |
| md5sum | MD5 checksum of the file. |
| version | Version information (if any). |
| blast_title | Name that will be displayed on the database selection page. |
| seqtype | What type of sequence it is: nucleotide or protein. (BLAST uses 'nucl' or 'prot'.) |
| comments | Misc comments. |
| meta | Maybe a meta property for misc key/values. |

e.g.

    {
      "URI": "https://ftp.flybase.org/genomes/Drosophila_melanogaster/current/fasta/dmel-all-chromosome-r6.44.fasta.gz",
      "description": "Drosophila melanogaster genome assembly",
      "md5sum": "e18a87dfc861582d7984082a76d83db2",
      "version": "6.44",
      "blast_title": "D. melanogaster Genome Assembly",
      "seqtype": "nucl",
      "comments": "Some comment about the Dmel assembly",
      "meta": {
        "flybase_release": "FB2022_01"
      }
    }
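As a rough illustration of how a consumer might check an entry against the specification above, here is a minimal Python sketch. The function names `validate_entry` and `md5_matches` are hypothetical, and the choice of which fields count as mandatory is an assumption (the specification only marks `version` as optional with "if any").

```python
import hashlib

# Assumption: these fields are treated as mandatory; the spec does not say so explicitly.
REQUIRED_FIELDS = ("URI", "description", "md5sum", "blast_title", "seqtype")
# BLAST's own names for nucleotide and protein databases, as noted in the spec.
VALID_SEQTYPES = {"nucl", "prot"}


def validate_entry(entry):
    """Return a list of problems found in one metadata entry (empty if none)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in entry]
    seqtype = entry.get("seqtype")
    if seqtype is not None and seqtype not in VALID_SEQTYPES:
        problems.append(f"seqtype must be 'nucl' or 'prot', got {seqtype!r}")
    return problems


def md5_matches(data, expected_md5):
    """Compare the MD5 hex digest of downloaded bytes with the md5sum field."""
    return hashlib.md5(data).hexdigest() == expected_md5.lower()
```

A caller would download the file named by `URI`, pass its bytes and the `md5sum` value to `md5_matches`, and skip or report the database if either check fails.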

MagdalenaZZ commented 2 years ago

We need some tweaks to this. Please add Genus/Species and a contact email (e.g. help@wormbase.org) for the person to contact if the file is missing. Ideally, also add a reference to e.g. "affected genomic model", or some field the Alliance already has that would make each strain and assembly unique (so we can match it up with the JBrowse instance of the same genome).

scottcain commented 2 years ago

Yes, genus and species would help, though we might also want a more general "JBrowse ID". For example, at the Alliance we only need genus and species to construct a URL, e.g.

https://www.alliancegenome.org/jbrowse/?data=data%2FSaccharomyces%20cerevisiae

But at WormBase, we need the bioproject ID too:

https://wormbase.org/tools/genome/jbrowse-simple/?data=data%2Fc_elegans_PRJNA13758

which allows us to have more than one assembly and/or strain per species.
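The two URL patterns above could be built from metadata fields roughly as follows. This is only a sketch: the function names are hypothetical, and the WormBase dataset naming (first letter of the genus, lowercased, then species and bioproject joined by underscores) is inferred from the single `c_elegans_PRJNA13758` example, so it may not hold for every genome.

```python
from urllib.parse import quote

ALLIANCE_BASE = "https://www.alliancegenome.org/jbrowse/"
WORMBASE_BASE = "https://wormbase.org/tools/genome/jbrowse-simple/"


def alliance_jbrowse_url(genus, species):
    """Alliance pattern: genus and species are enough to name the dataset."""
    dataset = f"data/{genus} {species}"
    # safe="" makes quote() percent-encode "/" as %2F as well as the space.
    return f"{ALLIANCE_BASE}?data={quote(dataset, safe='')}"


def wormbase_jbrowse_url(genus, species, bioproject):
    """WormBase pattern: the bioproject ID distinguishes assemblies and strains."""
    # Assumption: dataset names follow g_species_BIOPROJECT, as in c_elegans_PRJNA13758.
    dataset = f"data/{genus[0].lower()}_{species}_{bioproject}"
    return f"{WORMBASE_BASE}?data={quote(dataset, safe='')}"
```

For instance, `wormbase_jbrowse_url("Caenorhabditis", "elegans", "PRJNA13758")` reproduces the WormBase link shown above.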

If we don't want to go down the path of putting in something like a JBrowse ID, then we'll probably want to have a separate mapping config to map between this info and what is needed to construct a JBrowse URL.

markquintontulloch commented 2 years ago

Seems like we should separate out the metadata key/value pairs to avoid duplication, as we do for AGR, so that we have a data array and a metaData hash.
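To make the idea concrete, here is a minimal Python sketch of one way to split shared values from per-file entries. Only the top-level names `data` and `metaData` come from the comment above; the function name `build_document` and the example keys `release` and `contact` are hypothetical, and the real layout of the attached file is not shown in this thread.

```python
def build_document(meta_data, entries):
    """Wrap per-file entries in a document that states shared values once.

    Any key in an entry whose value simply repeats the shared metaData value
    is dropped from that entry; differing values are kept as overrides.
    """
    data = []
    for entry in entries:
        trimmed = {
            key: value
            for key, value in entry.items()
            if key not in meta_data or meta_data[key] != value
        }
        data.append(trimmed)
    return {"metaData": dict(meta_data), "data": data}


# Hypothetical shared values and entries, for illustration only.
shared = {"release": "WS284", "contact": "help@wormbase.org"}
entries = [
    {"URI": "file_a.fa.gz", "seqtype": "nucl", "release": "WS284"},
    {"URI": "file_b.fa.gz", "seqtype": "prot", "release": "WS284"},
]
document = build_document(shared, entries)
```

Here each item in `document["data"]` keeps only its own `URI` and `seqtype`, while `release` and `contact` appear once under `document["metaData"]`.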

@scottcain - would something like the attached file (blast_meta.WS284.json.txt) work for you? It doesn't contain the JBrowse ID (as I suspect its format may be liable to change at the Alliance once we start thinking about multiple assemblies), but it contains everything you need to construct it.

scottcain commented 2 years ago

Yes, this is very much what I had in mind.

MagdalenaZZ commented 2 years ago

I kind of feel it would be nice to have an assembly ID, so that the assembly itself has, say, an AllianceID AGR:34538459. We could then just add species, strain, bioproject etc. to that, and still always know it is unique. Otherwise we are bound to run into these types of issues.

markquintontulloch commented 2 years ago

The metadata file is now available from a stable location on the EBI FTP site: ftp://ftp.ebi.ac.uk/pub/databases/wormbase/misc_datasets/AGR/blast_meta.wormbase.json

MagdalenaZZ commented 2 years ago

Wonderful! Thank you!