galaxyproject / brc-analytics

Reference data for all species #99

Open nekrut opened 2 days ago

nekrut commented 2 days ago

Sept 26, 2024

@natefoo started the process:

Sept 19, 2024

To make Galaxy useful for the BRC community, we need to integrate reference data for all 785 species by downloading sequence data and building indices for:

murphyte commented 1 day ago

Looping in @NullModel

In addition to the seqid issue I've mentioned before, where the VEuPathDB files refer to a sequence by an identifier currently unknown to NCBI, there is a set of genomes where VEuPathDB has listed a corresponding GCA or GCF accession, but the sequences are not the same. Some examples are:

  1. PrelictumSGS1-like | GCA_900005765.1. The sequence PRELSG_99_v1 appears to be a concatenation of all the individual unplaced sequences.
  2. PreichenowiCDC | GCA_000723685.1. The chromosomes do not match because of leading or trailing Ns. For example, HG810762.1 ~ PrCDC_01_v3, except for some leading or trailing Ns and some short runs of non-N bases in gaps. The submitter was likely required to trim contaminating adaptors at the ends and to drop contigs <200 bp, but gave VEuPathDB their pre-submission version.
  3. Plasmodium ovale | PovalecurtisiGH01 | GCA_900090035.2. The VEuPathDB genome includes API and MT sequences that are not in the GenBank assembly. The included sequences are similar but not quite identical to sequences available in GenBank: PocGH01_API_v2 ~ LT594596.1 with two Ns changed to sequence, and PocGH01_MIT_v2 ~ AB354571.1 with 1 mismatch.
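
For anyone wanting to triage pairs like these in bulk, here is a rough Python sketch of how the differences could be bucketed. It is only an illustration under assumptions: `strip_flanking_n` and `classify_pair` are hypothetical helper names, the sequences are assumed to be already loaded as plain strings, and it does not handle the short runs of non-N bases inside gaps mentioned in (2), which would need a real alignment.

```python
def strip_flanking_n(seq: str) -> str:
    """Upper-case the sequence and drop leading and trailing runs of N."""
    return seq.upper().strip("N")


def classify_pair(veupath_seq: str, genbank_seq: str) -> str:
    """Bucket a VEuPathDB/GenBank sequence pair by how they differ.

    Hypothetical categories, chosen for illustration only:
      identical          - same bases (case-insensitive)
      flanking_n_only    - same after trimming leading/trailing Ns
      same_length_K_diff - same trimmed length, K differing positions
      different          - anything else (needs a real alignment)
    """
    if veupath_seq.upper() == genbank_seq.upper():
        return "identical"
    v_core = strip_flanking_n(veupath_seq)
    g_core = strip_flanking_n(genbank_seq)
    if v_core == g_core:
        return "flanking_n_only"
    if len(v_core) == len(g_core):
        diffs = sum(1 for x, y in zip(v_core, g_core) if x != y)
        return f"same_length_{diffs}_diff"
    return "different"
```

A pair like PocGH01_MIT_v2 and AB354571.1 would presumably land in the `same_length_1_diff` bucket, but only if the single mismatch is a substitution rather than an indel; I have not run this on the real sequences.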

(1) and (2) are simple remapping problems, feasible to do for gene tracks, and the approach could be scaled to others. (3) is more problematic. We would need permission from the genome and organelle submitters to add an organelle to the GCA assembly, and because that particular assembly is managed through ENA, we would need them in the loop as well. We do not currently have a Plasmodium ovale genome in RefSeq. We may pick one of the two from VEuPathDB (none of the available genomes for this species is particularly good), and we could add the MIT and API to the GCF assembly if they pass QC. So there is a path to handle those cases, but (3) is more complicated than (1) and (2).
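
To make the "simple remapping" for case (1) concrete, here is a minimal Python sketch of mapping a position on a concatenated sequence back to the individual unplaced sequence it came from. Everything here is an assumption for illustration: the function names `build_starts` and `remap` are hypothetical, the component order and lengths would have to be worked out from the actual data, and I do not know whether PRELSG_99_v1 has spacers between components (the `spacer` parameter covers either case).

```python
from bisect import bisect_right


def build_starts(components, spacer=0):
    """Return the 0-based start offset of each component.

    components: list of (seqid, length) in concatenation order.
    spacer: assumed number of filler bases (e.g. Ns) between components.
    """
    starts = []
    pos = 0
    for _seqid, length in components:
        starts.append(pos)
        pos += length + spacer
    return starts


def remap(pos, components, starts):
    """Map a 0-based position on the concatenated sequence to
    (seqid, 0-based local position), or None if the position falls
    in a spacer or outside the concatenated sequence."""
    i = bisect_right(starts, pos) - 1
    if i < 0:
        return None
    seqid, length = components[i]
    local = pos - starts[i]
    if local >= length:
        return None
    return seqid, local


# Made-up component IDs and lengths, purely for illustration.
components = [("contig_a", 5), ("contig_b", 3)]
starts = build_starts(components, spacer=2)
```

With these toy values, `remap(7, components, starts)` gives `("contig_b", 0)`. Case (2) is the degenerate version of the same idea: one component per chromosome, with the number of trimmed leading Ns as the offset.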

We haven't yet worked up the full list of additional genomes we're aiming to add to RefSeq. We also plan to update the RefSeq annotations on some of these genomes where VEuPathDB had better data, which will require UCSC to reload their gene track from the GCF.