Open nekrut opened 2 days ago
Looping in @NullModel
In addition to the seqid issue I've mentioned before, where the VEuPathDB files refer to a sequence by an identifier currently unknown to NCBI, there is a set of genomes where VEuPathDB has listed a corresponding GCA or GCF accession, but the sequences are not the same. Some examples are:
PRELSG_99_v1 appears to be a concatenation of all the individual unplaced sequences.
(1) & (2) are a simple remapping problem, feasible to do for gene tracks and could be scaled to others. (3) is more problematic. We would need permission from the genome and organelle submitters to add an organelle to the GCA assembly, plus that particular assembly is managed through ENA so we'd need them in the loop as well. We do not currently have a Plasmodium ovale genome in RefSeq. We may pick one of the two from VEuPathDB (none of the available genomes for this species are particularly good), and we could add the MIT and API to the GCF assembly if they pass QC. So there's a path to handle those, but (3) is actually more complicated than (1) and (2).
We haven't yet worked up the full list of which additional genomes we're aiming to add to RefSeq. We also plan to update the RefSeq annotations on some of these genomes where VEuPathDB had better data, which will require UCSC reloading their gene track from the GCF.
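For the cases described above as a simple remapping problem, here is a rough sketch of what renaming sequence identifiers in a gene track could look like. It assumes the track is GFF3-like (tab-separated, seqid in the first column) and that a lookup table from VEuPathDB identifiers to NCBI accessions already exists. The function name `remap_gff_seqids` and the `alias_to_accession` table are made up for illustration and are not part of any existing tool. It only renames identifiers; it does not shift coordinates, so it would not handle a sequence that is a concatenation of several others.

```python
def remap_gff_seqids(lines, alias_to_accession):
    """Rewrite column 1 of GFF3-style feature lines using an alias map.

    `alias_to_accession` is a hypothetical dict mapping a source seqid
    to the accession used by the target assembly.
    Returns (remapped_lines, unknown_seqids).
    """
    remapped = []
    unknown = set()
    for line in lines:
        # Comments, directives, and blank lines pass through unchanged.
        # (A fuller version would also rewrite ##sequence-region directives.)
        if not line.strip() or line.startswith("#"):
            remapped.append(line)
            continue
        seqid, sep, rest = line.partition("\t")
        if seqid in alias_to_accession:
            remapped.append(alias_to_accession[seqid] + sep + rest)
        else:
            # Keep the line as-is but report the identifier for follow-up.
            unknown.add(seqid)
            remapped.append(line)
    return remapped, unknown


# Placeholder identifiers only; these are not real seqids or accessions.
example = [
    "##gff-version 3",
    "aliasA\tsrc\tgene\t1\t10\t.\t+\t.\tID=g1",
    "aliasB\tsrc\tgene\t5\t9\t.\t-\t.\tID=g2",
]
new_lines, missing = remap_gff_seqids(example, {"aliasA": "ACC_A"})
```

Returning the set of unmapped identifiers rather than raising makes it easy to list which sequences still need a decision, which seems to be the situation for the harder cases.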
Sept 26, 2024
@natefoo started the process:
assembly version ID
column in https://brc-analytics.org/data/organisms)
Sept 19, 2024
To make Galaxy useful for the BRC community, we need to integrate reference data for all 785 species by downloading sequence data and building indices for: