GMOD / jbrowse-components

Source code for JBrowse 2, a modern React-based genome browser
https://jbrowse.org/jb2
Apache License 2.0

Add NCBI sequence_report.tsv alias adapter, with ability to recode NCBI fasta files to use UCSC style names #4516

Closed cmdcolin closed 1 month ago

cmdcolin commented 1 month ago

FASTA files from NCBI generally name their sequences by RefSeq ID, e.g. human chr1 is NC_000001.11 and is listed in the FASTA file as

>NC_000001.11
....

The sequence_report.tsv file provides a column of UCSC-style chromosome names, which gives e.g. "chr1" for this ID.
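As a rough illustration of the mapping this file provides, the sketch below parses tab-separated text into the {refName, aliases} shape. It is not the adapter code from this PR: `parseSequenceReport` is a hypothetical helper, and the two header names used in the example ("RefSeq seq accession" and "UCSC style name") are assumptions that should be checked against a real sequence_report.tsv.

```typescript
// One entry per sequence: a primary name plus any alternative names.
interface RefNameAlias {
  refName: string
  aliases: string[]
}

// Hypothetical helper: parse tab-separated text that starts with a header row.
// The caller supplies the header names of the two columns of interest.
function parseSequenceReport(
  tsv: string,
  refNameColumn: string,
  aliasColumn: string,
): RefNameAlias[] {
  const lines = tsv.split(/\r?\n/).filter(line => line.trim() !== '')
  if (lines.length === 0) {
    return []
  }
  const header = lines[0].split('\t')
  const refIdx = header.indexOf(refNameColumn)
  const aliasIdx = header.indexOf(aliasColumn)
  if (refIdx === -1 || aliasIdx === -1) {
    throw new Error(`expected columns "${refNameColumn}" and "${aliasColumn}"`)
  }
  const result: RefNameAlias[] = []
  for (const line of lines.slice(1)) {
    const cols = line.split('\t')
    const refName = cols[refIdx]
    const alias = cols[aliasIdx]
    // Skip rows without a primary name; keep rows that have no alias.
    if (refName) {
      result.push({ refName, aliases: alias ? [alias] : [] })
    }
  }
  return result
}

// Assumed header names and a two-row sample (not a real NCBI download).
const exampleReport = [
  'RefSeq seq accession\tUCSC style name',
  'NC_000001.11\tchr1',
  'NC_000002.12\tchr2',
].join('\n')

const exampleAliases = parseSequenceReport(
  exampleReport,
  'RefSeq seq accession',
  'UCSC style name',
)
```

Here `exampleAliases[0]` pairs NC_000001.11 with the alias chr1, which is the lookup the FASTA names need.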

This PR allows a sequence_report.tsv file to be provided as a refNameAliases adapter, and also adds the ability to recode the sequence names of the FASTA file using the names from the refNameAliases adapter.

This is a somewhat tricky operation. It takes advantage of the fact that the refNameAliasAdapter returns an array of {refName: string, aliases: string[]}.

So we can choose to accept the refNames returned by the RefNameAliasAdapter as the primary set of refNames for the assembly, rather than the ones returned by getRefNames from the assembly's sequence adapter.
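The idea can be sketched roughly as follows. This is a minimal illustration under assumptions, not the code in this PR: `recodeRefNames` is a hypothetical function, and it assumes the alias entries carry the UCSC-style name as `refName` with the FASTA name among the `aliases`. It also keeps a reverse map so that sequence requests can still be made with the name the FASTA file actually uses.

```typescript
interface RefNameAlias {
  refName: string
  aliases: string[]
}

// Hypothetical sketch: prefer the alias adapter's refName as the primary
// name for each sequence, falling back to the sequence adapter's own name.
function recodeRefNames(
  sequenceRefNames: string[],
  aliasEntries: RefNameAlias[],
): { refNames: string[]; toSequenceName: Map<string, string> } {
  // Index every alias entry by its refName and by each of its aliases.
  const byAnyName = new Map<string, RefNameAlias>()
  for (const entry of aliasEntries) {
    byAnyName.set(entry.refName, entry)
    for (const alias of entry.aliases) {
      byAnyName.set(alias, entry)
    }
  }

  const refNames: string[] = []
  // Maps the primary (displayed) name back to the name in the FASTA file.
  const toSequenceName = new Map<string, string>()
  for (const seqName of sequenceRefNames) {
    const primary = byAnyName.get(seqName)?.refName ?? seqName
    refNames.push(primary)
    toSequenceName.set(primary, seqName)
  }
  return { refNames, toSequenceName }
}

// The FASTA reports NC_000001.11; the alias entry says its primary name is chr1.
// 'unplaced_1' is a made-up name with no alias entry, so it passes through.
const recoded = recodeRefNames(
  ['NC_000001.11', 'unplaced_1'],
  [{ refName: 'chr1', aliases: ['NC_000001.11'] }],
)
```

In this sketch `recoded.refNames` holds chr1 and unplaced_1, and `recoded.toSequenceName.get('chr1')` gives back NC_000001.11 for fetching sequence from the file.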

cmdcolin commented 1 month ago

fixes https://github.com/GMOD/jbrowse-components/issues/4471