FASTA files from NCBI generally use RefSeq accessions as sequence names. For example, human chr1 is NC_000001.11 and is listed in the FASTA file as:
>NC_000001.11
....
The sequence_report.tsv file provides a column of UCSC chromosome names, which gives e.g. "chr1" for this ID.
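
To illustrate the lookup this enables, here is a minimal sketch (not the code in this PR) that reads a sequence_report.tsv and maps each RefSeq accession to its UCSC-style name. The helper name `parseSequenceReport` is hypothetical, and the two header strings in the usage are assumptions: check them against the header line of your own file.

```typescript
// Hypothetical helper: build a RefSeq accession -> UCSC name lookup from the
// text of a sequence_report.tsv. Column headers are passed in because the
// exact header strings are an assumption here.
function parseSequenceReport(
  tsv: string,
  refSeqColumn: string,
  ucscColumn: string,
): Map<string, string> {
  const lines = tsv.split(/\r?\n/).filter(line => line.trim() !== '')
  const header = lines[0]?.split('\t') ?? []
  const refSeqIdx = header.indexOf(refSeqColumn)
  const ucscIdx = header.indexOf(ucscColumn)
  if (refSeqIdx === -1 || ucscIdx === -1) {
    throw new Error(`missing column: ${refSeqColumn} or ${ucscColumn}`)
  }
  const result = new Map<string, string>()
  for (const line of lines.slice(1)) {
    const cols = line.split('\t')
    const refSeq = cols[refSeqIdx]
    const ucsc = cols[ucscIdx]
    // Skip rows where either value is missing or empty
    if (refSeq && ucsc) {
      result.set(refSeq, ucsc)
    }
  }
  return result
}

// Tiny in-memory sample; the header names are assumed, not taken from this PR
const sample = [
  'RefSeq seq accession\tUCSC style name',
  'NC_000001.11\tchr1',
].join('\n')
const names = parseSequenceReport(
  sample,
  'RefSeq seq accession',
  'UCSC style name',
)
console.log(names.get('NC_000001.11')) // prints "chr1"
```

Passing the column headers as arguments keeps the sketch independent of the exact header strings in a given report.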
This PR allows a sequence_report.tsv to be used as a refNameAliases adapter, and also adds the ability to re-code the names in the FASTA file using the names from the refNameAliases adapter.
This is a somewhat tricky operation. It takes advantage of the fact that the refNameAliases adapter returns an array of {refName: string, aliases: string[]}, so we can choose to accept the refNames that the refNameAliases adapter returns as the primary set of refNames for the assembly, rather than the ones returned by getRefNames from the assembly sequence adapter.
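
The re-coding idea can be sketched roughly as follows. This is an illustration under assumptions, not the implementation in this PR: `recodeRefNames` and `RefNameAlias` are hypothetical names, and the sketch assumes the FASTA name appears among the aliases of its record. It also keeps a reverse map, since the sequence adapter still has to be queried with the original FASTA name.

```typescript
// Shape described above for what the alias adapter returns
interface RefNameAlias {
  refName: string
  aliases: string[]
}

// Hypothetical helper: prefer the alias adapter's refName over the name
// reported by the sequence adapter, and remember how to map back.
function recodeRefNames(
  sequenceRefNames: string[],
  aliasRecords: RefNameAlias[],
): { refNames: string[]; toSequenceName: Map<string, string> } {
  // Index every known name (primary or alias) by its primary refName
  const primaryByName = new Map<string, string>()
  for (const { refName, aliases } of aliasRecords) {
    for (const name of [refName, ...aliases]) {
      primaryByName.set(name, refName)
    }
  }
  const refNames: string[] = []
  const toSequenceName = new Map<string, string>()
  for (const seqName of sequenceRefNames) {
    // Fall back to the FASTA name when no alias record mentions it
    const primary = primaryByName.get(seqName) ?? seqName
    refNames.push(primary)
    toSequenceName.set(primary, seqName)
  }
  return { refNames, toSequenceName }
}

// 'unplaced_1' is a made-up name with no alias record, to show the fallback
const recoded = recodeRefNames(
  ['NC_000001.11', 'unplaced_1'],
  [{ refName: 'chr1', aliases: ['NC_000001.11'] }],
)
console.log(recoded.refNames) // [ 'chr1', 'unplaced_1' ]
console.log(recoded.toSequenceName.get('chr1')) // NC_000001.11
```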