Open hpages opened 1 year ago
Commit 959d4b97cfcb305fc1dec1553ab52391a9ad52ea tries to improve this. With this change, we now get:
BSgenomeForge:::abbreviate_organism_name("Torque teno virus 1")
# [1] "Tvirus1"
The idea is to retain and pack together any digits that come after the last non-digit/non-whitespace character:
BSgenomeForge:::abbreviate_organism_name("Abc def xy 55 6 77 ")
# [1] "Axy55677"
BSgenomeForge:::abbreviate_organism_name("Abc def 1 22 xy 55 6 77 ")
# [1] "Axy55677"
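To make the digit-packing rule concrete, here is a minimal sketch of just that step. Note that pack_trailing_digits() is a hypothetical helper written for illustration; it is not a BSgenomeForge function and is not necessarily how the commit implements the rule.

```r
# Hypothetical helper (not part of BSgenomeForge): extract the digits that
# come after the last non-digit/non-whitespace character and pack them
# together.
pack_trailing_digits <- function(organism) {
    # Drop everything up to and including the last character that is
    # neither a digit nor whitespace.
    tail <- sub("^.*[^[:digit:][:space:]]", "", organism)
    # Remove the whitespace between (and around) the remaining digits.
    gsub("[[:space:]]+", "", tail)
}

pack_trailing_digits("Torque teno virus 1")
# [1] "1"
pack_trailing_digits("Abc def 1 22 xy 55 6 77 ")
# [1] "55677"
```

The digits in "1 22" are ignored because they are followed by "xy", which matches the second example above.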
We still need to look for other exotic organism names at NCBI and see what happens with them.
Arghh.. it looks like NCBI has some organism names that contain parentheses, e.g. Arthrobacter phage DrManhattan (viruses). See https://www.ncbi.nlm.nih.gov/assembly/GCF_003692155.1/
Right now BSgenomeForge:::abbreviate_organism_name() does a terrible job on this:
> BSgenomeForge:::abbreviate_organism_name("Arthrobacter phage DrManhattan (viruses)")
[1] "A(viruses)"
This ain't good because forgeBSgenomeDataPkgFromNCBI() will use this abbreviated name to construct the name of the corresponding BSgenome data package:
> BSgenomeForge:::.create_pkgname("A(viruses)", "ASM369215v1")
[1] "BSgenome.A(viruses).NCBI.ASM369215v1"
Ouch! This package name is invalid (parentheses are not allowed).
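For reference, "Writing R Extensions" requires a package name to contain only ASCII letters, digits and dots, to have at least two characters, to start with a letter, and not to end in a dot. The sketch below checks a candidate name against that rule. is_valid_pkgname() is a hypothetical helper, and stripping non-alphanumeric characters from the abbreviation is only one possible workaround, not what BSgenomeForge does.

```r
# Hypothetical helper (not part of BSgenomeForge): check a name against the
# package naming rule from "Writing R Extensions".
is_valid_pkgname <- function(pkgname) {
    grepl("^[A-Za-z][A-Za-z0-9.]*[A-Za-z0-9]$", pkgname, perl = TRUE)
}

is_valid_pkgname("BSgenome.A(viruses).NCBI.ASM369215v1")
# [1] FALSE

# One possible workaround (an assumption, not the package's behavior):
# drop every non-alphanumeric character from the abbreviated name.
abbr <- gsub("[^A-Za-z0-9]", "", "A(viruses)")
abbr
# [1] "Aviruses"

# Rebuild the package name in the same shape as the output shown above.
is_valid_pkgname(paste("BSgenome", abbr, "NCBI", "ASM369215v1", sep = "."))
# [1] TRUE
```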
Another problematic organism name (ASM19595v2 assembly):
> BSgenomeForge:::abbreviate_organism_name("Mycobacterium tuberculosis H37Rv")
[1] "MH37Rv"
Works "fine" on organism names made of 2 parts (this is the most common situation):
and on organism names made of 3 parts like "Canis lupus familiaris":
but not so much on something like "Torque teno virus 1":
Note that it's been a "tradition" so far to embed the abbreviated organism name in the name of BSgenome data packages e.g.
The idea is that the package name can hint at what organism the package is about. However, for something like
the hint is a tough one!
So there's no clear rule for how the organism name should be abbreviated, as long as it's still "recognizable". Yes, this is all very subjective :disappointed: