Bioconductor / BSgenomeForge

Tools to forge BSgenome data packages

Try to improve BSgenomeForge:::abbreviate_organism_name() #18

Open hpages opened 1 year ago

hpages commented 1 year ago

abbreviate_organism_name() works "fine" on organism names made of 2 parts (the most common situation):

> BSgenomeForge:::abbreviate_organism_name("Homo sapiens")
[1] "Hsapiens"

> BSgenomeForge:::abbreviate_organism_name("Felis catus")
[1] "Fcatus"

and on organism names made of 3 parts like "Canis lupus familiaris":

> BSgenomeForge:::abbreviate_organism_name("Canis lupus familiaris")
[1] "Cfamiliaris"

but not so much on something like "Torque teno virus 1":

> BSgenomeForge:::abbreviate_organism_name("Torque teno virus 1")
[1] "T1"

Note that it's been a "tradition" so far to embed the abbreviated organism name in the name of BSgenome data packages e.g.

BSgenome.Hsapiens.NCBI.GRCh38
BSgenome.Cfamiliaris.UCSC.canFam3

The idea is that the package name can hint at what organism the package is about. However, for something like

BSgenome.T1.NCBI.ViralProj15247

the hint is a tough one!

So there's no clear rule for how the organism name should be abbreviated, as long as it's still "recognizable". Yes, this is all very subjective :disappointed:

hpages commented 1 year ago

Commit 959d4b97cfcb305fc1dec1553ab52391a9ad52ea tries to improve this. With this change, now we get:

BSgenomeForge:::abbreviate_organism_name("Torque teno virus 1")
# [1] "Tvirus1"

The idea is to retain all the digits that come after the last non-digit, non-whitespace character and to pack them together at the end of the abbreviation:

BSgenomeForge:::abbreviate_organism_name("Abc def xy  55  6 77 ")
# [1] "Axy55677"

BSgenomeForge:::abbreviate_organism_name("Abc def 1 22 xy  55  6 77 ")
# [1] "Axy55677"
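
For readers who want to experiment with this rule, here is a minimal sketch of it in base R. It is not the code from the commit: `abbreviate_sketch()` is a hypothetical name, and the handling of single-word names is an assumption, since the thread does not show that case. It simply takes the first letter of the first part, the last part that is not purely numeric, and any purely numeric parts that follow it.

```r
# Hypothetical helper for illustration only; NOT the BSgenomeForge implementation.
abbreviate_sketch <- function(organism) {
    # Split on runs of whitespace after trimming leading/trailing blanks.
    parts <- strsplit(trimws(organism), "[[:space:]]+")[[1]]
    is_digits <- grepl("^[0-9]+$", parts)
    # We need at least one part that is not made of digits only.
    stopifnot(any(!is_digits))
    # Position of the last part that is not purely numeric.
    last_word <- max(which(!is_digits))
    # Purely numeric parts that come after it get packed together.
    trailing <- parts[seq_along(parts) > last_word]
    if (last_word == 1L) {
        # Assumption: a single-word name is kept whole.
        return(paste0(c(parts[1L], trailing), collapse = ""))
    }
    first_letter <- toupper(substr(parts[1L], 1L, 1L))
    paste0(c(first_letter, parts[last_word], trailing), collapse = "")
}

abbreviate_sketch("Torque teno virus 1")
# [1] "Tvirus1"

abbreviate_sketch("Abc def 1 22 xy  55  6 77 ")
# [1] "Axy55677"
```

Because the sketch only looks at whitespace-separated parts, any punctuation inside a part is carried over unchanged into the abbreviation.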

We still need to look for other exotic organism names at NCBI and see what happens with them.

hpages commented 1 year ago

Arghh.. it looks like NCBI has some organism names that contain parentheses e.g. Arthrobacter phage DrManhattan (viruses). See https://www.ncbi.nlm.nih.gov/assembly/GCF_003692155.1/

Right now BSgenomeForge:::abbreviate_organism_name() does a terrible job on this:

> BSgenomeForge:::abbreviate_organism_name("Arthrobacter phage DrManhattan (viruses)")
[1] "A(viruses)"

This ain't good because forgeBSgenomeDataPkgFromNCBI() will use this abbreviated name to construct the name of the corresponding BSgenome data package:

> BSgenomeForge:::.create_pkgname("A(viruses)", "ASM369215v1")
[1] "BSgenome.A(viruses).NCBI.ASM369215v1"

Ouch! This package name is invalid (parentheses are not allowed).
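
One possible guard, sketched below, is to strip anything that is not an ASCII letter or digit from the abbreviation before building the package name, and then check the result against the documented rules for R package names (ASCII letters, digits and dots only, at least two characters, starting with a letter and not ending in a dot). Both `sanitize_abbreviation()` and `is_valid_pkgname()` are hypothetical helpers, not functions from BSgenomeForge, and the package name is assembled by hand here rather than with `.create_pkgname()`.

```r
# Hypothetical helpers for illustration only; not part of BSgenomeForge.

# TRUE if 'pkgname' follows the naming rules for R packages.
is_valid_pkgname <- function(pkgname) {
    grepl("^[A-Za-z][A-Za-z0-9.]*[A-Za-z0-9]$", pkgname)
}

# Drop every character that is not an ASCII letter or digit.
sanitize_abbreviation <- function(abbrev) {
    gsub("[^A-Za-z0-9]", "", abbrev)
}

is_valid_pkgname("BSgenome.A(viruses).NCBI.ASM369215v1")
# [1] FALSE

cleaned <- sanitize_abbreviation("A(viruses)")
pkgname <- paste("BSgenome", cleaned, "NCBI", "ASM369215v1", sep = ".")
pkgname
# [1] "BSgenome.Aviruses.NCBI.ASM369215v1"

is_valid_pkgname(pkgname)
# [1] TRUE
```

This only makes the name valid. "Aviruses" is still a poor hint at the organism, so the abbreviation logic itself would still need to deal with the parenthesized suffix.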

hpages commented 1 year ago

Another problematic organism name (ASM19595v2 assembly):

> BSgenomeForge:::abbreviate_organism_name("Mycobacterium tuberculosis H37Rv")
[1] "MH37Rv"