ababaian / palmdb

Database of virus RdRp barcode sequences
Creative Commons Zero v1.0 Universal

[palmdb4] Generate sOTU Nickname File `sOTU.nickname.list` - Populate to 10 million entries #8

Closed almosnow closed 2 months ago

almosnow commented 2 months ago

Get adverbs list/code from @ababaian

ababaian commented 2 months ago

Every distinct sOTU in palmDB is identified by a unique number, such as u16. This is not always helpful for humans to read or remember, so I implemented a nickname system that assigns a random, colorful nickname to each sOTU to make it more memorable.

The format of the palmprint nickname is <adjective><Noun>. The adjective and the noun are each one word long, and the pair is written in camel case as adjNoun. The lists of English adjectives and nouns were taken from the WordNet (https://wordnet.princeton.edu/) data.adj and data.noun files. The parsed list of adjectives and nouns is attached here: wordnet.adjNoun.zip
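As a rough illustration of the adjNoun format described above, here is a minimal Python sketch. The function name `make_nickname` is hypothetical and is not taken from the palmdb code; it only shows how one adjective and one noun could be joined in camel case.

```python
def make_nickname(adjective: str, noun: str) -> str:
    """Join one adjective and one noun as adjNoun (lower camel case).

    Hypothetical helper for illustration; not the palmdb implementation.
    """
    adjective = adjective.strip().lower()
    noun = noun.strip().lower()
    # Only the first letter of the noun is capitalized.
    return adjective + noun[:1].upper() + noun[1:]


if __name__ == "__main__":
    print(make_nickname("pickled", "cuticle"))
```

With lowercase single-word inputs, the capital letter marks the boundary between the two words, so distinct adjective/noun pairs give distinct nicknames.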

To Close

To do this, extract all currently assigned nicknames from palmdb2 and dump them to a 3-column, ordered TSV file (ordered by increasing sOTU number).

sOTU.nickname.list example

nickid    sotu     nickname
1         u16      skyKing
...
x         u47468   unposedSave
...
n         u301630  phallicUpdate
n+1       NA       raisedCurrency
...
10000000  NA       pickledCuticle

This file should be populated from the adjective/noun list with up to ~10 million unique (non-repeating) nicknames. Unassigned nicknames are designated NA in the sotu column. This file will be used to create the palmdb table's nickname column. There is no need to create a table from this file; we can simply update the file as we update palmdb versions.
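The file layout described above could be produced roughly as follows. This is only a sketch under stated assumptions: the names `build_nickname_rows` and `write_nickname_list` are hypothetical, the already-assigned (sotu, nickname) pairs and the word lists are assumed to be loaded in memory, and random sampling with rejection is just one possible way to pick unused nicknames.

```python
import random
from typing import Iterable, List, TextIO, Tuple

Row = Tuple[int, str, str]  # (nickid, sotu, nickname)


def build_nickname_rows(assigned: Iterable[Tuple[str, str]],
                        adjectives: Iterable[str],
                        nouns: Iterable[str],
                        total: int,
                        seed: int = 0) -> List[Row]:
    """Assigned nicknames first (ordered by sOTU number), then NA rows."""
    rng = random.Random(seed)
    adjs = sorted({a.strip().lower() for a in adjectives if a.strip()})
    nns = sorted({n.strip().lower() for n in nouns if n.strip()})
    used = set()
    rows: List[Row] = []
    # Existing assignments, sorted by the integer after the "u" prefix.
    for sotu, nickname in sorted(assigned, key=lambda r: int(r[0].lstrip("u"))):
        if nickname in used:
            raise ValueError(f"duplicate assigned nickname: {nickname}")
        used.add(nickname)
        rows.append((len(rows) + 1, sotu, nickname))
    # Conservative capacity check so the sampling loop below can finish.
    if total - len(rows) > len(adjs) * len(nns) - len(used):
        raise ValueError("word lists too small for the requested total")
    # Fill up to `total` with unused random nicknames, marked NA.
    while len(rows) < total:
        noun = rng.choice(nns)
        nickname = rng.choice(adjs) + noun[:1].upper() + noun[1:]
        if nickname not in used:
            used.add(nickname)
            rows.append((len(rows) + 1, "NA", nickname))
    return rows


def write_nickname_list(rows: Iterable[Row], handle: TextIO) -> None:
    """Write the 3-column TSV with a header line."""
    handle.write("nickid\tsotu\tnickname\n")
    for nickid, sotu, nickname in rows:
        handle.write(f"{nickid}\t{sotu}\t{nickname}\n")
```

A fixed seed keeps the unassigned part of the list reproducible between runs, which may matter if the file is regenerated when palmdb versions are updated.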

almosnow commented 2 months ago

Found several duplicate nicknames assigned on the current palmdb2 table.


almosnow commented 2 months ago

SELECT COUNT(nickname) FROM palmdb2 WHERE centroid = true LIMIT 8;

513,176

SELECT COUNT(DISTINCT(nickname)) FROM palmdb2 WHERE centroid = true LIMIT 8;

512,261

So about a thousand duplicates (513,176 - 512,261 = 915), not huge.

How to proceed?

Suggestion: re-assign new nicknames to those.
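One way the re-assignment suggested above might look, as a hedged Python sketch: the first sOTU to carry a nickname keeps it, and each later duplicate receives a fresh, unused nickname. The function `reassign_duplicates` and its inputs are hypothetical, and the commit linked in this thread may handle this differently.

```python
from typing import Iterable, List, Tuple


def reassign_duplicates(records: List[Tuple[str, str]],
                        fresh_nicknames: Iterable[str]) -> List[Tuple[str, str]]:
    """Keep the first use of each nickname; replace later duplicates.

    `records` is a list of (sotu, nickname) pairs. `fresh_nicknames` is any
    iterable of candidate replacements (an assumption for this sketch).
    """
    taken = {nickname for _, nickname in records}  # every nickname in use
    fresh = iter(fresh_nicknames)
    seen = set()
    result = []
    for sotu, nickname in records:
        if nickname in seen:
            # Pull candidates until one is not already in use.
            nickname = next((n for n in fresh if n not in taken), None)
            if nickname is None:
                raise ValueError("ran out of fresh nicknames")
            taken.add(nickname)
        seen.add(nickname)
        result.append((sotu, nickname))
    return result
```

Keeping the first occurrence means most sOTUs retain the nickname they already have; only the surplus rows change.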

almosnow commented 2 months ago

Done.

The file is about half a GB, so I won't upload it here, but it's ready to be used whenever needed.

Github https://github.com/serratus-bio/logan-backend/commit/edd9d782a34a2ec28202cc716ba59c147a9f2c9a