GenomicsStandardsConsortium / mixs

“Minimum Information about any (X) Sequence” (MIxS) specification
https://w3id.org/mixs
Creative Commons Zero v1.0 Universal

We are misusing the string serialization slot #388

Open turbomam opened 2 years ago

turbomam commented 2 years ago

a string serialization of '{float} {unit}' implies that there are float and unit classes

See also LinkML issue https://github.com/linkml/linkml/issues/674

Switch to LinkML structured patterns
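
As a rough illustration of what a structured pattern buys us: each placeholder such as {float} or {unit} is looked up in a table of named regular expressions and expanded into one combined pattern, so no float or unit class is implied. The sketch below mimics that idea in plain Python. It is not LinkML's implementation, and the `SETTINGS` names and regexes are assumptions chosen for the example, not definitions taken from MIxS.

```python
import re

# Hypothetical settings: these names and regexes are illustrative only,
# not the definitions used by MIxS or LinkML.
SETTINGS = {
    "float": r"[-+]?[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?",
    "unit": r"\S+",
}


def interpolate(syntax: str, settings: dict[str, str]) -> str:
    """Replace each {name} placeholder in syntax with its regex from settings."""

    def expand(match: re.Match) -> str:
        name = match.group(1)
        # Fail loudly on placeholders that have no definition.
        if name not in settings:
            raise KeyError(f"no setting defined for {{{name}}}")
        return f"(?:{settings[name]})"

    return re.sub(r"\{([^{}]+)\}", expand, syntax)


pattern = re.compile(interpolate("{float} {unit}", SETTINGS))

print(bool(pattern.fullmatch("12.5 mg")))
print(bool(pattern.fullmatch("twelve mg")))
```

With these assumed settings, a value like "12.5 mg" validates and "twelve mg" does not, while an undefined placeholder such as {text} raises an error instead of passing silently.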

See also

turbomam commented 2 years ago

Semi-related

Some string serializations are really just lists and could be re-implemented as enumerations

slot                 string_serialization
aero_struc           [plane|glider]
built_struc_set      [urban|rural]
ceil_struc           [wood frame|concrete]
contam_screen_input  [reads| contigs]
detec_type           [independent sequence (UViG)|provirus (UpViG)]
fireplace_type       [gas burning|wood burning]
heat_sys_deliv_meth  [conductive|radiant]
host_dependence      [facultative|obligate]
seq_quality_check    [none|manually edited]
shading_device_loc   [exterior|interior]
space_typ_state      [typically occupied|typically unoccupied]
sym_life_cycle_type  [complex life cycle | simple life cycle]
urine_collect_meth   [clean catch|catheter]
wga_amp_appr         [pcr based|mda based]
window_status        [closed|open]
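
A minimal sketch of how the bracketed lists above could be turned into candidate enumeration values. The function name `serialization_to_values` is hypothetical, and trimming the stray whitespace around `|` (as in contam_screen_input and sym_life_cycle_type) is an assumption about the intended values, not something the serializations themselves state.

```python
def serialization_to_values(serialization: str) -> list[str]:
    """Split a '[a|b]' string serialization into trimmed value strings."""
    text = serialization.strip()
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError(f"not a bracketed list: {serialization!r}")
    # Drop the surrounding brackets, split on the pipe, trim each value.
    return [value.strip() for value in text[1:-1].split("|")]


# A few rows copied from the table above.
examples = {
    "aero_struc": "[plane|glider]",
    "contam_screen_input": "[reads| contigs]",
    "sym_life_cycle_type": "[complex life cycle | simple life cycle]",
}

for slot, serialization in examples.items():
    print(slot, serialization_to_values(serialization))
```

Each resulting list could then be reviewed by hand before being written out as the permissible values of an enumeration for that slot.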
turbomam commented 2 years ago

counts of structured pattern elements in MIxS

string_ser counts
{text} 239
{float} 90
{unit} 86
{[termID]} 75
{termLabel} 74
{URL} 35
{PMID} 34
{DOI} 34
{integer} 30
{Rn/start_time/end_time/duration} 26
{boolean} 15
{version} 13
{software} 11
{duration} 11
{parameters} 9
{term} 8
{timestamp} 7
{dna} 7
{PMID|DOI|URL} 3
{period} 2
{term label} 2
{NCBI taxid} 2
{rank name} 2
{database} 2
{clustering method} 1
{AF cutoff} 1
{ANI cutoff} 1
{PID} 1
{{text} 1
{day} 1
{term ID} 1
{measurement value} 1
{percentage} 1
{reference} 1
{interval} 1
{has numeric value} 1
{has unit} 1
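
For anyone who wants to reproduce a tally like this, here is a sketch of one way to count brace-wrapped tokens. The non-greedy regex is an assumption, not necessarily how the counts above were produced, but it would explain the odd `{{text}` row: on the nested sieving serialization it matches from the first `{` to the first `}`.

```python
import re
from collections import Counter

# Assumed tokenizer: a non-greedy match from "{" to the next "}".
TOKEN = re.compile(r"\{.*?\}")


def count_tokens(serializations: list[str]) -> Counter:
    """Count every brace-wrapped token across the given serializations."""
    counts: Counter = Counter()
    for serialization in serializations:
        counts.update(TOKEN.findall(serialization))
    return counts


# Serializations quoted elsewhere in this issue.
samples = [
    "{float} {unit}",
    "{{text}|{float} {unit}};{float} {unit}",
    "{PMID|DOI|URL}",
]

counts = count_tokens(samples)
for token, n in counts.most_common():
    print(token, n)
```

Because the matcher knows nothing about nesting, the unbalanced token falls out of the regex rather than the data, which is one more argument for giving these strings a defined grammar.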
turbomam commented 2 years ago

Non-alpha characters in the tokens above

Also not including whitespace

count  char  notes
2      _     separates words in a token's name
1      [     literal used with term IDs, like mountain [ENVO:12345678]
1      ]     literal used with term IDs, like mountain [ENVO:12345678]
38     {     wraps token. Also, see sieving below
37     }     wraps token
3      /     delimits sub-tokens in {Rn/start_time/end_time/duration}
2      |     delimits alternative tokens for literature references

sieving

{{text}|{float} {unit}};{float} {unit}

literature references

{PMID|DOI|URL}
turbomam commented 2 years ago

Should clarify the differences between

Canonical

Variants

turbomam commented 2 years ago

see also https://github.com/microbiomedata/mixs/pull/37

ddooley commented 9 months ago

Is this all standardized now, or is there outstanding work vis-à-vis MIxS or LinkML?

ramonawalls commented 9 months ago

Good question, @ddooley. @turbomam, could you provide an update?