genomehubs / goat-data

Fill missing levels in variables for `goat-cli` #4

Open Euphrasiologist opened 2 years ago

Euphrasiologist commented 2 years ago

In pulling the public JSON:

curl -X 'GET' \
'https://goat.genomehubs.org/api/v0.0.1/resultFields?result=taxon&taxonomy=ncbi' \
-H 'accept: application/json' > vars.json 2> /dev/null

jq . vars.json

I can get all of the variables, which is nice. However, some variables have no enum in their constraint, which hinders some useful parsing in goat-cli. Full list:

To be explicit, biosample (rendered as Markdown) has a constraint with a length of 32, but no enumerated values:

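| group | name | constraint | display_group | organelle | separator | source | source_url_stub | type | display_level | display_name | key | summary | traverse | traverse_direction |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| taxon | biosample | {len: 32} | assembly | nucleus | [;] | NCBI Datasets | https://www.ncbi.nlm.nih.gov/assembly/ | keyword | 2 | Biosample | biosample | [list] | list | up |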

whereas family_representative does:

| group | name | display_group | display_level | constraint | summary | traverse | traverse_direction | type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| taxon | family_representative | target_lists | 2 | {enum: [asg, cbp, ebpn, cfgp, dtol, ebpn, endemixit, erga, eurofish, gaga, squalomix, metainvert, vgp, agi, arg, gap, gbr, omg, tsi, b10k]} | list | list | up | keyword |
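
For reference, the affected fields could be listed automatically instead of by hand. This is a minimal sketch, assuming the response has a top-level `fields` object that maps each field name to its metadata, with an optional `constraint` entry. That layout is inferred from the two rows above and is not confirmed; `SAMPLE` and `keyword_fields_without_enum` are hypothetical names.

```python
import json

# Hypothetical sample mirroring the two rows above. The real response layout
# (a top-level "fields" object) is an assumption, and the enum is shortened.
SAMPLE = json.loads("""
{
  "fields": {
    "biosample": {
      "group": "taxon",
      "type": "keyword",
      "constraint": {"len": 32}
    },
    "family_representative": {
      "group": "taxon",
      "type": "keyword",
      "constraint": {"enum": ["asg", "cbp", "dtol"]}
    }
  }
}
""")


def keyword_fields_without_enum(result_fields):
    """Return the sorted names of keyword fields whose constraint has no enum."""
    missing = []
    for name, meta in result_fields.get("fields", {}).items():
        if meta.get("type") != "keyword":
            continue  # only keyword fields are expected to carry an enum
        constraint = meta.get("constraint") or {}
        if not constraint.get("enum"):
            missing.append(name)
    return sorted(missing)


print(keyword_fields_without_enum(SAMPLE))  # prints ['biosample']
```

To run this against the real data, `SAMPLE` would be replaced with the parsed contents of `vars.json`, once the actual shape of the response has been checked.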

As you mentioned @rjchallis:

"But part of the problem here is that it doesn't make sense to use an enum to restrict the input values for fields like bioproject and biosample as the potential list is so long. I think a better solution may be to apply a regex constraint on these fields, or to export the list of unique values from the index (this can be cached so only needs to be generated once per release) either as part of this endpoint or something similar to the sources report that includes counts per value."
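
Both options from that quote can be sketched roughly as follows. This is only an illustration under stated assumptions: the `regex` constraint key does not exist in the current API, `len` is treated as a maximum length, the BioSample pattern follows the INSDC accession format as I understand it, and `is_valid` and `value_counts` are hypothetical helpers, not part of goat-cli or GoaT.

```python
import re
from collections import Counter

# Hypothetical constraint: a "regex" key next to the existing "len".
# The pattern is an illustrative guess at the BioSample accession format.
BIOSAMPLE_CONSTRAINT = {"len": 32, "regex": r"SAM[EDN][A-Z]?[0-9]+"}


def is_valid(value, constraint):
    """Check a value against whichever of enum, len and regex are present."""
    if "enum" in constraint and value not in constraint["enum"]:
        return False
    # Assumption: "len" is a maximum length for the value.
    if "len" in constraint and len(value) > constraint["len"]:
        return False
    if "regex" in constraint and not re.fullmatch(constraint["regex"], value):
        return False
    return True


def value_counts(values):
    """Return (value, count) pairs, most frequent first, for a cached export."""
    return Counter(values).most_common()


print(is_valid("SAMEA7524400", BIOSAMPLE_CONSTRAINT))
print(is_valid("not-an-accession", BIOSAMPLE_CONSTRAINT))
print(value_counts(["SAMN01", "SAMN01", "SAMEA02"]))
```

A regex keeps the endpoint response small, while exported counts would let goat-cli offer exact suggestions, at the cost of a larger payload that is regenerated once per release.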