Fill missing levels in variables for `goat-cli`

When I pull the public field definitions as JSON:

curl -X 'GET' \
'https://goat.genomehubs.org/api/v0.0.1/resultFields?result=taxon&taxonomy=ncbi' \
-H 'accept: application/json' > vars.json 2> /dev/null

jq . vars.json

I can get all of the variables, which is nice. However, some variables have no `enum` in their constraint, so goat-cli cannot parse or validate their values against a known set of levels. The full list of affected variables:

bioproject
biosample
busco_lineage
in_progress
insdc_open
insdc_submitted
published
sample_acquired
sample_collected
sample_collected_by
sample_sex
sex_determination
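
For reference, a list like the one above could be derived from `vars.json` with something along these lines. This is only a sketch: the top-level `fields` key, the restriction to `keyword` fields, and the `sample_sex` entry without a constraint are assumptions made for illustration, and the embedded sample is an abbreviated stand-in for the real response.

```python
import json

# Hypothetical, abbreviated excerpt of the resultFields response.
# The top-level "fields" key and the per-field layout are assumptions
# based on the two rows shown in this issue, not a documented schema.
SAMPLE = json.loads("""
{
  "fields": {
    "biosample": {
      "group": "taxon", "type": "keyword", "constraint": {"len": 32}
    },
    "family_representative": {
      "group": "taxon", "type": "keyword",
      "constraint": {"enum": ["asg", "cbp", "dtol"]}
    },
    "sample_sex": {"group": "taxon", "type": "keyword"}
  }
}
""")


def fields_without_enum(payload):
    """Return sorted names of keyword fields whose constraint has no enum."""
    missing = []
    for name, meta in payload.get("fields", {}).items():
        if meta.get("type") != "keyword":
            continue  # only keyword fields are expected to carry an enum
        constraint = meta.get("constraint") or {}
        if not constraint.get("enum"):
            missing.append(name)
    return sorted(missing)


if __name__ == "__main__":
    print(fields_without_enum(SAMPLE))
```

To run it against the real data, replace `SAMPLE` with `json.load(open("vars.json"))`, adjusting the key names if the actual response is shaped differently.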

To be explicit, `biosample` (rendered as a markdown table) has a `len: 32` constraint but no enum values:

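| group | name | constraint | display_group | organelle | separator | source | source_url_stub | type | display_level | display_name | key | summary | traverse | traverse_direction |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| taxon | biosample | {len: 32} | assembly | nucleus | [;] | NCBI Datasets | https://www.ncbi.nlm.nih.gov/assembly/ | keyword | 2 | Biosample | biosample | [list] | list | up |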

whereas `family_representative` does have an enum:

| group | name | display_group | display_level | constraint | summary | traverse | traverse_direction | type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| taxon | family_representative | target_lists | 2 | {enum: [asg, cbp, ebpn, cfgp, dtol, ebpn, endemixit, erga, eurofish, gaga, squalomix, metainvert, vgp, agi, arg, gap, gbr, omg, tsi, b10k]} | list | list | up | keyword |

As you mentioned, @rjchallis:

"But part of the problem here is that it doesn't make sense to use an enum to restrict the input values for fields like bioproject and biosample as the potential list is so long. I think a better solution may be to apply a regex constraint on these fields, or to export the list of unique values from the index (this can be cached so only needs to be generated once per release) either as part of this endpoint or something similar to the sources report that includes counts per value."

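To make the regex option concrete, here is a rough sketch of how a client such as goat-cli might check values against a per-field pattern. The patterns are my assumptions about the usual INSDC accession formats (for example `PRJNA123456` or `SAMEA1234567`, both made-up IDs), not anything taken from the GoaT configuration, and `PATTERNS` and `is_valid` are hypothetical names.

```python
import re

# Assumed accession patterns (not taken from the GoaT configuration):
#   BioProject: PRJ + D/E/N + one letter + digits, e.g. PRJNA123456
#   BioSample:  SAM + D/E/N + optional letter + digits, e.g. SAMEA1234567
PATTERNS = {
    "bioproject": re.compile(r"PRJ[DEN][A-Z]\d+"),
    "biosample": re.compile(r"SAM[DEN][A-Z]?\d+"),
}


def is_valid(field, value):
    """Check a value against the field's regex, if the field has one."""
    pattern = PATTERNS.get(field)
    if pattern is None:
        return True  # no regex constraint known for this field
    return pattern.fullmatch(value) is not None


if __name__ == "__main__":
    for field, value in [
        ("bioproject", "PRJNA123456"),
        ("biosample", "SAMEA1234567"),
        ("biosample", "PRJNA123456"),
    ]:
        print(field, value, is_valid(field, value))
```

If the endpoint exposed such a pattern in the constraint alongside `len`, the client could compile it directly instead of hard-coding it as above.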