IHEC / ihec-ecosystems

This repo is for code and documentation associated with the ihec-ecosystems working group.

Arrays of length 1 #84

Closed · romgrk closed this issue 4 years ago

romgrk commented 4 years ago

Hey,

I was wondering why it was decided to use arrays of length 1 instead of simply the item itself? This is not optimal at all for JSON.

ping @sitag @dzerbino

sitag commented 4 years ago

@romgrk it allows us to use a single parser for all schema versions. so it's optimal in that sense.

romgrk commented 4 years ago

What's the issue with the parser exactly? Is there another way to solve it?

sitag commented 4 years ago

like i said, there's no way to solve it without writing different parsers for each version, and conceivably more for future versions.

i don't follow what the issue is here. we are not counting bytes in intermediate jsons - that would be a very premature optimization.

romgrk commented 4 years ago

We're using a tool that parses JSON schemas and turns them into UI forms. Having those elements as arrays of strings adds complexity to the generated form.

It also adds unnecessary complexity to the schema itself: everyone who reads it will wonder why a single item needs to be wrapped in an array.

sitag commented 4 years ago

there are multiple ways to extract xml to json, which is what we validate. xml allows repeated values for attributes, so we need to account for that in json. each version is also free to decide what is required to be unique. so there's no way around using arrays - unless we track what is unique at the parsing stage.
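to make that concrete, here's a minimal sketch of the extraction step (the tag names are modeled on sra-style xml, not taken from the actual code in this repo):

```python
# sketch: why extracted attributes come out as lists. xml allows the
# same tag to repeat, so a generic extractor has to collect every
# occurrence into a list; a length-1 array is just the single-occurrence case.
import xml.etree.ElementTree as ET

doc = """
<EXPERIMENT_ATTRIBUTES>
  <EXPERIMENT_ATTRIBUTE><TAG>EXPERIMENT_TYPE</TAG><VALUE>ChIP-Seq Input</VALUE></EXPERIMENT_ATTRIBUTE>
  <EXPERIMENT_ATTRIBUTE><TAG>EXPERIMENT_ONTOLOGY_URI</TAG><VALUE>http://example.org/a</VALUE></EXPERIMENT_ATTRIBUTE>
  <EXPERIMENT_ATTRIBUTE><TAG>EXPERIMENT_ONTOLOGY_URI</TAG><VALUE>http://example.org/b</VALUE></EXPERIMENT_ATTRIBUTE>
</EXPERIMENT_ATTRIBUTES>
"""

attributes = {}
for node in ET.fromstring(doc).iter("EXPERIMENT_ATTRIBUTE"):
    tag = node.findtext("TAG").lower()
    attributes.setdefault(tag, []).append(node.findtext("VALUE"))

print(attributes)
# {'experiment_type': ['ChIP-Seq Input'],
#  'experiment_ontology_uri': ['http://example.org/a', 'http://example.org/b']}
```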

us writing different parsers and maintaining them is way more work than you writing a preprocessor for your use case - especially as everything is uniform: you apply the same transformation to each hash describing an attribute as you loop over them.

while it would blow up complexity on our end, your pre-processor is fairly trivial due to the aforementioned uniformity.

we are definitely not writing version specific parsers.

romgrk commented 4 years ago

I've been talking with @dbujold and we were wondering if we could get more details on this issue. When you say writing different parsers, do you mean different parsers for the XML and JSON versions? Could you point us to the specific places in the code where this is problematic for you? And when you refer to the parser, do you mean this?

Changing the spec to use length-1 arrays isn't just problematic for my use case: it makes things more complicated for all end users. We would also need to adapt the IHEC data portal to ingest and produce this new kind of data hub, and we already have many partners using and submitting the no-arrays kind of data hub. I do understand the need for the spec to evolve, but I feel like this is not an evolution: it's pushing an implementation issue onto users.

sitag commented 4 years ago

@dbujold Not sure why we are still on this. The original implementation was ambiguous about what happens when we encounter repeated values, and we cannot go back to having that ambiguity. Also, you were on the thread where this reason was discussed. The markdown generation code is now hooked into using arrays for each attribute; reverting would undo a lot.

The simplest solution is to write a pre-processor for the jsonschema, which is very straightforward since there's a 1:1 mapping between the two. If your team does not want to implement the pre-processor, then open an issue and I will implement it.

Following this discussion has taken more time than it would take to implement the pre-processor.

@romgrk Here https://github.com/IHEC/ihec-ecosystems/blob/master/version_metadata/sraparse.py#L120-L126 , https://github.com/IHEC/ihec-ecosystems/blob/master/version_metadata/validate_experiment.py#L8-L9 , https://github.com/IHEC/ihec-ecosystems/blob/master/version_metadata/validate_sample.py#L9-L16 you will need to know which attributes the spec allows to be lists, and what to do when you encounter a list where you expect a unique value. This is version specific (not only for existing versions, but also future ones). You would need to parse the schema while you parse the xml into json. This is not just a matter of implementation complexity, it is also bad design: you are coupling components that should stay independent.
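to illustrate that coupling (the attribute names and version tags below are hypothetical, not from the spec): without uniform arrays, every consumer's xml -> json step would need a per-version table of which attributes must be unique:

```python
# hypothetical sketch of the version-specific knowledge every consumer
# would have to bake into its extractor if singletons were unwrapped.
UNIQUE_PER_VERSION = {
    "1.0": {"experiment_type", "library_strategy"},
    "1.1": {"experiment_type", "library_strategy", "molecule"},
    # ...and this table must grow with every future version
}

def coerce(tag, values, version):
    """Collapse a list of extracted values to a scalar when this
    version of the spec says the attribute must be unique."""
    if tag in UNIQUE_PER_VERSION[version]:
        if len(values) != 1:
            raise ValueError(f"{tag}: expected one value, got {values!r}")
        return values[0]
    return values
```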

I don't understand your resistance to pre-processing the schema. It solves your problem with the least amount of friction. All you do is map:

 "experiment_type" : {"type": "array", "minItems": 1, "maxItems": 1, "items": {"type": "string", "pattern": "^(ChIP-Seq Input)|(Histone H\\w+([\\./]\\w+)?)+$|(Transcription Factor)"}},   

into:

 "experiment_type" : {"type": "string", "pattern": "^(ChIP-Seq Input)|(Histone H\\w+([\\./]\\w+)?)+$|(Transcription Factor)"},

literally, element:array[type] goes to element:type whenever maxItems == minItems == 1,

for each attribute. You can write out a new schema like that, or you can do it on the fly. The world is your oyster.
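a minimal sketch of such a pre-processor (plain python; it only walks top-level properties, and it is not code from this repo):

```python
import copy
import json

def unwrap_singletons(schema):
    """Return a copy of the schema with every length-1 array property
    (minItems == maxItems == 1) replaced by its item type."""
    schema = copy.deepcopy(schema)
    props = schema.get("properties", {})
    for name, prop in props.items():
        if (prop.get("type") == "array"
                and prop.get("minItems") == prop.get("maxItems") == 1):
            props[name] = prop["items"]
    return schema

# the experiment_type mapping above, applied mechanically:
schema = {"properties": {"experiment_type": {
    "type": "array", "minItems": 1, "maxItems": 1,
    "items": {"type": "string",
              "pattern": "^(ChIP-Seq Input)|(Histone H\\w+([\\./]\\w+)?)+$|(Transcription Factor)"}}}}
print(json.dumps(unwrap_singletons(schema)["properties"]["experiment_type"], indent=2))
# {"type": "string", "pattern": "^(ChIP-Seq Input)|..."}
```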