hs-apotell opened this issue 3 years ago
Most of the changes in the UHDM schema these days are not new "types" but new "relations". At some point in the future, when the Testbench part is completed, there will be new types too. The solution you propose does not work for new relations. I don't think we should even try to maintain compatibility: you'd load an incomplete db and wonder why you are not seeing a particular type or relation you are expecting. On the other hand, we should have a version number that invalidates a given .uhdm database, so it cannot be read by a version of the tool that is different.
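A minimal sketch of what such a version gate could look like on the reader side, assuming a small fixed header at the front of the file; the UhdmHeader layout and kSchemaVersion constant are illustrative, not UHDM's actual serializer:

```cpp
#include <cstdint>
#include <fstream>
#include <iostream>
#include <stdexcept>
#include <string>

// Hypothetical schema version, bumped on every incompatible model change.
constexpr uint32_t kSchemaVersion = 42;

// Hypothetical fixed-size header written at the start of a .uhdm file.
struct UhdmHeader {
  char magic[4];           // e.g. "UHDM"
  uint32_t schemaVersion;  // must match the reader's kSchemaVersion
};

void checkHeader(std::ifstream& in, const std::string& path) {
  UhdmHeader hdr{};
  in.read(reinterpret_cast<char*>(&hdr), sizeof(hdr));
  if (!in || std::string(hdr.magic, 4) != "UHDM") {
    throw std::runtime_error(path + ": not a .uhdm database");
  }
  if (hdr.schemaVersion != kSchemaVersion) {
    // Refuse to load: the database was produced by a different tool version.
    throw std::runtime_error(path + ": schema version " +
                             std::to_string(hdr.schemaVersion) +
                             " does not match expected " +
                             std::to_string(kSchemaVersion));
  }
}

int main(int argc, char** argv) {
  if (argc < 2) return 1;
  std::ifstream in(argv[1], std::ios::binary);
  try {
    checkHeader(in, argv[1]);
    std::cout << "version check passed\n";
  } catch (const std::exception& e) {
    std::cerr << e.what() << "\n";
    return 1;
  }
  return 0;
}
```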
A little context here would have helped. My bad.
This issue came up because one of our team members is attempting to get a language server working, and he has a very basic functional prototype to showcase. It was just by accident that he was using a rather large design as a test case, and the response time was very slow. So, he decided to fall back to loading just the cache file, which worked well (for the minimum feature set he was focusing on) until he pulled and refreshed his local repository.
Long story short: if we are working with a rather large design and we know for sure that only a specific file has been modified and needs to be refreshed in the cache, can this be done safely? This thread is the result of that discussion.
So, coming back to the fast response time required for a language-server-like utility to work: what can be done to avoid having to re-parse the entire design over and over again for every keystroke?
@alaindargelas - I suggest we embed the git hash from the UHDM repository and make sure we check that?
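One possible way to wire that up, sketched under assumptions: the build generates a UHDM_GIT_HASH define (e.g. from git rev-parse HEAD) and the loader compares it against the hash the writer stored in the file. The readStoredHash helper and the first-line placement of the hash are hypothetical, not existing Surelog/UHDM code:

```cpp
// The build system (e.g. CMake) could generate this definition with
// something like: -DUHDM_GIT_HASH="\"$(git rev-parse HEAD)\""
#ifndef UHDM_GIT_HASH
#define UHDM_GIT_HASH "unknown"
#endif

#include <fstream>
#include <iostream>
#include <string>

// Hypothetical: reads the hash string the writer stored in the file.
std::string readStoredHash(std::ifstream& in) {
  std::string hash;
  std::getline(in, hash);  // assume the writer put it on the first line
  return hash;
}

bool compatible(const std::string& path) {
  std::ifstream in(path, std::ios::binary);
  if (!in) return false;
  const std::string stored = readStoredHash(in);
  // Exact-match policy: any difference in the UHDM revision invalidates
  // the cached database (the "brittle" scheme debated below).
  return stored == UHDM_GIT_HASH;
}

int main(int argc, char** argv) {
  if (argc < 2) return 1;
  std::cout << (compatible(argv[1]) ? "cache usable" : "cache stale, re-parse")
            << "\n";
  return 0;
}
```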
@hs-apotell - Caching the UHDM output and only recreating the UHDM files based on modification is a perfectly reasonable thing to do and something we would like to support (at Google we want to parse + cache every SystemVerilog file in our massive code repository https://www.freecodecamp.org/news/googles-entire-codebase-is-a-single-repository-of-2-billion-lines-of-code-ececf7354188/). Most of the time upgrading UHDM is going to be a pretty rare event.
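As a rough illustration of that per-file caching idea (not Surelog's actual cache code): store a content digest next to each cached entry and only re-parse a source file whose digest changed. The file layout and names below are assumptions, and a real implementation would likely use a stronger hash than std::hash.

```cpp
#include <filesystem>
#include <fstream>
#include <functional>
#include <iostream>
#include <sstream>
#include <string>

namespace fs = std::filesystem;

// Cheap content digest for the sketch; a real cache would probably use a
// cryptographic hash to avoid accidental collisions.
static std::size_t digestOf(const fs::path& file) {
  std::ifstream in(file, std::ios::binary);
  std::ostringstream buf;
  buf << in.rdbuf();
  return std::hash<std::string>{}(buf.str());
}

// Returns true when the cached entry can be reused for this source file.
// Layout assumption: "<cacheDir>/<name>.digest" stores the last digest.
static bool cacheIsFresh(const fs::path& src, const fs::path& cacheDir) {
  const fs::path digestFile = cacheDir / (src.filename().string() + ".digest");
  if (!fs::exists(digestFile)) return false;
  std::ifstream in(digestFile);
  std::size_t stored = 0;
  in >> stored;
  return stored == digestOf(src);
}

int main(int argc, char** argv) {
  if (argc < 3) return 1;  // usage: prog <source.sv> <cache-dir>
  const fs::path src = argv[1], cacheDir = argv[2];
  if (cacheIsFresh(src, cacheDir)) {
    std::cout << src << ": reuse cached UHDM\n";
  } else {
    std::cout << src << ": re-parse and refresh cache\n";
    fs::create_directories(cacheDir);
    std::ofstream(cacheDir / (src.filename().string() + ".digest"))
        << digestOf(src);
  }
  return 0;
}
```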
I think we should work towards making the binary UHDM format stable, in the sense that it can be read with both older and newer uhdm libraries.
The serialization can be made stable if we make sure to keep the IDs in the capnp schema fixed. I guess that means that new fields in the yaml files would always have to be added to the end as they are given incrementally (or hashes as suggested, though I think that might not be possible or will use a lot of space). We can have a CI check that makes sure that there is no shuffling of existing IDs.
The code, however, also needs to be able to deal with forward/backward compatibility. One direction is easy: a newer reader can load an older file and simply see default values for fields that did not exist yet.
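A minimal sketch of that easy direction, with a hypothetical ModuleRecord standing in for the generated capnp/UHDM types: fields an older writer never produced are treated as optional and fall back to defaults instead of failing the load.

```cpp
#include <iostream>
#include <optional>
#include <string>

// Hypothetical decoded record: fields added in newer schema revisions are
// optional, so data written by an older writer simply leaves them empty.
struct ModuleRecord {
  std::string name;                  // present since schema v1
  std::optional<int> instanceCount;  // added in a later revision
};

// Newer reader consuming an older record: missing fields get a default.
void printModule(const ModuleRecord& rec) {
  std::cout << "module " << rec.name
            << " instances=" << rec.instanceCount.value_or(0) << "\n";
}

int main() {
  ModuleRecord oldData{"top", std::nullopt};  // as written by an old tool
  ModuleRecord newData{"top", 3};             // as written by a new tool
  printModule(oldData);  // -> instances=0 (default)
  printModule(newData);  // -> instances=3
  return 0;
}
```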
I would advise against a scheme in which we can only read a very specific version (e.g. checking the git hash); that makes things very brittle. We should be able to read a wide range of different versions.
@hzeller - I don't think UHDM should be used for any type of long term storage and the extra complexity of dealing with backwards and forwards compatibility doesn't provide much value here.
Not really long term storage, just good enough to be stable a few versions before and after. And it is very simple to achieve, so it will not take much engineering time to make it happen.
My experience has been that forward and backwards compatibility needs a large amount of testing to verify it works, otherwise it is most certainly broken and produces a huge amount of weird bugs which take significant time to debug.
In my experience with code (at a large organization I am part of that uses protocol buffers :) ), I almost never ran into that problem when things are designed carefully. So it is doable, but yes, it requires unit tests.
I think our colors are showing here -- @mithro == SRE, @hzeller == SWE -- ;-)
I suggest we split this issue in two: one for backward/forward compatibility (keep this thread), the other for the language server.
I created #1712 for the language server
Most of the time upgrading UHDM is going to be a pretty rare event.
This might be true for Google but making that a standard is shortsighted and dangerous.
I would advise against a scheme in which we can only read a very specific version (e.g. checking the git hash); that makes things very brittle.
+1 to that. I don't think limiting to a specific version is a scalable solution. Parsing text is always going to be slow. If we can make the binary format dependable that would be a huge plus.
My experience has been that forward and backwards compatibility needs a large amount of testing to verify it works, otherwise it is most certainly broken and produces a huge amount of weird bugs which take significant time to debug.
I agree with your point that backward/forward compatibility does need a large amount of work. In our past projects, we have avoided compatibility-related bugs by putting the user in charge of verifying against the latest release. A bug is valid only if it happens on the latest release, and confirming the issue is always the responsibility of the user (and not the developer). Patches on past/old releases were never a thing on our projects.
I guess that means that new fields in the yaml files would always have to be added to the end as they are given incrementally (or hashes as suggested, though I think that might not be possible or will use a lot of space).
If all fields in the yaml files are hashes then we don't have to be constrained to appending only. Moreover, in my opinion, your point about space is toothless - compared to the time it takes to parse large designs, adding a few extra bytes per entry would always be a good compromise. For a design with ~180 files the cache file is about ~28 MB, and even if it grows by 50% that would still be within an acceptable range. It takes nearly ~6 mins to parse this design.
We can have a CI check that makes sure that there is no shuffling of existing IDs.
Rather than looking for shuffling, the process should look for hash collisions within the class hierarchy.
Regarding the use of hashes - hashes are a good solution if done right, with proper checks in place to identify collisions and notify the user of the action needed to resolve the issue.
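A hedged sketch of what such a collision check could look like; the class-name list is a placeholder subset and the 16-bit FNV-1a hash is just one possible choice, not what UHDM actually uses:

```cpp
#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// FNV-1a truncated to 16 bits, standing in for whatever hash would
// actually be used to derive stable type IDs from class names.
static uint16_t hashName(const std::string& name) {
  uint32_t h = 2166136261u;
  for (unsigned char c : name) {
    h ^= c;
    h *= 16777619u;
  }
  return static_cast<uint16_t>(h ^ (h >> 16));
}

int main() {
  // Placeholder subset of class names; CI would feed the full model here.
  const std::vector<std::string> classes = {
      "module_inst", "port", "net", "assignment", "gen_scope"};

  std::unordered_map<uint16_t, std::string> seen;
  bool collision = false;
  for (const auto& name : classes) {
    const uint16_t id = hashName(name);
    auto [it, inserted] = seen.emplace(id, name);
    if (!inserted) {
      collision = true;
      std::cerr << "hash collision: " << name << " vs " << it->second
                << " (id " << id << ") -- rename or widen the hash\n";
    }
  }
  return collision ? 1 : 0;  // non-zero exit fails the CI job
}
```

Run as part of CI, a non-zero exit would block any model change that introduces a collision, which is the kind of check being asked for here.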
Is the generated cache file backward compatible? i.e. if a *.uhdm file is generated using one version of the binary, can that same file be loaded with a newer version of the binary? I presume the answer is "No", because if new classes/properties/refs (basically, any change in the model file) were introduced in UHDM, the uhdmxxxx IDs will reshuffle, losing the ability to load cache files generated with older binaries. So, the next question is - Do these IDs have to be sequential? Is that a requirement as per the standard? If not, why not make these IDs some sort of hash of their corresponding names to maintain compatibility across binaries?
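To make the trade-off concrete, a small hedged comparison (class names and the 16-bit hash are illustrative only): sequential IDs shift as soon as a class is inserted into the model, while name-derived IDs of existing classes stay unchanged.

```cpp
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Same illustrative 16-bit FNV-1a as above: the ID depends only on the name.
static uint16_t idFromName(const std::string& name) {
  uint32_t h = 2166136261u;
  for (unsigned char c : name) { h ^= c; h *= 16777619u; }
  return static_cast<uint16_t>(h ^ (h >> 16));
}

static void dump(const std::vector<std::string>& model) {
  for (std::size_t i = 0; i < model.size(); ++i) {
    std::cout << model[i] << "  sequential=" << i + 1
              << "  hashed=" << idFromName(model[i]) << "\n";
  }
}

int main() {
  const std::vector<std::string> v1 = {"module_inst", "port", "net"};
  const std::vector<std::string> v2 = {"module_inst", "interface_inst",
                                       "port", "net"};

  std::cout << "schema v1:\n";
  dump(v1);
  // Inserting a class renumbers everything after it in the sequential
  // scheme, but leaves the name-derived IDs of existing classes unchanged.
  std::cout << "schema v2 (one class inserted):\n";
  dump(v2);
  return 0;
}
```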