Open Adhara3 opened 1 week ago
I scanned a bit more through the CAN definition and I'm not completely sure, but I think there is a hidden rule.
So `BO_` and `SG_` don't need a line terminator, but their definitions do not include any other keywords. All definitions that can include another keyword, like a comment or `BA_DEF_DEF_`, do require a line terminator.
If that's true, you can read up to the next keyword or up to `;`, depending on the definition currently being parsed.
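If that hidden rule holds, a tokenizer could switch its termination condition per keyword. A toy Python sketch (the keyword list and its split into the two classes are illustrative assumptions, not taken from the DBC spec):

```python
import re

# Toy sketch of the "hidden rule": some definitions end at the next keyword,
# the others end at ';'. Keyword lists here are illustrative only.
KEYWORDS = ["BU_", "BO_", "SG_", "CM_", "VAL_", "BA_DEF_DEF_"]
SELF_TERMINATING = {"BU_", "BO_", "SG_"}  # assumed: no ';' required
KEYWORD_RE = re.compile("|".join(KEYWORDS))

def next_definition(text, pos):
    """Return (keyword, body, next_pos) for the definition starting at pos."""
    match = KEYWORD_RE.search(text, pos)
    if not match:
        return None
    keyword = match.group()
    start = match.end()
    if keyword in SELF_TERMINATING:
        # read up to the next keyword (or end of text)
        nxt = KEYWORD_RE.search(text, start)
        end = next_pos = nxt.start() if nxt else len(text)
    else:
        # read up to the terminating ';'
        end = text.index(";", start)
        next_pos = end + 1
    return keyword, text[start:end].strip(), next_pos
```

On `BU_: Engine Gateway BO_ 100 Msg: 8 ECU CM_ "a note";` this yields `BU_` ending at the next keyword, then `BO_`, then the `;`-terminated `CM_`, regardless of where newlines happen to sit.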
I found some examples, especially this one, which uses ANTLR and thus contains a grammar file. The comments in this grammar file are also interesting, as is this note (I felt less alone).
Unfortunately, I couldn't test the grammar yet (I'm struggling with Java) and it doesn't work in ANTLR Lab either. The only thing I know is that going down this path would require adding a third-party NuGet dependency to DbcParserLib in order to work, which may be a bad thing.
ASAM MCD-2 MC interface description files (*.a2l) if the target platform shall be connected to a measurement and calibration tool like ETAS INCA or Vector Informatik CANape
That was asked here in #49
Another interesting concept is here, where it says:
The only additional reusable piece of code is the class SemanticCheckListener, which performs a (simple) semantic validation of the parse result. The antlr4 grammar does not check all static elements of the syntax definition. Forcing the grammar to do all of this would sometimes be too cumbersome. To give a single example: For obscure reasons, the name of an attribute is defined to be an identifier enclosed in double quotes - a syntax construct, which conflicts with the common character string. The grammar accepts any double quote enclosed character string for attribute name and only SemanticCheckListener will validate that the characters indeed form an identifier
which means that some checks can be performed after the pure parsing phase and it's fine to do so.
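The same two-phase idea can be sketched in a few lines (Python for illustration; the regexes are hypothetical, not the ones the quoted project uses):

```python
import re

# Phase 1 accepts any double-quoted string as an attribute name;
# phase 2 (the "semantic check") verifies it is actually an identifier.
QUOTED_STRING = re.compile(r'"([^"]*)"')               # permissive syntax rule
IDENTIFIER = re.compile(r'[A-Za-z_][A-Za-z0-9_]*\Z')   # C-style identifier

def parse_attribute_name(text):
    """Pure parsing: no semantic validation here."""
    match = QUOTED_STRING.match(text)
    return match.group(1) if match else None

def is_valid_attribute_name(name):
    """Post-parse semantic check, run on the parse result."""
    return IDENTIFIER.match(name) is not None
```

The parser stays simple and permissive, and the stricter rule lives in a separate pass that can produce a much better error message than a grammar mismatch would.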
I also checked cantools, which is written in Python and parses the file taking advantage of textparser, a Python lib for grammar-based parsing, so a similar flavour. This file contains the grammar definition for the textparser lib.
Bottom line: mixed feelings. On one hand this seems the preferred direction to avoid manual stuff; on the other hand it requires new skills and code dependencies, so in a way we lose a bit of control over the code.
I tried to have a look at whether some library could help me out, and I stumbled upon Sprache. It was fun because this article contains a list of possible approaches which is exactly the list above. Sure, this would also add a dependency, but we would keep more control over the code without reinventing the wheel.
Still a work in progress, I'll keep you posted
A
There's no problem in using a third party library as long as the license is compatible ;)
I tried some stuff with Sprache. My aim was to get definitions from a text file: not parsing the definitions themselves, just extracting them and then using the regex parsers to do the rest. I only used some dummy texts, but as a PoC it should work.
What I like is that you can do order-independent parsing. For example, you could parse the line for a value table before the actual value table definition and then do the real parsing in the right order so it doesn't lead to a fault.
What I don't like is that invalid syntax in one definition fails everything, so the whole parse fails. That might be fixable, but I'm not sure.
I would like to know how other parsers work. What's the failing condition? Do they load what they can read, like we do, or just fail completely? What do other parsers do for error handling? I checked cantools in Python and they seem to fail out completely too if they can't parse the file.
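The order-independent idea above can be sketched as a two-pass scheme (Python for illustration; the keyword order and parser hooks are invented):

```python
from collections import defaultdict

# Pass 1: bucket raw definition lines by keyword without parsing them.
# Pass 2: hand each bucket to its parser in a fixed, dependency-safe order,
# so e.g. VAL_TABLE_ is parsed before the definitions that reference it.
PROCESS_ORDER = ["VAL_TABLE_", "BO_", "SG_", "VAL_"]  # illustrative order

def bucket_definitions(lines):
    buckets = defaultdict(list)
    for line in lines:
        parts = line.split(None, 1)
        if parts and parts[0] in PROCESS_ORDER:
            buckets[parts[0]].append(line)
    return buckets

def parse_in_order(lines, parsers):
    buckets = bucket_definitions(lines)
    for keyword in PROCESS_ORDER:
        for raw in buckets[keyword]:
            parsers[keyword](raw)  # real parsing happens here, in order
```

With this split, file order no longer matters: a `VAL_TABLE_` appearing after the message that uses it is still processed first.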
Test parsing for a one-line message definition (this time full parsing): https://pastebin.com/pSUG7MGG
I wanted to parse the factor and offset of signals as double, but that doesn't actually fail: it just parses the message with empty signals. So it doesn't always fail completely, only at the top level; in this case, if the message itself fails parsing, this would throw an exception. What's missing is a way to debug this: as I said, parsing factor and offset as double doesn't work and I currently have no idea how to debug that.
Hi,
I just pushed in a new branch some code that works with a new parser and tokenizer.
[!NOTE] This is by no means complete!!!!
Still a lot of work to be done. I just concentrated on having some end-to-end code for a couple of keywords to see if the concept could work. Yes, I'm still returning null here and there, but that's not the point.
I would be glad if you could give some feedback or even try to implement one of the missing parsers on your own. It doesn't have to be a complete, bulletproof implementation, just enough to see how it goes, if you feel comfortable.
The main points of interest are:
- `TextBrowser`, which is the core of the new traversing mode
- `ChainBuilder`, which was a naive way to implement something easy to use for a first rough parsing that looked friendly and somewhat similar to Sprache (e.g. fluent)

`;` at the end of one of the comment lines. The multiline `VAL_` should also be easy to implement.

Let me know if you feel this is a good path to follow
Cheers A
[!IMPORTANT] The parsing code should be quite user friendly to allow easy extensibility in case of need, but it's not part of the public API, so we can cut some corners or refactor later for improvement without affecting the user experience. Moreover, I would favor parsing correctness over everything else.
My 2 cents A
Foreword
When we began contributing to this library, we assumed that this way of writing a dbc had a meaning.
A parser that goes line by line seemed the best solution to achieve an efficient file read and easy parsing code. Time proved us wrong: newlines, like indentation, are just for human readability, nothing more, as highlighted in issues #61 and #82. This brings several problems:
Both issues are the consequence of considering the newline as meaningful for text buffering. It is not. It should instead be treated as normal spacing, just like a tab or a single space character. Ideally we should be able to parse a dbc that lies on a single line, or if you will, this
Keep this in mind as a mental test/proof.
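That mental test can be made concrete with a toy check (Python, not the library's code): if newlines are just spacing, the multi-line and single-line forms must produce the same token stream.

```python
import re

def tokens(text):
    # Treat any run of whitespace (spaces, tabs, newlines) identically.
    return re.split(r"\s+", text.strip())

multi_line = "BU_: Engine Gateway\nBO_ 100 Msg: 8 Engine"
one_line = "BU_: Engine Gateway BO_ 100 Msg: 8 Engine"
# Both forms tokenize to the same sequence, so a correct parser
# must accept them interchangeably.
```

Note that this naive `\s+` split would also break apart quoted strings containing spaces (comments, for instance), which is exactly why real tokenization needs more care than a whitespace split.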
Strategies
I have been thinking for a while now about how to solve this, and I came up with 3 possible ways that we will test in the next weeks to find the best candidate with respect to the following (ordered) criteria:
Approach 1: Try to keep regex parsers, change text provider
Today we have something like
which means that it does not support partial read. In the one-line example above, the `BU_` parser may return true, but anything from `BO_` onwards may be ignored or parsed the wrong way. Keep in mind that regex `\s` includes newline. The idea with this approach is to keep regex parsing as it is today (at least most of it) while improving its application, by notifying if the match was just for a substring and basically keeping track of the index. Probably we won't be able to keep `INextLineProvider` as it is right now. The hope is to satisfy criteria 2 and 1.

Approach 2: Switch to a lexer/parser with standard tools
This means a complete rewrite, but it would shift the scope from building a text parser to maintaining a lexer grammar using standard, well documented tools (take a look at this). With this I hope to reduce code complexity in the codebase and allow extensions through the grammar definition alone, sacrificing criteria 2 as an investment in the future. This approach is also used in other tools (here).
Approach 3: Manual parsing, decouple data buffering and read
This is a more manual approach, but it starts from the idea that dbc is very simple: it's basically a bunch of keywords and some space-separated text. The idea is: I have something to handle text/buffering, a bit more complicated than the current `INextLineProvider`, something more like a `TextReader` but with convenient methods for searching/scanning. As an example: first I need to find the first keyword (`BU_`) and then delegate the handling to its parser. The parser would ask the data reader something like

which means "give me the next substring composed of alphanumeric chars", but could also have
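As a rough illustration only, such a scanning reader could look like the following Python sketch (the method names `read_while`/`read_until` are invented here, not DbcParserLib's actual API):

```python
# Hypothetical sketch of the Approach 3 data reader: it hands substrings
# back to the parsers and knows nothing about lines.
class TextScanner:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def skip_spaces(self):
        while self.pos < len(self.text) and self.text[self.pos].isspace():
            self.pos += 1  # newline is just another space here

    def read_while(self, predicate):
        """Give me the next substring whose chars satisfy predicate."""
        self.skip_spaces()
        start = self.pos
        while self.pos < len(self.text) and predicate(self.text[self.pos]):
            self.pos += 1
        return self.text[start:self.pos]

    def read_until(self, delimiter):
        """Read up to (and consume) delimiter; the body may span lines."""
        self.skip_spaces()
        end = self.text.index(delimiter, self.pos)
        body = self.text[self.pos:end]
        self.pos = end + len(delimiter)
        return body
```

A comment parser, for example, would call `read_until('"')` twice and get back a body that may well contain newlines.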
The important thing to note here is that the parser has the responsibility to know which chars are delimiters for its own parsing, while the data provider gives substrings back and no one knows about lines. So a comment could request a sort of read from `"` to `"`, and the substring coming back is the comment body, which could well contain newlines. I expect this to work fine for well-delimited keywords (i.e. those that require `;` at the end), and to be a bit more of a struggle for those that don't (e.g. `BU_: Engine Gateway BO_ 100`: how do I decide that `BO_` here is not a node but a new starting point?). That would solve multiline support, of course (`VAL_` or whatever would look for a final `;`, not caring or even knowing whether it lies on the same or a different line). Under the hood the file could still be read line by line to help in case of syntax errors, but this becomes an implementation detail for parsers (the decoupling I want).

Bottom line
Any feedback is welcome. I would personally start from strategies 1 and 3, just because they are code only, while 2 requires external tools I'm not very experienced with. We can develop the new parsers almost side by side with the current one, since I expect the `IDbcBuilder` to stay valid. We should simply add static methods to the `Parser` class (e.g. `ParseWithAdvancedRegex()` or `ParseWithManualSeek()`, stuff like that). This would allow testing (unit but also manual) and performance comparison, given we have the benchmarking test project now.

Thanks A