EFeru / DbcParser

.NET CAN bus dbc file parser
MIT License

Parser rewrite to be more robust and not influenced by indenting and new lines. #88

Adhara3 opened this issue 1 week ago

Adhara3 commented 1 week ago

Foreword

When we began contributing to this library, we assumed that this way of writing a dbc had meaning:

```
BU_: Engine Gateway

BO_ 100 EngineData: 8 Engine
 SG_ PetrolLevel : 24|8@1+ (1,0) [0|255] "l" Gateway
 SG_ EngPower : 48|16@1+ (0.01,0) [0|150] "kW" Gateway
 SG_ EngForce : 32|16@1+ (1,0) [0|0] "N" Gateway
 SG_ IdleRunning : 23|1@1+ (1,0) [0|0] "" Gateway
 SG_ EngTemp : 16|7@1+ (2,-50) [-50|150] "degC" Gateway
 SG_ EngSpeed : 0|16@1+ (1,0) [0|8000] "rpm" Gateway
```

A parser that goes line by line seemed the best solution: an efficient file read and simple parsing code. Time proved us wrong: newlines, like indentation, exist only for human readability, nothing more, as highlighted in issues #61 and #82. This leads to several problems:

  1. One line may not contain a complete keyword body but only a part of it (multiline support missing)
  2. One line may contain more than one complete keyword body

Both issues are a consequence of treating the newline as meaningful for text buffering. It is not. It should instead be treated as plain spacing, just like a tab or a single space character. Ideally, we should be able to parse a dbc that lies on a single line or, if you will, this:

```
BU_: Engine Gateway BO_ 100 EngineData: 8 Engine SG_ PetrolLevel : 24|8@1+ (1,0) [0|255] "l" Gateway SG_ EngPower : 48|16@1+ (0.01,0) [0|150] "kW" Gateway SG_ EngForce : 32|16@1+ (1,0) [0|0] "N" Gateway SG_ IdleRunning : 23|1@1+ (1,0) [0|0] "" Gateway SG_ EngTemp : 16|7@1+ (2,-50) [-50|150] "degC" Gateway SG_ EngSpeed : 0|16@1+ (1,0) [0|8000] "rpm" Gateway
```

Keep this in mind as a mental test/proof.

[!WARNING] It must be clear that the aim is not solving multiline support in isolation (which would be impossible to do completely with the current parser) but changing the approach to source reading and parsing, to unlock several missing features, among which is multiline support. We are not cutting any corners here.

Strategies

I have been thinking for a while now about how to solve this, and I came up with 3 possible approaches that we will test in the next weeks to find the best candidate with respect to the following (ordered) criteria:

  1. Code readability, maintainability and testability
  2. Migration effort
  3. Performance (within reason, of course: we won't tolerate a parser that takes 2 minutes, but the idea is not to fight too much over hundreds of milliseconds)

Approach 1: Try to keep regex parsers, change text provider

Today we have something like

```cs
bool Parse(INextLineProvider provider);
```

which means it does not support partial reads. In the one-line example above, the BU_ parser may return true, but anything from BO_ onwards may be ignored or parsed the wrong way. Keep in mind that the regex \s class includes the newline. The idea with this approach is to keep regex parsing as it is today (at least most of it) while improving how it is applied: notify whether the match covered just a substring, and basically keep track of the index. We probably won't be able to keep INextLineProvider as it is right now. The hope is to satisfy criteria 2 and 1.
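
To make this concrete, here is a minimal sketch of the idea (all names are hypothetical, nothing of this exists yet): the regex is applied at the current position of a cursor over the whole text, and the parser consumes exactly the matched substring instead of whole lines.

```cs
using System.Text.RegularExpressions;

// Hypothetical replacement for INextLineProvider: a cursor over the whole dbc
// text. Since \s in the existing patterns already swallows newlines, once the
// matching is position-based, line boundaries stop mattering.
public class DbcTextCursor
{
    private readonly string _text;
    public int Position { get; private set; }

    public DbcTextCursor(string text) => _text = text;

    // Matches the pattern exactly at the current position and, on success,
    // advances by the length of the match (the "index tracking" above).
    public bool TryMatch(Regex regex, out Match match)
    {
        match = regex.Match(_text, Position);
        if (!match.Success || match.Index != Position)
            return false;

        Position += match.Length;
        return true;
    }
}
```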

Approach 2: Switch to a lexer/parser with standard tools

This means a complete rewrite, but it would shift the scope from building a text parser to maintaining a lexer grammar using standard, well-documented tools (take a look at this). With this I hope to reduce code complexity in the codebase and allow extensions via the grammar definition alone, sacrificing criterion 2 as an investment in the future. This approach is also used in other tools (here).
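
Just to give a flavour of the C# side (hypothetical: it assumes a grammar file named Dbc.g4, from which the ANTLR tooling would generate DbcLexer/DbcParser classes, and a top-level rule called dbcFile), the library code would shrink to wiring plus a listener/visitor:

```cs
using Antlr4.Runtime;

public static class AntlrSketch
{
    public static void ParseSketch(string dbcText)
    {
        var input = new AntlrInputStream(dbcText);
        var lexer = new DbcLexer(input);         // generated from Dbc.g4 (assumed name)
        var tokens = new CommonTokenStream(lexer);
        var parser = new DbcParser(tokens);      // generated from Dbc.g4 (assumed name)
        var tree = parser.dbcFile();             // assumed top-level grammar rule
        // A listener/visitor walking 'tree' would then feed the existing IDbcBuilder.
    }
}
```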

Approach 3: Manual parsing, decouple data buffering and read

This is a more manual approach, but it starts from the idea that dbc is very simple: it's basically a bunch of keywords and some space-separated text. The idea: I have something to handle text/buffering that is a bit more sophisticated than the current INextLineProvider, something more like a TextReader but with convenient methods for searching/scanning. As an example: first I need to find the first keyword (BU_) and then delegate the handling to its parser. The parser would ask the data reader for something like

```cs
// Pseudocode
string GetNextToken();
```

which means "give me the next substring composed of alphanumeric characters", but we could also have

```cs
// Pseudocode
string GetNext(Predicate<char> startChar, Predicate<char> endChar);
void SkipWhile(Predicate<char> skip);
```

The important thing to note here is that the parser has the responsibility of knowing which chars are delimiters for its own parsing, while the data provider gives substrings back and no one knows about lines. So a comment parser could request a sort of read from " to ", and the substring coming back is the comment body, which could well contain newlines. I expect this to work fine for well-delimited keywords (i.e. those that require a ; at the end) and to be a bit more of a struggle for those that don't (e.g. in BU_: Engine Gateway BO_ 100, how do I decide that BO_ here is not a node but a new starting point?). This would solve multiline support, of course (VAL_ or whatever would look for a final ;, not caring or even knowing whether it lies on the same or a different line). Under the hood, the file could still be read line by line to help in case of syntax errors, but this becomes an implementation detail hidden from parsers (the decoupling I want).
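
A rough sketch of what such a reader could look like (names are made up, nothing of this exists yet); note how GetNextToken treats '\n' as just another whitespace character:

```cs
using System;
using System.IO;
using System.Text;

public class TokenTextReader
{
    private readonly TextReader _reader;

    public TokenTextReader(TextReader reader) => _reader = reader;

    public void SkipWhile(Predicate<char> skip)
    {
        int c;
        while ((c = _reader.Peek()) != -1 && skip((char)c))
            _reader.Read();
    }

    // Next run of characters accepted by the predicate; the reader never
    // exposes lines, so tokens are free to span them.
    public string TakeWhile(Predicate<char> take)
    {
        var sb = new StringBuilder();
        int c;
        while ((c = _reader.Peek()) != -1 && take((char)c))
            sb.Append((char)_reader.Read());
        return sb.ToString();
    }

    public string GetNextToken()
    {
        SkipWhile(char.IsWhiteSpace); // '\n' is plain spacing here
        return TakeWhile(c => !char.IsWhiteSpace(c));
    }
}
```

A comment parser, for example, would skip up to the opening ", then TakeWhile(c => c != '"'), and the body it gets back may well contain newlines without the parser ever knowing.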

Bottom line

Any feedback is welcome. I would personally start from strategies 1 and 3, just because they are code-only, while strategy 2 requires external tools I'm not very experienced with. We can develop the new parsers almost side by side with the current one, since I expect the IDbcBuilder to stay valid. We should simply add static methods to the Parser class (e.g. ParseWithAdvancedRegex(), ParseWithManualSeek(), stuff like that). This would allow testing (unit but also manual) and performance comparison, given we have the benchmarking test project now.
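
Concretely, the side-by-side idea could look like this (just a sketch; the candidate names are the ones suggested above, and Dbc stands for the library's existing result type, stubbed here so the sketch compiles):

```cs
using System;

public class Dbc { /* stands in for the existing result type */ }

public static class Parser
{
    // The existing line-based entry point stays untouched during the experiment.
    // public static Dbc Parse(string dbcText) { ... }

    // Candidate entry points, living next to the current one so unit tests and
    // the benchmarking project can compare all of them on the same files.
    public static Dbc ParseWithAdvancedRegex(string dbcText) => throw new NotImplementedException();
    public static Dbc ParseWithManualSeek(string dbcText) => throw new NotImplementedException();
}
```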

[!NOTE] I would avoid a philosophical discussion here; I'd rather be very pragmatic. Time is short, this is not a job but free-time help, so I need to see things running.

Thanks A

Uight commented 6 days ago

I scanned a bit more through the CAN definition and I'm not completely sure, but I think there is a hidden rule.

So BO_ and SG_ don't need a line termination, but their definitions cannot include any other keyword. All definitions that can include another keyword, like a comment or BA_DEF_DEF_, do require a line termination.

If that's true, you can read up to the next keyword or up to the ;, depending on the definition currently being parsed.
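
If that rule holds, the termination choice per parser could be as small as a lookup. A sketch (the keyword split below is my guess from reading the definition, not verified against the spec):

```cs
using System.Collections.Generic;

public static class DefinitionTermination
{
    // Definitions whose body may legally contain other keywords
    // (comments, attribute defaults, value descriptions, ...) end with ';'.
    private static readonly HashSet<string> SemicolonTerminated = new()
    {
        "CM_", "BA_", "BA_DEF_", "BA_DEF_DEF_", "VAL_", "VAL_TABLE_"
    };

    // BU_, BO_, SG_ and friends instead read up to the next top-level keyword.
    public static bool ReadsToSemicolon(string keyword) =>
        SemicolonTerminated.Contains(keyword);
}
```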

Adhara3 commented 5 days ago

Approach 2: Update

I found some examples, especially this one, which uses ANTLR and thus contains a grammar file. The comments in this grammar file are interesting, as is this note (I felt less alone).

Unfortunately, I couldn't test the grammar yet (I'm struggling with Java) and it doesn't work in the ANTLR Lab either. The only thing I know is that going down this path would require adding a third-party nuget dependency to DbcParserLib, which may be a bad thing.

> ASAM MCD-2 MC interface description files (*.a2l) if the target platform shall be connected to a measurement and calibration tool like ETAS INCA or Vector Informatik CANape

That was asked here in #49

Another interesting concept is here, where it says:

> The only additional reusable piece of code is the class SemanticCheckListener, which performs a (simple) semantic validation of the parse result. The antlr4 grammar does not check all static elements of the syntax definition. Forcing the grammar to do all of this would sometimes be too cumbersome. To give a single example: For obscure reasons, the name of an attribute is defined to be an identifier enclosed in double quotes - a syntax construct, which conflicts with the common character string. The grammar accepts any double quote enclosed character string for attribute name and only SemanticCheckListener will validate that the characters indeed form an identifier

which means that some checks can be performed after the pure parsing phase and it's fine to do so.
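
Translated to this library, that suggests a two-phase shape (hypothetical sketch): accept any double-quoted string at parse time, then let a later pass check that an attribute name is indeed an identifier.

```cs
using System.Text.RegularExpressions;

public static class SemanticChecks
{
    private static readonly Regex Identifier =
        new(@"^[A-Za-z_][A-Za-z0-9_]*$", RegexOptions.Compiled);

    // Runs after parsing: the grammar only guaranteed "a quoted string",
    // this pass upgrades that to "a quoted identifier".
    public static bool IsValidAttributeName(string name) => Identifier.IsMatch(name);
}
```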

I also checked cantools, which is in Python and parses the file using textparser, a Python lib for parsing with a defined grammar, so a similar flavour. This file contains the grammar definition for the textparser lib.

Bottom line: mixed feelings. On one hand this seems the preferred direction to avoid manual stuff; on the other hand it requires new skills and code dependencies, so in a way we lose a bit of control over the code.

Approach 3: Update

I tried to have a look at whether some library could help me out, and I stumbled upon Sprache. It was fun, because this article contains a list of possible approaches which is exactly the list above. Sure, this would also add a dependency, but we would keep more control over the code without reinventing the wheel.
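
To give an idea of the flavour, here is a tiny Sprache sketch (the record and the names are made up for the example) that parses a BO_ header. Token() eats surrounding whitespace, and Sprache's whitespace includes newlines, so the one-line mental test above comes for free:

```cs
using Sprache;

public record MessageHeader(uint Id, string Name, int Size, string Transmitter);

public static class SpracheSketch
{
    private static readonly Parser<string> Identifier =
        Parse.Identifier(Parse.Letter.Or(Parse.Char('_')),
                         Parse.LetterOrDigit.Or(Parse.Char('_'))).Token();

    // Parses e.g. "BO_ 100 EngineData: 8 Engine", whatever the spacing in between.
    public static readonly Parser<MessageHeader> BoHeader =
        from keyword in Parse.String("BO_").Token()
        from id in Parse.Number.Token()
        from name in Identifier
        from colon in Parse.Char(':').Token()
        from size in Parse.Number.Token()
        from transmitter in Identifier
        select new MessageHeader(uint.Parse(id), name, int.Parse(size), transmitter);
}

// Usage: var header = SpracheSketch.BoHeader.Parse("BO_ 100 EngineData: 8 Engine");
```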

Still a work in progress, I'll keep you posted

A

Uight commented 5 days ago

There's no problem in using a third-party library as long as the license is compatible ;)

Uight commented 5 days ago

I tried some stuff with Sprache. My aim was to extract the definitions from a text file, not to parse the definitions themselves: just get them, and then use the regex parsers to do the rest. I only used some dummy texts, but as a PoC it should work.

What I like is that you can do order-independent parsing. For example, you could pick up the line for a value table before the actual value table definition and then do the real parsing in the right order, so it doesn't lead to a fault.

What I don't like is that if I have invalid syntax in one definition, everything fails, so the whole parse fails. That might be fixable, but I'm not sure.

I would like to know how other parsers work. What's the failing condition? Do they load what they can read, like we do, or just fail completely? What do other parsers do for error handling? I checked cantools in Python, and it seems to fail out completely too if it can't parse the file.

Test parsing for a one-line message definition (this time full parsing): https://pastebin.com/pSUG7MGG

I wanted to parse the factor and offset of signals as double, but that would actually not fail: it would just parse the message with empty signals. So it doesn't always fail completely, only at the top level; in this case, if the message itself fails to parse, you get an exception. What is missing is a way to debug this: like I said, parsing factor and offset as double doesn't work, and I currently have no idea how to debug that.
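
For what it's worth, one way the double problem could be approached (an assumption on my side, not tested against the pastebin code): parse the text with DecimalInvariant plus an optional sign and convert with the invariant culture, so "0.01" doesn't break on comma-decimal locales. Using TryParse instead of Parse also helps with the debugging part, since the returned IResult carries the failure position and the expectations.

```cs
using System.Globalization;
using Sprache;

public static class FactorOffsetSketch
{
    // Sprache's DecimalInvariant matches digits with an optional '.'; the sign
    // is handled by hand, and conversion uses the invariant culture on purpose.
    private static readonly Parser<double> SignedDouble =
        from sign in Parse.Char('-').Optional()
        from number in Parse.DecimalInvariant
        select double.Parse((sign.IsDefined ? "-" : "") + number, CultureInfo.InvariantCulture);

    // "(factor,offset)" as it appears in an SG_ line, e.g. "(0.01,0)".
    public static readonly Parser<(double Factor, double Offset)> FactorOffset =
        from open in Parse.Char('(').Token()
        from factor in SignedDouble
        from comma in Parse.Char(',').Token()
        from offset in SignedDouble
        from close in Parse.Char(')').Token()
        select (factor, offset);
}

// Debugging aid: FactorOffsetSketch.FactorOffset.TryParse("(0.01,x)") returns an
// IResult whose Message and Remainder show what was expected and where it stopped.
```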

Adhara3 commented 4 days ago

Hi,

I just pushed some code with a new parser and tokenizer to a new branch.

[!NOTE] This is by no means complete!!!!

There is still a lot of work to be done; I just concentrated on having some end-to-end code for a couple of keywords, to see if the concept could work. Yes, I'm still returning null here and there, but that's not the point.

I would be glad if you could give feedback, or even try to implement one of the missing parsers on your own, if you feel comfortable. It doesn't have to be a complete, bulletproof implementation, just enough to see how it goes.

The main points of interest are:

Let me know if you feel this is a good path to follow

Cheers A

Adhara3 commented 4 days ago

[!IMPORTANT] The parsing code should be quite user friendly to allow easy extensibility in case of need, but it's not part of the public API, so we can cut some corners or refactor later for improvements without affecting the user experience. Moreover, I would favor parsing correctness over everything else.

My 2 cents A