Unwanted characters received after parsing

uNouss commented 1 year ago

Hi, I have this file:

      PROGRAM MYPROG
         IF (I .LT. 2) THEN
            I = 3
         END IF
      END

When I parse it I get this result and there are two of these two characters :| that appear in the resulting AST.

Here the AST

```hs ProgramFile {programFileMeta = MetaInfo {miVersion = Fortran77, miFilename = "./fortran77.f"}, programFileProgramUnits = [PUMain () (1:7)-(5:9) (Just "myprog") [BlIf () (2:10)-(4:15) Nothing Nothing ((ExpBinary () (2:14)-(2:21) LT (ExpValue () (2:14)-(2:14) (ValVariable "i")) (ExpValue () (2:21)-(2:21) (ValInteger "2" Nothing)), [BlStatement () (3:13)-(3:17) Nothing (StExpressionAssign () (3:13)-(3:17) (ExpValue () (3:13)-(3:13) (ValVariable "i")) (ExpValue () (3:17)-(3:17) (ValInteger "3" Nothing)))]) :| []) Nothing Nothing] Nothing]} ```

The command I used for compilation: fortran-src -v77l file.f and fortran-src file.f

EDIT

I don't know if those characters are there by design.

As I understand it, this is the constructor of a non-empty list.

But is it necessary for this to appear in the AST?

raehik commented 1 year ago

That is the AST as displayed by its default Show instance. Non-empty lists are Shown with their constructor. We could override this with a newtype over non-empty lists, but it wouldn't be ideal. Show instances are usually best as debug tools anyway.

Are you perhaps parsing this AST representation? If so you might look at serializing to JSON via fortran-src-extras, then parsing that. Non-empty lists are serialized to regular JSON lists there (and you can parse structure instead of syntax).

Note that IF blocks changed representation in https://github.com/camfort/fortran-src/commit/0bc7353f63358b07d9ca43a9d5a31ba02c4930ef (not just turning to non-empty lists).

uNouss commented 1 year ago

Yes, I try to analyse the AST.

In fact, it would be better to analyse the structure rather than the syntax of the AST, especially as the syntax can change faster than the structure itself (in my opinion).

That's why I did this issue #215 to serialise the AST to XML or JSON.

But the implementation that was done was less precise (in particular the lack of tree node types) than the one produced by the AST, so I stuck with the idea of parsing the AST.

But I added support for those characters, so it's fine. I thought it was a mistake. It shows that I don't know anything about Haskell, hence the misunderstanding.

raehik commented 1 year ago

the implementation that was done was less precise (in particular the lack of tree node types) than the one produced by the AST

That's a shame -- I was hoping that the JSON representation wouldn't impact this. You still have typing, but where it's not required explicitly, constructors aren't printed. Let me give a quick example:

`cabal run -- fortran-src-extras serialize -v 90 -t yaml encode FILE`

```yaml meta: miFilename: miVersion: fortran90 program_units: - anno: [] blocks: - anno: [] blocks: null conditions: - - anno: [] left: anno: [] span: (2:14)-(2:14) tag: value value: contents: i tag: variable op: tag: lt right: anno: [] span: (2:21)-(2:21) tag: value value: contents: - '2' - null tag: integer span: (2:14)-(2:21) tag: binary - - anno: [] label: null span: (3:13)-(3:17) statement: anno: [] expression: anno: [] span: (3:17)-(3:17) tag: value value: contents: - '3' - null tag: integer span: (3:13)-(3:17) tag: assign_expression target: anno: [] span: (3:13)-(3:13) tag: value value: contents: i tag: variable tag: statement end_label: null label: null name: null span: (2:10)-(4:15) tag: if name: myprog span: (1:7)-(5:9) subprograms: null tag: main ```

Reformatting, we can see

program_units:
- blocks:
  - tag: if
    conditions: [ ... ]
    blocks: null # else clause
    span: (2:10)-(4:15)

which corresponds to BlIf. The prefix is omitted because it's inherent: the object is an element in a list of blocks. I suppose it may mean more context in parsing, since you have to think about the structure of the parent element. I generally like this, since a "global" parser that just looks at tag every time to decide the element is very fragile.

Does this work for your use case? I'd be interested in making it work if not.

RaoulHC commented 1 year ago

+1, using a properly serialised output is going to be a much better approach than trying to parse fortran-src stdout, if only because you should be able to use libraries for standard serialisation formats rather than writing an entire parser for it. Plus I don't think we should really make any guarantee about the format, as @raehik said it's just for humans to look at for debugging.

uNouss commented 1 year ago

@raehik @RaoulHC

Does this work for your use case? I'd be interested in making it work if not.

I will do some tests on more complete examples to give you feedback.

In fact, during my internship I had written a parser that parsed the output of fortran-src, which was in version 0.9, and the switch to version 0.12.0 forced me to rewrite that parser because the output I was parsing was not the same.

So yes, it is a problem that the output format is not guaranteed. In that sense, having a properly serialised representation of the AST in json with a more stable structure would be a big plus for me.

I'll do the tests in the next few days and give you feedback on possible improvements.

Anyway, thanks for the feedback.

uNouss commented 1 year ago

Hello, I did some more tests with other files. So far the json serialiser seems to work fine. So it might be a better alternative than relying on the parser output. If I find things that don't work, I'll open issues.

^^

raehik commented 1 year ago

Fantastic! Glad to hear @uNouss . In that case I'll close this issue.

camfort / fortran-src

Unwanted characters received after parsing #258

EDIT