antlr / grammars-v4

Grammars written for ANTLR v4; expectation that the grammars are free of actions.
MIT License

[fortran] Add latest Fortran grammar, addressing issues in older versions #4096

Open kaby76 opened 1 month ago

kaby76 commented 1 month ago

(This initial comment has been updated to clarify the topic. Replies to it may not make sense because the description has changed.)

The grammars for Fortran need to be updated.

There are two Fortran grammars in this repository: fortran77 and fortran90. We could fix the old grammars, which might be of value because problems in one grammar often appear in other versions. Unfortunately, there have been several major revisions of Fortran since the 1990 spec from WG5.

In addition, it is not clear whether the fortran77 and fortran90 grammars are derived from the published spec or from the draft just prior to publication, and what those differences may be.

ISO charges around $200 per copy of the downloadable PDF, and I doubt that anyone is really working directly from the published specs. The ISO specs are copyrighted material. Each PDF is licensed for use by a specific individual or institution, with "watermarks" identifying the licensee. If we plan on maintaining grammars for Fortran (or any of the other ISO programming language standards, for that matter), we will need to figure out how to get regular access.

Entering CFGs by hand from the specs is error-prone, untrustworthy, and not repeatable. The procedure should be automated. The Trash Toolkit should be used to automatically scrape and refactor the CFG from any version of the Fortran spec, and the result added to the directory https://github.com/antlr/grammars-v4/tree/6581f29d0cb63e3e337cd6dacec6602e34aa88d3/fortran. Any bugs in the grammar can be fixed, but these changes should be addressed in the scrape/refactor procedure as well.

suehshtri commented 1 month ago

I am learning.

useStmt
    : USE NAME
    | USE NAME COMMA ONLY COLON
    | USE NAME COMMA renameList
    | USE NAME COMMA ONLY COLON onlyList
    | USE COMMA INTRINSIC DOUBLECOLON NAME
    ;

though I appreciate your idea about moving to separate grammars to match the spec.

kaby76 commented 1 month ago

I re-added tritext to Trash and added the -m option, which adds <i>...</i> and <b>...</b> tagging around grammar rules where they appear in the Fortran spec. This way one can scrape the grammar from a spec and determine whether a symbol in the grammar is a non-terminal, a terminal, or grammar punctuation. For example, in "is", the "is" is boldface so as not to mistake it for anything but the LHS/RHS rule separator.

kaby76 commented 4 weeks ago

I now have a scraper implementation that extracts all the rules from the official spec ISO/IEC 1539-1:2023. Note, the spec costs ~$200, which I purchased. It seems some rules are missing from draft versions of the spec.

The scraper is done in two steps.

The first step runs tritext on the .pdf to extract all of its text. The tool reproduces the text indentation as seen in the spec. This is critical, because the spec does not have rule terminators like the Antlr4 ';'. To distinguish the continuation lines of a rule from non-rule text, or from another rule following the current one, any lines that are part of the rule are indented as we see in the spec.

So, instead of:

R601 alphanumeric-character is letter
or digit
or underscore
Except for the currency symbol, the graphics used for the characters shall be as given in 6.1.2, 6.1.3, 6.1.4, and
6.1.5. However, the style of any graphic is not specified.
...

the tool outputs:

R601 alphanumeric-character is letter
              or digit
              or underscore
Except for the currency symbol, the graphics used for the characters shall be as given in 6.1.2, 6.1.3, 6.1.4, and
6.1.5. However, the style of any graphic is not specified.
...

We can now "see" the end of the rule with the text "Except for the ...." because the text is flush with the left-hand margin.

The second step is a program to pull out the rules from this extracted text of the spec. The code for that program is:

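using System;
using System.IO;
using System.Text.RegularExpressions;

public class Program
{
    static void Main(string[] args)
    {
        string line;
        // A rule starts at the left margin with "R" and a digit, e.g. "R601".
        Regex rs = new Regex(@"^R\d");
        // Continuation lines of a rule are indented.
        Regex re = new Regex(@"^[ ]");
        // True while the current line belongs to a rule.
        bool do_print = false;
        while ((line = Console.ReadLine()) != null)
        {
            if (rs.IsMatch(line))
            {
                do_print = true;
                Console.WriteLine(line);
            }
            else if (do_print && re.IsMatch(line))
            {
                Console.WriteLine(line);
            }
            else
            {
                // Text flush with the left margin ends the rule.
                do_print = false;
            }
        }
    }
}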

I tried to use sed and/or awk, but I found the patterns too difficult to write, and ended up just coding it in C#.

The script to extract the rules is thus:

tritext ISO_IEC_1539-1_2023\(en\).pdf |  ConsoleApp.exe
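For readers without a .NET toolchain, the same two-pattern filter can probably be expressed in awk. The following is only a sketch that mirrors the logic of the C# program above; the function name extract_rules is made up for this example, and it has not been run against the real spec text.

```shell
# Hypothetical helper: keep "R<digit>..." rule headers and the indented
# lines that follow them; drop everything else (same logic as the C# filter).
extract_rules() {
    awk '
        /^R[0-9]/       { in_rule = 1; print; next }  # rule header such as "R601 ..."
        in_rule && /^ / { print; next }               # indented continuation of the rule
                        { in_rule = 0 }               # flush-left text ends the rule
    '
}
```

Under those assumptions it would slot into the pipeline in place of ConsoleApp.exe, e.g. tritext ISO_IEC_1539-1_2023\(en\).pdf | extract_rules.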

As the spec is copyrighted, I cannot post the extracted grammar here as is. The plan is to use Trash to mutate the syntax and transform the grammar into Antlr4 syntax, which can be posted.

I have spent a great deal of time trying to use ChatGPT to extract the rules from the text. ChatGPT can do it, but it is extremely slow, and requires constant prompting to "continue, please" to get to the end of the entire spec. Copilot does not work at all because quotes and slashes interfere with the prompt questioning. However, LLMs are ideal for scraping because they solve "feature extraction". I have not pursued this solution further.

skelter commented 4 weeks ago

Thank you @kaby76. Extraordinary effort stymied by the Fortran vendors and ISO-publication revenue stream.
Is there a 'support my efforts' link for you, so that the community could help with the costs of the grammar?

teverett commented 4 weeks ago

@kaby76 @skelter count me in, I'll send a couple dollars.

kaby76 commented 3 weeks ago

Additional conversations related to this issue.

AkhilAkkapelli commented 3 weeks ago

Hi everyone,

I wanted to share that my college has access to the latest Fortran specification: ISO/IEC 1539-1 Fifth edition 2023-11. I've managed to extract the rules in a format suitable for Antlr4 from the specification. Currently, I'm working on the grammar in a step-by-step process, adding rules incrementally and testing them. I'm about one-third of the way through the code.

As part of this effort, I'm also creating test Fortran files to verify the grammar implementation. If I encounter any issues during this process, I'll post them on Stack Overflow and GitHub with a minimal working example for discussion and resolution.

Based on my progress, I anticipate that the complete grammar might be available in some weeks, probably within a month.

kaby76 commented 3 weeks ago

It looks like the last draft spec that is available prior to the published 2018 version is https://j3-fortran.org/doc/year/18/18-007r1.pdf. That doc contains section and line numbering in the left margin, which presumably was used by WG5 to help identify where to make corrections to a draft. tritext will need to be updated to remove this junk. Update June 6 '24: I added an option to the tritext pdf reader to filter these out. Again, it would be better to do all this using an LLM, as "feature extraction" is exactly what LLMs excel at. I have not released the latest version of the Trash Toolkit (v0.23.1) because of regressions in trgen.

Older specs like the final draft spec for Fortran 1990, https://wg5-fortran.org/N001-N1100/N692.pdf, don't contain any text, so the PDF will need to be OCR'ed first in order to extract text.

kaby76 commented 3 weeks ago

Attached here is the rule extraction for Fortran 2018 from the last available draft at https://j3-fortran.org/doc/year/18/18-007r1.pdf, produced with tritext (the latest version is not released yet) and the above program.

18-007r1.txt

This text is, character for character, all the "R..." rules in the PDF, along with HTML markup for bold and italics.

This is successfully parsed using a custom grammar for "WG5 EBNF", which I wrote.

wg5_ebnf.g4.txt

I'm now in a position to refactor this to Antlr4, all in a repeatable, automated manner.

AkhilAkkapelli commented 2 weeks ago

I have developed the Fortran 2023 Grammar based on the latest Fortran specification: ISO/IEC 1539-1 Fifth edition 2023-11. You can access the grammar in my Github: Fortran2023Grammar. Please let me know if you find any errors.

kaby76 commented 2 weeks ago

Thanks, I will look it over.

You should rename your grammar files with the ".g4" extension. The file extension is extensively assumed in scripts, Github, etc.

kaby76 commented 2 weeks ago

On the issue of statement labels, I would adjust any parser rules to recognize the label. This is because it will be easier to write an XPath expression for both the defining occurrence of a label (i.e., the label that occurs before any statement) and the applied occurrence of a label (e.g., in "go to 100"). In addition, you won't need to change the parser rule label into a special token called LABEL.

AkhilAkkapelli commented 2 weeks ago

I encountered some grammar issues while parsing:

  • R865 letter-spec -> letter [- letter] is a problem because every letter is tokenized as NAME.

  • Placing the DIGITSTRING rule before the DIGIT rule in the lexer causes all single-digit numbers to be tokenized as DIGITSTRING instead of DIGIT. This creates an issue for the label rule:

R611 label -> digit [digit [digit [digit [digit]]]]

The grammar could not parse the following correctly:

  1. 101 format ( F9.2 ) as format-stmt, even after adding the label, because F9 is tokenized as NAME instead of F and DIGITSTRING.

  2. a-b as it gets tokenized as LETTERSPEC instead of NAME, MINUS, LETTER.
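Both failures are consistent with how an ANTLR lexer chooses tokens: at each position it takes the longest match, and when two rules match the same length, the rule listed first wins. The longest-match half of that can be reproduced outside ANTLR with a rough shell sketch. The patterns below are assumptions loosely modelled on the rule names mentioned above, not the actual lexer rules of the grammar, and grep only approximates a lexer.

```shell
# Hypothetical token patterns (not the real grammar), tried at each position:
#   LETTERSPEC-like: letter - letter     NAME-like: letter (letter | digit | _)*
#   DIGITSTRING-like: digit+             plus a few single punctuation characters
tokenize() {
    grep -oE '[A-Za-z]-[A-Za-z]|[A-Za-z][A-Za-z0-9_]*|[0-9]+|[-.()]'
}

printf '%s\n' 'F9.2' | tokenize   # F9 comes out as a single NAME-like token
printf '%s\n' 'a-b'  | tokenize   # a-b comes out as a single LETTERSPEC-like token
```

In both cases the longer token swallows the pieces the parser rule expected to see separately, which is why reordering lexer rules alone is unlikely to help.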

AkhilAkkapelli commented 1 week ago

@kaby76 Could you include the grammar in grammars-v4? There are minor issues, but they can be resolved with everyone's involvement.