anitsh commented 4 years ago

Resource

[ ] https://github.com/jamiebuilds/the-super-tiny-compiler

anitsh commented 3 years ago

Lexing or Lexical Analysis or Tokenization

Lexical analysis is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of tokens (strings with an assigned and thus identified meaning). A program that performs lexical analysis may be termed a lexer, tokenizer, or scanner, although scanner is also a term for the first stage of a lexer. A lexer is generally combined with a parser, which together analyze the syntax of programming languages, web pages, and so forth. Lexers are generally quite simple, with most of the complexity deferred to the parser.

Lexing can be divided into two stages: scanning, which segments the input string into syntactic units called lexemes and categorizes these into token classes; and evaluating, which converts lexemes into processed values.
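A minimal sketch of the two stages, assuming only two token classes. The names `TOKEN_SPEC`, `scan`, and `evaluate` are made up for this note and do not come from any particular lexer:

```python
import re

# Hypothetical token classes for this sketch; a real lexer defines many more.
TOKEN_SPEC = [
    ("literal", r"\d+(?:\.\d+)?(?:[eE][+-]?\d+)?"),
    ("identifier", r"[A-Za-z_]\w*"),
    ("skip", r"\s+"),
]
SCANNER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def scan(text):
    """Stage 1: segment the input into (token class, lexeme) pairs."""
    pos = 0
    while pos < len(text):
        match = SCANNER.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at position {pos}")
        pos = match.end()
        if match.lastgroup != "skip":  # whitespace separates lexemes but is not a token
            yield match.lastgroup, match.group()


def evaluate(token_class, lexeme):
    """Stage 2: convert a lexeme into a processed value."""
    if token_class == "literal":
        is_float = any(ch in lexeme for ch in ".eE")
        value = float(lexeme) if is_float else int(lexeme)
        return token_class, value
    return token_class, lexeme  # identifiers keep their text as the value


tokens = [evaluate(*pair) for pair in scan("rate 6.02e23 42")]
print(tokens)
```

Keeping the stages separate means the scanner only has to know where lexemes start and end, while the evaluator decides what each one means.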

A lexical token or simply token is a string with an assigned and thus identified meaning. It is structured as a pair consisting of a token name and an optional token value. The token name is a category of lexical unit. Common token names are:

identifier: names the programmer chooses;
keyword: names already in the programming language;
separator (also known as punctuator): punctuation characters and paired delimiters;
operator: symbols that operate on arguments and produce results;
literal: numeric, logical, textual, reference literals;
comment: line, block (whether comments become tokens depends on the compiler; if it does not tokenize them, they are stripped).

Token name	Sample token values
identifier	x, color, UP
keyword	if, while, return
separator	}, (, ;
operator	+, <, =
literal	true, 6.02e23, "music"
comment	/* Retrieves user data */, // must be negative

Consider this expression in the C programming language: x = a + b * 2;

The lexical analysis of this expression yields the following sequence of tokens: [(identifier, x), (operator, =), (identifier, a), (operator, +), (identifier, b), (operator, *), (literal, 2), (separator, ;)]
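The token sequence above can be reproduced with a small regex-based tokenizer. This is only a sketch: the patterns cover just enough of C to handle this one expression, and `KEYWORDS`, `TOKEN_SPEC`, and `tokenize` are names invented for this note.

```python
import re

KEYWORDS = {"if", "while", "return"}  # sample keywords from the table above

# Assumed, heavily simplified patterns; real C lexing has many more cases.
TOKEN_SPEC = [
    ("literal", r"\d+"),
    ("identifier", r"[A-Za-z_]\w*"),
    ("operator", r"[+\-*/<>=]"),
    ("separator", r"[{}();,]"),
    ("whitespace", r"\s+"),
]
PATTERN = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source):
    """Return a list of (token name, token value) pairs."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = PATTERN.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at position {pos}")
        pos = match.end()
        name, lexeme = match.lastgroup, match.group()
        if name == "whitespace":
            continue
        if name == "identifier" and lexeme in KEYWORDS:
            name = "keyword"  # keywords look like identifiers but are reserved
        tokens.append((name, lexeme))
    return tokens


print(tokenize("x = a + b * 2;"))
```

Note that the literal `2` is kept as the string `"2"` here; converting it to a number would be the job of the evaluating stage.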

A token name is what might be termed a part of speech in linguistics.

Tokenization is the process of demarcating and possibly classifying sections of a string of input characters.

The specification of a programming language often includes a set of rules, the lexical grammar, which defines the lexical syntax. For example, the text string "The quick brown fox jumps over the lazy dog" can be split on spaces into nine tokens, which could be represented in XML as:

<sentence>
  <word>The</word>
  <word>quick</word>
  <word>brown</word>
  <word>fox</word>
  <word>jumps</word>
  <word>over</word>
  <word>the</word>
  <word>lazy</word>
  <word>dog</word>
</sentence>
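A rough sketch of producing that XML with Python's standard `xml.etree.ElementTree` module, assuming whitespace is the only delimiter. The output is a single line without the indentation shown above:

```python
import xml.etree.ElementTree as ET

text = "The quick brown fox jumps over the lazy dog"

# Build <sentence> with one <word> child per whitespace-separated lexeme.
sentence = ET.Element("sentence")
for lexeme in text.split():
    ET.SubElement(sentence, "word").text = lexeme

xml_text = ET.tostring(sentence, encoding="unicode")
print(xml_text)
```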

When a token class represents more than one possible lexeme, the lexer often saves enough information to reproduce the original lexeme, so that it can be used in semantic analysis. The parser typically retrieves this information from the lexer and stores it in the abstract syntax tree. This is necessary in order to avoid information loss in the case of numbers and identifiers.
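To illustrate the information loss, here is a small sketch in which two different lexemes evaluate to the same value. The `Token` class and `make_literal` helper are hypothetical, and Python's integer literal syntax stands in for the source language:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    name: str      # token class, e.g. "literal"
    lexeme: str    # original text, kept so it can be reproduced later
    value: object  # processed value produced by the evaluator


def make_literal(lexeme):
    # int(..., 0) infers the base from a prefix such as 0x.
    return Token("literal", lexeme, int(lexeme, 0))


hex_token = make_literal("0x10")
dec_token = make_literal("16")

# Same value, different spelling: only the lexeme tells them apart.
print(hex_token.value == dec_token.value, hex_token.lexeme, dec_token.lexeme)
```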

Semantic analysis, or context-sensitive analysis, is a process in compiler construction, usually performed after parsing, that gathers necessary semantic information from the source code. It usually includes type checking and checks such as making sure a variable is declared before use, which cannot be described in extended Backus–Naur form and thus is not easily detected during parsing.
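A toy version of the declared-before-use check, assuming the program has already been reduced to a flat list of `("declare", name)` and `("use", name)` steps. That representation is invented for this sketch; a real compiler would walk the syntax tree and track scopes.

```python
def check_declared_before_use(statements):
    """Return an error message for every name used before it is declared."""
    declared = set()
    errors = []
    for index, (kind, name) in enumerate(statements):
        if kind == "declare":
            declared.add(name)
        elif kind == "use" and name not in declared:
            errors.append(f"statement {index}: '{name}' used before declaration")
    return errors


program = [("declare", "a"), ("use", "a"), ("use", "b"), ("declare", "b")]
print(check_declared_before_use(program))
```

The check needs state (the set of names seen so far), which is the kind of context a grammar rule alone cannot carry.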

anitsh commented 3 years ago

Parsing, Syntax Analysis, or Syntactic Analysis

Parsing is the process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar.

In computer science, the term refers to the formal analysis by a computer of a sentence or other string of words into its constituents, resulting in a parse tree that shows their syntactic relation to each other and may also contain semantic and other information (p-values). Some parsing algorithms may generate a parse forest or list of parse trees for a syntactically ambiguous input.

The term is used in the analysis of computer languages, referring to the syntactic analysis of the input code into its component parts in order to facilitate the writing of compilers and interpreters. The term may also be used to describe a split or separation.

A parse tree or parsing tree or derivation tree or concrete syntax tree is an ordered, rooted tree that represents the syntactic structure of a string according to some context-free grammar. The term parse tree itself is used primarily in computational linguistics; in theoretical syntax, the term syntax tree is more common.
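As a sketch, a tiny recursive descent parser can build such a tree for `a + b * 2`. The grammar (`expr`, `term`, `factor`) and the nested-tuple tree format are assumptions made for this note, not taken from any of the linked resources:

```python
def parse_expr(tokens, pos=0):
    """expr := term ('+' term)*"""
    node, pos = parse_term(tokens, pos)
    children = [node]
    while pos < len(tokens) and tokens[pos] == "+":
        right, pos = parse_term(tokens, pos + 1)
        children += ["+", right]
    return ("expr", *children), pos


def parse_term(tokens, pos):
    """term := factor ('*' factor)*"""
    node, pos = parse_factor(tokens, pos)
    children = [node]
    while pos < len(tokens) and tokens[pos] == "*":
        right, pos = parse_factor(tokens, pos + 1)
        children += ["*", right]
    return ("term", *children), pos


def parse_factor(tokens, pos):
    """factor := identifier | literal"""
    if pos >= len(tokens) or tokens[pos] in ("+", "*"):
        raise SyntaxError(f"expected identifier or literal at token {pos}")
    return ("factor", tokens[pos]), pos + 1


tree, end = parse_expr(["a", "+", "b", "*", "2"])
print(tree)
```

Every grammar rule that fires shows up as a node, including the single-child `term` around `a`, which is what makes this a concrete tree rather than an abstract one.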

Concrete syntax trees reflect the syntax of the input language, making them distinct from the abstract syntax trees used in computer programming. Unlike Reed-Kellogg sentence diagrams used for teaching grammar, parse trees do not use distinct symbol shapes for different types of constituents. Parse trees are usually constructed based on either the constituency relation of constituency grammars (phrase structure grammars) or the dependency relation of dependency grammars.
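One way to see the difference is to collapse a concrete tree into an abstract one. The sketch below assumes a nested-tuple concrete tree with `expr`, `term`, and `factor` nodes (a made-up format) and turns it into `(operator, left, right)` tuples:

```python
def to_ast(node):
    """Drop grammar-only nodes, keeping operators and operands."""
    if isinstance(node, str):
        return node
    label, *children = node
    if label == "factor":
        return children[0]
    # expr/term nodes alternate operand, operator, operand, ...
    result = to_ast(children[0])
    for i in range(1, len(children), 2):
        result = (children[i], result, to_ast(children[i + 1]))
    return result


# Hypothetical concrete syntax tree for: a + b * 2
cst = (
    "expr",
    ("term", ("factor", "a")),
    "+",
    ("term", ("factor", "b"), "*", ("factor", "2")),
)
ast = to_ast(cst)
print(ast)
```

The abstract tree no longer records which grammar rules were applied, only that `b * 2` is the right operand of `+`.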

[ ] https://dzone.com/articles/parsing-in-java-part-1-structures-trees-and-rules
[ ] https://github.com/antlr/antlr4
[ ] Parsing Explained - Computerphile https://www.youtube.com/watch?v=bxpc9Pp5pZM
[ ] Parsing Bottom Up - Computerphile https://www.youtube.com/watch?v=tH5AOX9929g
[ ] Functional Parsing - Computerphile https://www.youtube.com/watch?v=dDtZLm7HIJs
[ ] https://www.eclipse.org/articles/Article-JavaCodeManipulation_AST
[ ] https://en.wikipedia.org/wiki/Visitor_pattern
[ ] https://stackoverflow.com/questions/9565073/java-abstract-syntax-tree
[ ] https://ruslanspivak.com/lsbasi-part7

anitsh commented 3 years ago

Lexer, Parser, Compiler #57

Resource

Lexing or Lexical Analysis or Tokenization

Parsing, Syntax Analysis, or Syntactic Analysis

Compiler