diku-dk / alpacc


Parallel Lexing. #6

Closed WilliamDue closed 1 year ago

WilliamDue commented 1 year ago

Currently only the RegEx can be parsed; the NFA will need to be constructed next.

athas commented 1 year ago

Implementing the NFA/DFA translation as a learning exercise is a good idea, but don't get too optimistic: the real challenge is how to execute an arbitrary NFA (or DFA) in parallel.

WilliamDue commented 1 year ago

@athas can you help me out? I am using the latest version of Futhark (26.0 prerelease from today). When I generate json.fut using cabal run alpacc -- grammars/json.cg -k 3, open the file in the REPL with futhark repl json.fut, and do parser.transitions "test" |> parser.solve_transitions, I get this error:

Internal compiler error (unhandled IO exception).
Please report this at https://github.com/diku-dk/futhark/issues
transitions_size is not bound to a value.

CallStack (from HasCallStack):
  error, called at src/Language/Futhark/Interpreter.hs:654:7 in futhark-0.26.0-8TxRAg15nJlFGbXOt63c7o:Language.Futhark.Interpreter

Should I listen to the error, or is there something else at play?

athas commented 1 year ago

You should listen to the error. It is a compiler bug, as the error explicitly tells you. My guess is a corner case in the interpreter.

WilliamDue commented 1 year ago

@athas I think I am a bit stuck and wondered if you could help me out. I am able to construct a table like the one in figure 5 [1], which gives the transitions taken in the DFA. An example of doing so is creating the S-expression grammar with cabal run alpacc -- grammars/sexp.cg and entering the REPL:

[0]> "test" |> parser.transitions |> parser.solve_transitions
[[(0, 0), (1, 0), (2, 1), (3, 1), (4, 4)],
 [(0, 0), (1, 0), (2, 1), (3, 1), (4, 4)],
 [(0, 0), (1, 0), (2, 1), (3, 1), (4, 4)],
 [(0, 0), (1, 0), (2, 1), (3, 1), (4, 4)]]

The problem is that I want to reduce this to find the DFA path they specify in figure 6 [1]. So I have defined an initial state:

def initial_state : state_type = 3

which I would then use to find the path taken: [(3, 1), (1, 0), (0, 0), (0, 0)]. The problem here is that I am not sure how this is done in O(log n) parallel time.

Also, the paper mentions extensions to the algorithm for tokenizing, and I am not quite sure I understand the idea. I think I want to create a table which maps a transition (state_type, state_type) to a set of terminals, and then somehow solve for which of the overlapping terminals is prioritized?

[1] Hill, J.M.D., Parallel lexical analysis and parsing on the AMT distributed array processor, Parallel Computing 18 (1992) 699-714.

athas commented 1 year ago

I'll read the paper and get back to you, but it might take a few days.

WilliamDue commented 1 year ago

I'll read the paper and get back to you, but it might take a few days.

Awesome, thank you! I have plenty of time.

WilliamDue commented 1 year ago

I'll read the paper and get back to you, but it might take a few days.

By the way, about this problem: what I have done for now is use a sequential version.

  def path [n] (trans: [n][G.transitions_size]transition_type) : [n]transition_type =
    -- Walk the table row by row, carrying the current state along.
    (loop result = (replicate n (-1, -1), G.initial_state) for i < n do
      let old_arr = result.0
      let old_st = result.1
      -- Record the transition taken from the current state...
      let new_arr = old_arr with [i] = trans[i][old_st]
      -- ...and step to its target state.
      let new_st = new_arr[i].1
      in (new_arr, new_st)
    ) |> (.0)

Also, I think it is getting really close to working, but it probably is not that close yet, since I have not set up tests. I think the last part is finding a way to group the token indexes together correctly. For the sexp.cg grammar the lexer currently outputs:

[1]> parser.lexer "((test))"
[3, 3, 2, 2, 2, 2, 4, 4]

But it should be:

[1]> parser.lexer "((test))"
[3, 3, 2, 4, 4]
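
A minimal sketch of the grouping step (my own illustration, not alpacc code), assuming a hypothetical per-character starts flag marking where a new token begins (e.g. where the DFA walk restarts); keeping one id per flagged position turns [3, 3, 2, 2, 2, 2, 4, 4] into [3, 3, 2, 4, 4]:

def group [n] (ids: [n]u8) (starts: [n]bool) : []u8 =
  -- Keep a token id only at positions where its token starts.
  zip ids starts |> filter (.1) |> map (.0)
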
athas commented 1 year ago

I am slightly worried about that function, as it applies transitions sequentially. To my knowledge, the usual way to parallelise such a thing is to map each input element to a transition function, and then do a reduction with function composition. This can be effective if you can find an efficient way to represent the transition function (it obviously cannot be a Futhark-level function). I used that technique in last year's Advent of Code: https://futhark-lang.org/blog/2022-12-10-case-study.html

However, I don't offhand see how the transition function won't end up being too large to be efficient.
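
To make the tabulated-function idea concrete, here is a minimal toy sketch (not alpacc's generated code): a 3-state DFA over a two-symbol alphabet, where each input symbol becomes a vector of target states and the vectors are composed with a scan.

-- table[c] is the transition function on symbol c, tabulated as a
-- vector: table[c][s] is the state reached from state s.
def table : [2][3]i64 = [[1, 1, 2],
                         [2, 1, 2]]

-- Composition of tabulated functions: apply a, then b.
def compose (a: [3]i64) (b: [3]i64) : [3]i64 =
  map (\s -> b[s]) a

-- The identity function is neutral for compose.
def identity : [3]i64 = iota 3

-- The state after each prefix of the input, starting in state 0.
def run [n] (input: [n]i64) : [n]i64 =
  map (\c -> table[c]) input
  |> scan compose identity
  |> map (\f -> f[0])

Here each scan element is a vector whose length is the number of DFA states, which is exactly why the representation can become too large.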

WilliamDue commented 1 year ago

Strange, I thought it would be easier, since they do not really elaborate on how to reduce it in the paper.

WilliamDue commented 1 year ago

I think I have found a solution, using the ideas you mentioned. I prepend a vector of initial states at the beginning of the table. Then I do a scan using function composition on the vectors of transitions. I believe this will always result in each row in the table becoming the same path, namely the one that starts at the initial state.

WilliamDue commented 1 year ago

I think I have the parallel lexing down now; it looks like I can produce the correct sequence of terminals using the lexer function. I should also make it find the substring span of each terminal. And I need to find a good way of testing it; maybe property-based testing is an idea again.

@athas can you give me some input on a design choice? I want to filter out some tokens, like the spaces in the case of the S-expression grammar. How should this be done in the .cg file? Or is there a better idea than doing this?

athas commented 1 year ago

A simple solution would be to say that a terminal named whitespace is always removed before parsing.
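
A sketch of what that could look like (hypothetical names, assuming the lexer yields a token-type id per token): filter out the id assigned to the whitespace terminal before handing the tokens to the parser.

def strip_whitespace [n] (whitespace_id: u64) (tokens: [n]u64) : []u64 =
  filter (!= whitespace_id) tokens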

WilliamDue commented 1 year ago

I have run into something weird, I think: the type of x is reported as [0]u32 even though it holds three elements.

[34]> let x = parser.lexer "uuv" |> from_maybe ([],[]) |> (.0)                                                                           
[35]> :t x
[0]u32
[36]> x   
[2, 2, 3]

I will try to investigate this when I have more energy. Source: parser_llp11_0.zip

WilliamDue commented 1 year ago

@athas I think at this stage the lexer generator works. I have some tests in futhark-tests/lexer-tests, including some overlapping-pattern tests that pass. I have added a --lexer flag which generates only the lexer, with a lex entry that returns an [n][3]u64 array where, in each [3]u64 row, the first element is the token type, the second is the index where the token starts, and the third is the index where the token ends.

Is this implementation what you would expect from a lexer generator? Also, when I solve for a path through the DFA, I do a scan on a 2D array of size (number of states) × (string input size). I have chosen to make this 2D array as small as possible by minimizing the number of DFA states. This results in the need for marking every edge with the set of tokens it could belong to. Using this, I do an intersection scan and find the most prioritized token in the sets (I think this is the proper way of solving it, but I should probably ask Torben). As an alternative, I could just not minimize the DFA and not do the intersection scan. I am wondering if you think this minimization of the DFA is worth it? I know the DFA can be exponentially large, but I do not know how likely that is.
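
A sketch of the intersection idea as I read it (my illustration, not the generated code), with token sets encoded as bitsets; bitwise AND is associative with all-ones as its neutral element, so it can be used in a scan or reduce:

type token_set = u64

-- Intersection of token sets; associative, with u64.highest
-- (all bits set) as the neutral element.
def intersect (a: token_set) (b: token_set) : token_set = a & b

-- Take the lowest set bit as the most prioritized token.
def prioritized (s: token_set) : i32 = u64.ctz s

-- Resolve the token of a span from the sets on its edges.
def resolve [n] (edge_sets: [n]token_set) : i32 =
  reduce intersect u64.highest edge_sets |> prioritized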

WilliamDue commented 1 year ago

Never mind, I have just realized I have not tested this well enough, since probably some of the binary operations I use in the scans are not associative.

athas commented 1 year ago

Make sure to also check whether the neutral elements are actually neutral. It's a very easy mistake to make.
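
For what it is worth, futhark test can check such properties on random inputs. A sketch for bitwise AND with u64.highest as the claimed neutral element (stand-ins for whatever operator and element are under test):

-- ==
-- entry: neutral_ok associative_ok
-- random input { [1000]u64 } output { true }

entry neutral_ok (xs: []u64) : bool =
  all (\x -> (x & u64.highest) == x && (u64.highest & x) == x) xs

entry associative_ok (xs: []u64) : bool =
  -- Check (x & y) & z == x & (y & z) on consecutive triples.
  all (\i -> ((xs[i] & xs[i+1]) & xs[i+2]) == (xs[i] & (xs[i+1] & xs[i+2])))
      (iota (length xs - 2))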

WilliamDue commented 1 year ago

When using the OpenCL backend I can get the tests to pass now. I believe the problem was that the operation in my intersection scan was not associative. Now I just map an intersection reduce over the token spans, so in the best case it has O(1) and in the worst case O(log n) parallel time complexity. Also, I do not know if this Futhark code is any good.

athas commented 1 year ago

It is generally good practice to move complexity into a map you do before the scan or reduce. An operator with complex control flow is usually not associative (although exceptions exist).
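
As a trivial sketch of that shape (my example, not project code): do the complicated per-element work in a map, and keep the operator handed to reduce or scan simple and associative.

-- Count bytes satisfying a predicate: the per-element work lives in
-- the map; the reduction operator is plain addition.
def count_matching [n] (p: u8 -> bool) (xs: [n]u8) : i64 =
  map (\x -> if p x then 1i64 else 0) xs
  |> reduce (+) 0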

On large input (e.g. maybe 100MiB of input), do you see any speedup with the opencl backend compared to the c backend?

WilliamDue commented 1 year ago

@athas It does make a difference, but I hope it is a good enough difference. Using the C backend the real time was 1m31.8s on a Ryzen 5 1600X, and using the OpenCL backend the real time was 0m18.9s on an RTX 3060.

I used this little script to generate some random s-expression tokens and used your txt2fut.py script to convert it.

import random
import string

def main():
    result = ''

    # Append random tokens until we have at least 100 MiB of text.
    while len(result) < 100 * 2**20:
        i = random.randint(0, 3)

        if i == 0:
            result += ' ' * random.randint(1, 5)
        elif i == 1:
            # An atom of 1-20 random ASCII letters.
            result += ''.join(random.choices(string.ascii_letters,
                              k=random.randint(1, 20)))
        elif i == 2:
            result += ')'
        elif i == 3:
            result += '('

    with open('code.lisp', 'a') as fp:
        fp.write(result)

if __name__ == '__main__':
    main()
WilliamDue commented 1 year ago

Also the --lexer flag does not work at the moment for some reason.

athas commented 1 year ago

Try with the cuda backend. It has much faster scans.

WilliamDue commented 1 year ago

Try with the cuda backend. It has much faster scans.

With the CUDA backend I got 0m14.3s. I am making a 1GiB file, and I should probably be using futhark bench instead of time(1).

athas commented 1 year ago

Yes, although at those runtimes I'm not sure the difference is significant.

WilliamDue commented 1 year ago

I cannot test the 1GiB file because it would use too much memory. I think I will have to find a way to save some memory when I construct the (number of states) × (string input size) 2D array.

athas commented 1 year ago

The main cost centre by far is the scan with compose_transition_vectors_identity. Scans generally work badly when the operator is complicated, and particularly when the element type is an array. This one almost works because an operator that is just a map2 is generally quite efficient, but then you have that #nothing-based identity element. If you could figure out a way to encode the identity element such that the operator is a map2, that would likely be a lot faster.

athas commented 1 year ago

Alternatively, using tuples instead of arrays might help, although with large tuples you may run into other problems. It's difficult to predict what would be best. Perhaps the builtin scan should not be used, and instead a hand-rolled work-efficient scan should be employed. Those are normally less efficient, but can be better when the operator contains parallelism.
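
For illustration, a hand-rolled scan in Futhark is a loop of maps. This sketch is the simple step-efficient (Hillis-Steele) variant rather than a work-efficient one, but it shows the shape such a scan takes; a work-efficient version does fewer applications of op in total, which is what can pay off when op itself contains parallelism.

-- Inclusive scan as ceil(log2 n) rounds of shifted combines,
-- assuming op is associative.
def hand_scan [n] 'a (op: a -> a -> a) (xs: [n]a) : [n]a =
  loop ys = xs for i < 64 - i64.clz (i64.max 1 (n - 1)) do
    let d = 1i64 << i64.i32 i
    in tabulate n (\j -> if j >= d then ys[j - d] `op` ys[j] else ys[j])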

WilliamDue commented 1 year ago

I think I sort of understand the ideas; I will look into it on Monday, maybe. Also, if you want to and have time for it, feel free to push changes to this branch! 😃

WilliamDue commented 1 year ago

I believe there is no naturally occurring identity for combine_transitions, but I think I could simplify the combine function used in combine_transitions to:

type transition = state

def combine (_ : transition) (a : transition) : transition =
    a

Since in each transition x -> y the x is actually just the index, this helps with memory usage.

Something I am questioning is whether moving the identity inside the combine function in combine_transitions is an optimization:

def combine_transitions [n] (a : [n](maybe transition)) (b : [n](maybe transition)) : [n](maybe transition) =
    map2 (add_identity combine) a b

It is possible to change compose_transition_vectors_identity such that it does not need the boolean which specifies if it is the final transition. The change would make the function use only function composition, so a neutral element can be found, which is just 0..<L.number_of_states. This requires changing the regular expression. Currently the regular expression used for sexp.cg, as an example, is \(|\)|[a-z]+|(\s|\n|\t); here, every time a final state is reached in the DFA (the end boolean), the DFA walk is restarted. The changed regular expression would be (\(|\)|[a-z]+|(\s|\n|\t))*, and the accepting states would then be solved for after the walk. I had implemented something like this before; it worked every time in small tests, but I do not know if it actually works in general.
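
Concretely, the neutral element and operator would then look something like this sketch (sizes assumed, not the generated code):

def number_of_states : i64 = 5

type state_vector = [number_of_states]u8

-- With every scan element a total function on states, the identity
-- permutation is a true neutral element for composition.
def identity_vector : state_vector =
  map u8.i64 (iota number_of_states)

def compose (a: state_vector) (b: state_vector) : state_vector =
  map (\s -> b[u8.to_i64 s]) a

-- The scan then needs no #nothing-wrapped identity element.
def paths [n] (vs: [n]state_vector) : [n]state_vector =
  scan compose identity_vector vs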

athas commented 1 year ago

Yes, that would be better. If at all possible, reduction and scan operators should be on scalars (or tuples of scalars), or of the form map2 f where f is then on scalars or tuples of scalars.

WilliamDue commented 1 year ago

I simplified the scans quite a bit, but it only resulted in a 2 second speed-up using the CUDA backend. I am guessing compose_transition_vectors should probably be using tuples? Current compose_transition_vectors implementation:

def compose_transition_vectors (a : state_vector) (b : state_vector) : state_vector =
    map (\s ->
      b[state_module.to_i64 s]
    ) a
WilliamDue commented 1 year ago


I have tried this now. It did not result in better speeds; it seems it got a second slower.

WilliamDue commented 1 year ago

If I run it a few times and pick the best time with the CUDA backend on the 100MiB dataset, I get some better times, so it might be closer to a 3-4 second improvement. The best time I could get:

real    0m10.286s
user    0m8.530s
sys     0m1.320s

I think some of the maps could be fused into one function; that might also help.

athas commented 1 year ago

Use futhark bench for benchmarking. Don't roll your own.
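
For reference, a benchmark block in the Futhark file plus one command is all it takes (file and dataset names taken from this thread):

-- ==
-- entry: lex
-- compiled input @ large.in
-- compiled input @ huge.in

and then run futhark bench --backend=cuda sexp.fut.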

WilliamDue commented 1 year ago

I think a memory optimization I could do is packing the vector of states into a single u64, since each state can most likely be expressed using a u8.
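
A sketch of the packing (assuming at most 8 states, so a vector of 8 one-byte states fits in one u64):

-- Pack 8 states of one byte each into a single u64.
def pack (states: [8]u8) : u64 =
  loop acc = 0u64 for i < 8i64 do
    acc | (u64.u8 states[i] << (8 * u64.i64 i))

-- Extract state i from a packed word.
def unpack (w: u64) (i: i64) : u8 =
  u8.u64 (w >> (8 * u64.i64 i))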

WilliamDue commented 1 year ago

Use futhark bench for benchmarking. Don't roll your own.

Last time I tried to use it, it took quite a long time; would it be fine to downsize the dataset?

athas commented 1 year ago

Sure, but it takes a long time because it tries to do a good job.

WilliamDue commented 1 year ago

Sure, but it takes a long time because it tries to do a good job.

Now I feel bad for being an impatient little prick.

WilliamDue commented 1 year ago

I think my code needs to be more memory efficient, since I sometimes get an out-of-memory error when running it. I have 12GiB of VRAM, so I am quite sure my code is bad. Also, when I try to run futhark bench, the benchmark is killed too. I have also tried doing the benchmark on a Titan RTX now, and the benchmark is killed there too. Slurm says:

some of your processes may have been killed by the cgroup out-of-memory handler.

So I am guessing the benchmark dies here also because of the lack of memory.

WilliamDue commented 1 year ago

Also, the A40 deals with it the same way as the Titan RTX.

WilliamDue commented 1 year ago

Never mind, I am being dumb; the benchmark being killed is related to RAM usage... I used to get a CUDA error saying something about not being able to allocate memory on "device", so I thought the error was related to VRAM.

WilliamDue commented 1 year ago

Some dumb mistakes later... here are the CUDA benchmark results, @athas.

Compiling sexp.fut...
Reporting arithmetic mean runtime of at least 10 runs for each dataset (min 0.5s).
More runs automatically performed for up to 300s to ensure accurate measurement.

sexp.fut (no tuning file):
large.in:      84623μs (95% CI: [   84544.7,    84697.6])

Edit 0: Wait, maybe this is wrong; do I have to specify the output? Also, this was done on an A40.

Edit 1: Argh, even if I specify the output it does not make a difference. It is surprisingly fast; I get why I should not use time(1).

WilliamDue commented 1 year ago

Using the 1GiB dataset on an A40 I get:

Compiling sexp.fut...
Reporting arithmetic mean runtime of at least 10 runs for each dataset (min 0.5s).
More runs automatically performed for up to 300s to ensure accurate measurement.

sexp.fut (no tuning file):
huge.in:    3061740μs (95% CI: [ 3059687.4,  3064417.4])

I think I have some decent lexer tests currently, but I guess they could be better.

athas commented 1 year ago

How big is huge.in?

WilliamDue commented 1 year ago

How big is huge.in?

It should be a little over 1GiB.

athas commented 1 year ago

So 300MiB/s? I must admit I have little intuition for whether that is fast, as far as parsing goes. Probably not, but there is still work to do in speeding this up.

WilliamDue commented 1 year ago

So 300MiB/s? I must admit I have little intuition for whether that is fast, as far as parsing goes. Probably not, but there is still work to do in speeding this up.

It probably is not that great if the logos benchmarks are comparable.

WilliamDue commented 1 year ago

I lost the original dataset (which did not use the futhark data format), so I generated a new 1GiB file. I used this to compare with the logos Rust lexer library, which took 4562053μs to do the lexing. I made sure to use optimization level 3 when making the release build for the test.

I am no Rust expert but I hope this is a fair comparison.

use std::fmt;
use std::io::Read;
use logos::Logos;
use std::fs::File;
use std::time::SystemTime;

#[derive(Logos, Debug, PartialEq)]
#[logos(skip r"[ \t\n]+")] // Ignore this regex pattern between tokens
enum Token {
    #[regex("[a-zA-Z]+")]
    Atom,
    #[token("(")]
    LParen,
    #[token(")")]
    RParen,
}

impl fmt::Display for Token {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        let token: &str = match self {
            Token::Atom => { "Atom" }
            Token::LParen => { "LParen" }
            Token::RParen => { "RParen" }
        };
        write!(f, "{}", token)
    }
}

fn main() {
    let mut file = File::open("raw-huge.in").unwrap();
    let mut data = vec![];
    let _ = file.read_to_end(&mut data);
    let s = String::from_utf8(data).unwrap();

    let mut lex = Token::lexer(&s);
    let mut result = Vec::new();

    // Time only the lexing loop; reading the file is excluded.
    let now = SystemTime::now();
    while let Some(step) = lex.next() {
        match step {
            Ok(token) => {
                result.push((token, lex.span()));
            }
            Err(e) => {
                println!("Error: {e:?}");
                break;
            }
        }
    }
    // println!("{:?}", result);
    match now.elapsed() {
        Ok(elapsed) => {
            println!("Time elapsed: {}μs", elapsed.as_micros());
        }
        Err(e) => {
            println!("Error: {e:?}");
        }
    }
}
WilliamDue commented 1 year ago

In my glorious adventure of trying to optimize my code, I found an error while changing the lex function.

Compiling sexp.fut...
Reporting arithmetic mean runtime of at least 10 runs for each dataset (min 0.5s).
More runs automatically performed for up to 300s to ensure accurate measurement.

sexp.fut:lex (no tuning file):
huge.in: 
Unknown error.  This is a compiler bug.

Failed to run sexp.fut
./sexp: sexp.c:8031: CUDA call
  gpu_free_all(ctx)
failed with error code 1 (invalid argument)

CallStack (from HasCallStack):
  error, called at src/Futhark/Server.hs:187:7 in futhark-server-1.2.2.1-6bIJRa1PKd9DOkMNzFraOf:Futhark.Server

No such problem happens with the OpenCL backend currently.

The OpenCL benchmark:

huge.in:    2641877μs (95% CI: [ 2576615.3,  2700206.8])

It is about 0.4s faster so far.

WilliamDue commented 1 year ago

Also, I can try to narrow down what code causes the bug, maybe on Saturday.

WilliamDue commented 1 year ago

I also get this error:

Compiling sexp.fut...
Reporting arithmetic mean runtime of at least 10 runs for each dataset (min 0.5s).
More runs automatically performed for up to 300s to ensure accurate measurement.

sexp.fut:lex (no tuning file):
large.in: 
sexp.c:8084: CUDA call
  cuMemcpyDtoH_v2(dst, src + offset, size)
failed with error code 700 (an illegal memory access was encountered)

Failed to run sexp.fut
./sexp: sexp.c:8031: CUDA call
  gpu_free_all(ctx)
failed with error code 1 (invalid argument)

CallStack (from HasCallStack):
  error, called at src/Futhark/Server.hs:187:7 in futhark-server-1.2.2.1-6bIJRa1PKd9DOkMNzFraOf:Futhark.Server