Closed WilliamDue closed 1 year ago
Implementing the NFA/DFA translation as a learning exercise is a good idea, but don't get too optimistic: the real challenge is how to execute an arbitrary NFA (or DFA) in parallel.
@athas can you help me out? I am using the latest version of Futhark (0.26.0 prerelease from today). When I generate json.fut using cabal run alpacc -- grammars/json.cg -k 3, open the file in the REPL with futhark repl json.fut, and do parser.transitions "test" |> parser.solve_transitions, I get this error:
Internal compiler error (unhandled IO exception).
Please report this at https://github.com/diku-dk/futhark/issues
transitions_size is not bound to a value.
CallStack (from HasCallStack):
error, called at src/Language/Futhark/Interpreter.hs:654:7 in futhark-0.26.0-8TxRAg15nJlFGbXOt63c7o:Language.Futhark.Interpreter
Should I listen to the error, or is there something else at play?
You should listen to the error. It is a compiler bug, as the error explicitly tells you. My guess is a corner case in the interpreter.
@athas I think I am a bit stuck, and wondered if you could help me out.
I am able to construct a table like the one in figure 5 [1] which gives the transitions taken in the DFA.
An example of doing so is creating the S-expression grammar with cabal run alpacc -- grammars/sexp.cg and entering the REPL:
[0]> "test" |> parser.transitions |> parser.solve_transitions
[[(0, 0), (1, 0), (2, 1), (3, 1), (4, 4)],
[(0, 0), (1, 0), (2, 1), (3, 1), (4, 4)],
[(0, 0), (1, 0), (2, 1), (3, 1), (4, 4)],
[(0, 0), (1, 0), (2, 1), (3, 1), (4, 4)]]
The problem is that I want to reduce this to find the DFA path specified in figure 6 [1]. So I have defined an initial state:
def initial_state : state_type = 3
which I would then use to find the path taken: [(3, 1), (1, 0), (0, 0), (0, 0)]. The problem is that I am not sure how to do this in O(log n) parallel time.
Also, the paper mentions extensions to the algorithm for tokenizing, and I am not quite sure I understand the idea. I think I want to create a table that maps a transition (state_type, state_type) to a set of terminals? And then somehow solve for which of the overlapping terminals is prioritized.
[1] Hill, J.M.D., Parallel lexical analysis and parsing on the AMT distributed array processor, Parallel Computing 18 (1992) 699-714.
I'll read the paper and get back to you, but it might take a few days.
Awesome thank you! I have plenty of time.
Btw, about this problem: what I have done for now is use a sequential version.
def path [n] (trans: [n][G.transitions_size]transition_type) : [n]transition_type =
  (loop result = (replicate n (-1, -1), G.initial_state) for i < n do
    let old_arr = result.0
    let old_st = result.1
    let new_arr = old_arr with [i] = trans[i][old_st]
    let new_st = new_arr[i].1
    in (new_arr, new_st)
  ) |> (.0)
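For testing, the loop above can be transliterated directly into a sequential oracle (an illustrative Python sketch, assuming trans[i][s] is the (from, to) transition taken from state s at input position i), which a parallel version can then be checked against:

```python
# Illustrative Python transliteration of the sequential Futhark loop above,
# usable as an oracle when testing a parallel version. trans[i][s] is
# assumed to be the (from, to) transition taken from state s at position i.

def path_seq(trans, initial_state):
    result = []
    st = initial_state
    for row in trans:
        result.append(row[st])   # record the transition taken
        st = row[st][1]          # move to its target state
    return result

# The table from the question, with initial state 3:
table = [[(0, 0), (1, 0), (2, 1), (3, 1), (4, 4)]] * 4
# path_seq(table, 3) yields [(3, 1), (1, 0), (0, 0), (0, 0)]
```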
Also, I think it is getting really close to working, but it probably is not that close yet, since I have not set up tests. I think the last part is finding a way to group the token indices together correctly.
For the sexp.cg grammar, the lexer currently outputs:
[1]> parser.lexer "((test))"
[3, 3, 2, 2, 2, 2, 4, 4]
But it should be:
[1]> parser.lexer "((test))"
[3, 3, 2, 4, 4]
I am slightly worried about that function, as it is sequentially applying transitions. To my knowledge, the usual way to parallelise such a thing is to map each input element to a transition function, and then do a reduction with function composition. This can be effective if you can find an efficient way to represent the transition function (it obviously cannot be a Futhark-level function). I used that technique in last year's Advent of Code: https://futhark-lang.org/blog/2022-12-10-case-study.html
However, I don't offhand see how the transition function won't end up being too large to be efficient.
Strange, I thought it would be easier, since they do not really elaborate on how to reduce it in the paper.
I think I have found a solution, and I think it uses the ideas you mentioned. I add a vector of initial states at the beginning of the table. Then I do a scan using function composition on the vectors of transitions. I believe this will always result in each row in the table becoming the same path, namely the one that starts at the initial state.
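To illustrate the scheme (a Python sketch, not the actual alpacc code): each input position carries a vector mapping every state to its successor, composition of such vectors is associative, and prepending a vector that sends every state to the initial state makes all components of the scanned vectors collapse onto the single path from the initial state:

```python
# Sketch of the scan-with-composition idea. t[s] is the state reached
# from state s on this input symbol; compose(a, b) applies a, then b.

def compose(a, b):
    return [b[s] for s in a]  # associative: plain function composition

def path(transitions, initial_state, num_states):
    # prepend a vector that collapses every state to the initial state
    vecs = [[initial_state] * num_states] + transitions
    acc = list(range(num_states))  # identity permutation (neutral element)
    out = []
    for v in vecs:                 # sequential stand-in for a parallel scan
        acc = compose(acc, v)
        out.append(acc)
    # every component of each scanned vector now agrees, so any one of
    # them is the state on the path starting at initial_state
    return [v[0] for v in out[1:]]

# Two-state DFA that flips state on every symbol, starting in state 0:
# path([[1, 0], [1, 0]], 0, 2) yields [1, 0]
```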
I think I have the parallel lexing down now; it looks like I can produce the correct sequence of terminals using the lexer function. I should also make it find the substring span of each terminal. And I also need to find a good way of testing it; maybe property-based testing is an idea again.
@athas can you give me some input on a design choice? I want to filter out some tokens, like the spaces in the case of the S-expression grammar. How should this be done in the .cg file? Or is there a better idea than doing this?
A simple solution would be to say that a terminal named whitespace is always removed before parsing.
I have run into something weird I think.
[34]> let x = parser.lexer "uuv" |> from_maybe ([],[]) |> (.0)
[35]> :t x
[0]u32
[36]> x
[2, 2, 3]
I will try and investigate this when I have more energy. source: parser_llp11_0.zip
@athas I think at this stage the lexer generator works. I have some tests in futhark-tests/lexer-tests with some overlapping-pattern tests that pass. I have added a flag --lexer which generates only the lexer with a lex entry that returns a [n][3]u64 array, where the first index in each [3]u64 array is the token type, the second is the index where the token starts, and the third is the index where the token ends.
Is this implementation what you would expect from a lexer generator? Also, when I solve for a path through the DFA, I do a scan on a 2D array of size (number of states) × (string input size). I have chosen to make this 2D array as small as possible by minimizing the number of DFA states. This results in the need for marking every edge with the set of tokens it could belong to. Using this, I do an intersection scan and find the most prioritized token in the sets (I think this is the proper way of solving it, but I should probably ask Torben). As an alternative, I could just not minimize the DFA and skip the intersection scan. I was wondering if you think this minimization of the DFA is worth it? I know the DFA can be exponentially large, but I do not know how likely that is.
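To make the intersection idea concrete, here is a small Python sketch (the terminal names and priorities are made up for the example, not taken from alpacc): the terminal sets along a token's span are intersected, and the most prioritized surviving terminal wins:

```python
# Hypothetical sketch of resolving a token from per-edge terminal sets.
# Intersecting the sets along the span keeps only terminals consistent
# with every edge; the lowest priority number is the most prioritized.

def resolve_token(edge_sets, priority):
    candidates = set.intersection(*edge_sets)
    return min(candidates, key=priority.__getitem__)

priority = {'keyword': 0, 'atom': 1}
# One edge rules out 'keyword', so the whole span lexes as 'atom':
assert resolve_token([{'atom', 'keyword'},
                      {'atom', 'keyword'},
                      {'atom'}], priority) == 'atom'
# If 'keyword' survives the intersection, it wins on priority:
assert resolve_token([{'atom', 'keyword'},
                      {'atom', 'keyword'}], priority) == 'keyword'
```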
Nevermind, I have just realized I have not tested this well enough, since probably some of the binary operations I use in the scans are not associative.
Make sure to also check whether the neutral elements are actually neutral. It's a very easy mistake to make.
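A cheap randomized check of both properties, sketched in Python (illustrative only; it uses the state-vector composition from this thread as the operator under test):

```python
import random

# Property check that an operator is associative and that a claimed
# neutral element really is neutral on both sides.

def compose(a, b):
    return [b[s] for s in a]  # composition of state vectors

def check_monoid(op, neutral, gen, trials=500):
    for _ in range(trials):
        a, b, c = gen(), gen(), gen()
        if op(op(a, b), c) != op(a, op(b, c)):
            return False  # not associative
        if op(neutral, a) != a or op(a, neutral) != a:
            return False  # not a two-sided identity
    return True

n = 5
rand_vec = lambda: [random.randrange(n) for _ in range(n)]
# state-vector composition with the identity permutation is a monoid
assert check_monoid(compose, list(range(n)), rand_vec)
```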
When using the OpenCL backend I can get the tests to pass now. I believe the problem was that the operation in the intersection scan was not associative. Now I just map an intersection reduce over the token spans, so in the best case it is O(1) and in the worst case O(log n) parallel time. Also, I do not know if this Futhark code is any good.
It is generally good practice to move complexity into a map you do before the scan or reduce. An operator with complex control flow is usually not associative (although exceptions exist).
On large input (e.g. maybe 100MiB), do you see any speedup with the opencl backend compared to the c backend?
@athas It does make a difference, but I hope the difference is good enough. Using the C backend the real time was 1m31.8s on a Ryzen 5 1600X, and using the OpenCL backend the real time was 0m18.9s on an RTX 3060.
I used this little script to generate some random s-expression tokens and used your txt2fut.py script to convert it.
import random
import string

def main():
    result = ''
    while len(result) < 100 * 2**20:
        i = random.randint(0, 3)
        if i == 0:
            result += ' ' * random.randint(1, 5)
        elif i == 1:
            result += ''.join(random.choices(string.ascii_uppercase +
                                             string.ascii_lowercase,
                                             k=random.randint(1, 20)))
        elif i == 2:
            result += ')'
        elif i == 3:
            result += '('
    with open('code.lisp', 'a') as fp:
        fp.write(result)

if __name__ == '__main__':
    main()
Also, the --lexer flag does not work at the moment for some reason.
Try with the cuda backend. It has much faster scans.
With the CUDA backend I got 0m14.3s. I am making a 1GiB file, and I should probably be using futhark bench instead of time(0).
Yes, although at those runtimes I'm not sure the difference is significant.
I cannot test the 1GiB file because it would use too much memory. I think I will have to find a way to save memory when I construct the (number of states) × (string input size) 2D array.
The main cost centre by far is the scan with compose_transition_vectors_identity. Scans generally work badly when the operator is complicated, and particularly when the element type is an array. This one almost works, because an operator that is just a map2 is generally quite efficient, but then you have that #nothing-based identity element. If you could figure out a way to encode the identity element such that the operator is a map2, that would likely be a lot faster.
Alternatively, using tuples instead of arrays might help, although with large tuples you may run into other problems. It's difficult to predict what would be best. Perhaps the builtin scan should not be used, and instead a hand-rolled work-efficient scan should be employed. Those are normally less efficient, but can be better when the operator contains parallelism.
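One way to realise this (sketched in Python, not Futhark): if each scan element is a total state vector, the identity permutation [0, 1, ..., n-1] is a genuine neutral element, so the #nothing case disappears and the operator becomes a single gather per element, exactly the simple map shape suggested:

```python
# Sketch: scanning state vectors with the identity permutation as the
# neutral element, so the operator needs no maybe/#nothing handling.

def compose(a, b):
    return [b[s] for s in a]  # one gather per element, no control flow

def scan_compose(vecs, num_states):
    acc = list(range(num_states))  # neutral element: identity permutation
    out = []
    for v in vecs:
        acc = compose(acc, v)
        out.append(acc)
    return out

e = list(range(3))
a = [2, 0, 1]
# the identity permutation is neutral on both sides
assert compose(e, a) == a and compose(a, e) == a
```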
I think I understand the ideas, kinda; I will look into it on Monday, maybe. Also, if you want to and have time for it, feel free to push changes to this branch! 😃
I believe there is no naturally occurring identity for combine_transitions, but I think I could simplify the combine function used in combine_transitions to:
type transition = state

def combine (_ : transition) (a : transition) : transition =
  a
Since in each transition x -> y the x is actually just the index, this helps on memory usage.
Something I am questioning is whether moving the identity inside the combine function in combine_transitions is an optimization:
def combine_transitions [n] (a : [n](maybe transition)) (b : [n](maybe transition)) : [n](maybe transition) =
  map2 (add_identity combine) a b
It is possible to change compose_transition_vectors_identity such that it does not need the boolean which specifies whether it is the final transition. The change would make the function use only function composition, so a neutral element can be found, which is just (0..L.number_of_states). This requires changing the regular expression. Currently the regular expression used for sexp.cg as an example is \(|\)|[a-z]+|(\s|\n|\t); here, every time a final state is reached in the DFA (the end boolean), the DFA walk is restarted. The changed regular expression would be (\(|\)|[a-z]+|(\s|\n|\t))*, and then the accepting states should be solved for after the walk. I had implemented something like this before; it worked every time in some small tests, but I do not know if it actually works in general.
Yes, that would be better. If at all possible, reduction and scan operators should be on scalars (or tuples of scalars), or of the form map2 f where f is then on scalars or tuples of scalars.
I simplified the scans quite a bit, but it only resulted in a 2 second speedup using the CUDA backend. I am guessing compose_transition_vectors should probably be using tuples? Current compose_transition_vectors implementation:
def compose_transition_vectors (a : state_vector) (b : state_vector) : state_vector =
  map (\s -> b[state_module.to_i64 s]) a
I have tried this now. It did not result in better speeds; it seems to have gotten a second slower. If I run it a few times and pick the best time with the CUDA backend on the 100MiB dataset, I get some better times, so it might be closer to a 3-4 second improvement. The best time I could get:
real 0m10.286s
user 0m8.530s
sys 0m1.320s
I think some of the maps could be fused into one function; that might also help.
Use futhark bench for benchmarking. Don't roll your own.
I think a memory optimization I could do is packing the vector of states into a single u64, since each state can most likely be expressed using a u8.
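A sketch of that packing in Python (assuming, for illustration, at most 8 states stored as 8-bit fields; the real field width would depend on the generated DFA):

```python
# Hypothetical packing of a small state vector into a single 64-bit word,
# with composition done directly on the packed representation.

BITS = 8                 # assumed field width per state
MASK = (1 << BITS) - 1

def pack(vec):
    x = 0
    for i, s in enumerate(vec):
        x |= s << (i * BITS)
    return x

def get(x, i):
    return (x >> (i * BITS)) & MASK

def compose_packed(a, b, n):
    # result[i] = b[a[i]], mirroring compose_transition_vectors
    x = 0
    for i in range(n):
        x |= get(b, get(a, i)) << (i * BITS)
    return x

a, b = [1, 2, 0], [2, 0, 1]
assert compose_packed(pack(a), pack(b), 3) == pack([b[s] for s in a])
```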
Last time I tried to use this, it took quite a long time. Would it be fine to downsize the dataset?
Sure, but it takes a long time because it tries to do a good job.
Now I feel bad for being an impatient little prick.
I think my code needs to be more memory efficient, since I sometimes get an out-of-memory error when running it. I have 12GiB of VRAM, so I am quite sure my code is bad. Also, when I try to run futhark bench, the benchmark is killed too; I have also tried running the benchmark on a Titan RTX, and it is killed there as well. slurm says:
some of your processes may have been killed by the cgroup out-of-memory handler.
So I am guessing the benchmark dies here also because of lack of memory.
Also the A40 deals with it as well as the Titan RTX.
Nevermind, I am being dumb; the benchmark being killed is related to RAM usage... I used to get a CUDA error saying something about not being able to allocate memory on "device", so I thought the error was related to VRAM.
Some dumb mistakes later... here are the CUDA benchmark results @athas .
Compiling sexp.fut...
Reporting arithmetic mean runtime of at least 10 runs for each dataset (min 0.5s).
More runs automatically performed for up to 300s to ensure accurate measurement.
sexp.fut (no tuning file):
large.in: 84623μs (95% CI: [ 84544.7, 84697.6])
Edit 0:
Wait, or maybe this is wrong; do I have to specify the output? Also, this was done on an A40.
Edit 1:
Argh, even if I specify the output, it does not make a difference. It is surprisingly fast; I get why I should not use time(0).
Using the 1GiB dataset on an A40 I get:
Compiling sexp.fut...
Reporting arithmetic mean runtime of at least 10 runs for each dataset (min 0.5s).
More runs automatically performed for up to 300s to ensure accurate measurement.
sexp.fut (no tuning file):
huge.in: 3061740μs (95% CI: [ 3059687.4, 3064417.4])
I think I have some decent lexer tests currently but I guess they could be better.
How big is huge.in?
It should be a little over 1GiB
So 300MiB/s? I must admit I have little intuition for whether that is fast, as far as parsing goes. Probably not, but there is still work to do in speeding this up.
It probably is not that great if the logos benchmarks are comparable.
I lost the original dataset (which is not in the Futhark data format), so I generated a new 1GiB file. I used this to compare with the logos Rust lexer library, which took 4562053μs to do the lexing. I made sure to use optimization level 3 when making the release build for the test. I am no Rust expert, but I hope this is a fair comparison.
use std::fmt;
use std::io::Read;
use logos::Logos;
use std::fs::File;
use std::time::SystemTime;

#[derive(Logos, Debug, PartialEq)]
#[logos(skip r"[ \t\n]+")] // Ignore this regex pattern between tokens
enum Token {
    #[regex("[a-zA-Z]+")]
    Atom,
    #[token("(")]
    LParen,
    #[token(")")]
    RParen,
}

impl fmt::Display for Token {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        let token: &str = match self {
            Token::Atom => "Atom",
            Token::LParen => "LParen",
            Token::RParen => "RParen",
        };
        write!(f, "{}", token)
    }
}

fn main() {
    let mut file = File::open("raw-huge.in").unwrap();
    let mut data = vec![];
    let _ = file.read_to_end(&mut data);
    let s = String::from_utf8(data).unwrap();
    let mut lex = Token::lexer(&s);
    let mut result = Vec::new();
    let now = SystemTime::now();
    let mut step = lex.next();
    while step != None {
        match step.unwrap() {
            Ok(token) => {
                result.push((token, lex.span()));
            }
            Err(e) => {
                println!("Error: {e:?}");
                break;
            }
        }
        step = lex.next()
    }
    // println!("{:?}", result);
    match now.elapsed() {
        Ok(elapsed) => {
            println!("Time elapsed: {}μs", elapsed.as_micros());
        }
        Err(e) => {
            println!("Error: {e:?}");
        }
    }
}
In my glorious adventure of trying to optimize my code, I found some errors while changing the lex function.
Compiling sexp.fut...
Reporting arithmetic mean runtime of at least 10 runs for each dataset (min 0.5s).
More runs automatically performed for up to 300s to ensure accurate measurement.
sexp.fut:lex (no tuning file):
huge.in:
Unknown error. This is a compiler bug.
Failed to run sexp.fut
./sexp: sexp.c:8031: CUDA call
gpu_free_all(ctx)
failed with error code 1 (invalid argument)
CallStack (from HasCallStack):
error, called at src/Futhark/Server.hs:187:7 in futhark-server-1.2.2.1-6bIJRa1PKd9DOkMNzFraOf:Futhark.Server
No such problem happens with the OpenCL backend currently.
The OpenCL benchmark:
huge.in: 2641877μs (95% CI: [ 2576615.3, 2700206.8])
It is about 0.4s faster so far.
Also, I can try to narrow down what code causes the bug, maybe on Saturday.
I also get this error.
Compiling sexp.fut...
Reporting arithmetic mean runtime of at least 10 runs for each dataset (min 0.5s).
More runs automatically performed for up to 300s to ensure accurate measurement.
sexp.fut:lex (no tuning file):
large.in:
sexp.c:8084: CUDA call
cuMemcpyDtoH_v2(dst, src + offset, size)
failed with error code 700 (an illegal memory access was encountered)
Failed to run sexp.fut
./sexp: sexp.c:8031: CUDA call
gpu_free_all(ctx)
failed with error code 1 (invalid argument)
CallStack (from HasCallStack):
error, called at src/Futhark/Server.hs:187:7 in futhark-server-1.2.2.1-6bIJRa1PKd9DOkMNzFraOf:Futhark.Server
Currently only the regular expression can be parsed; the NFA will need to be constructed next.