flowr-analysis / flowr

A program slicer and dataflow analyzer for the R programming language.
https://github.com/flowr-analysis/flowr/wiki
GNU General Public License v3.0

Drop xmlparsedata and replace it with getParseData #653

Closed EagleoutIce closed 7 months ago

EagleoutIce commented 7 months ago

Like the R languageserver, we originally used xmlparsedata to retrieve the AST, roughly like this:

xmlparsedata::xml_parse_data(parse(text="...."), pretty=FALSE)

Yet, this requires 1) xmlparsedata to be present and 2) taking care of xmlparsedata::xml_parse_token_map to revert the automatic replacements.

We now want to change to the following:

getParseData(parse(text="2 * x"))

Using write.table like this:

write.table(getParseData(parse(text="2 * x")),sep=",",col.names=TRUE)

This produces something akin to this:

"line1","col1","line2","col2","id","parent","token","terminal","text"
"7",1,1,1,5,7,0,"expr",FALSE,""
"1",1,1,1,1,1,2,"NUM_CONST",TRUE,"2"
"2",1,1,1,1,2,7,"expr",FALSE,""
"3",1,3,1,3,3,7,"'*'",TRUE,"*"
"4",1,5,1,5,4,6,"SYMBOL",TRUE,"x"
"6",1,5,1,5,6,7,"expr",FALSE,""

[!IMPORTANT] Look at "line1", which is wrongly mapped to "7", the id. For currently unknown reasons, the name for the first column (header) is missing :c

Using this, you can use a function like parseCSV (already in use for the tokenMap retrieval) to retrieve the AST table within TypeScript.
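A minimal sketch of reading that output in TypeScript follows. It is not flowR's parseCSV; `ParseDataRow`, `splitCsvLine` and `parseParseData` are hypothetical names introduced for illustration. It assumes one record per line and that quotes inside a field are escaped as `\"`. The shifted header is consistent with write.table's documented behavior: row names are written as a leading field without a column name (unless `col.names=NA` or `row.names=FALSE` is passed), so the sketch simply aligns the header with the trailing fields of each row.

```typescript
// Hypothetical row shape; the field names follow the header printed above.
interface ParseDataRow {
  line1: number; col1: number; line2: number; col2: number;
  id: number; parent: number;
  token: string; terminal: boolean; text: string;
}

// Split one CSV line into fields, honoring double-quoted fields.
// Assumption: a quote inside a field is escaped as \" and fields contain no newlines.
function splitCsvLine(line: string): string[] {
  const fields: string[] = [];
  let current = '';
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const c = line[i];
    if (inQuotes && c === '\\' && line[i + 1] === '"') {
      current += '"';
      i++; // skip the escaped quote
    } else if (c === '"') {
      inQuotes = !inQuotes;
    } else if (c === ',' && !inQuotes) {
      fields.push(current);
      current = '';
    } else {
      current += c;
    }
  }
  fields.push(current);
  return fields;
}

function parseParseData(csv: string): ParseDataRow[] {
  const lines = csv.split('\n').filter(l => l.trim() !== '');
  const header = splitCsvLine(lines[0]);
  return lines.slice(1).map(line => {
    const fields = splitCsvLine(line);
    // Data rows carry the unnamed row name as an extra leading field,
    // so only the trailing header.length fields line up with the header.
    const values = fields.slice(fields.length - header.length);
    const get = (name: string): string => values[header.indexOf(name)];
    return {
      line1: Number(get('line1')), col1: Number(get('col1')),
      line2: Number(get('line2')), col2: Number(get('col2')),
      id: Number(get('id')), parent: Number(get('parent')),
      token: get('token'),
      terminal: get('terminal') === 'TRUE',
      text: get('text')
    };
  });
}
```

Aligning from the right keeps the sketch working both for the output shown above and for output produced with `row.names=FALSE`, where header and rows have the same length.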

When creating the tree, however, please keep in mind that you have to handle the ordering of tokens based on their location, not on the order within the table. To speed things up you may use something similar to xmlparsedata's start and end tags (they appear to work fine).
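The ordering requirement can be sketched as follows, assuming the table has already been parsed into objects. `LocatedEntry`, `ParseTreeNode`, `buildTree` and `leafTexts` are hypothetical names, not flowR's API. Nodes are linked through `parent`, and siblings are then sorted by their start position (with the end position as a tie-breaker) rather than by their order in the table.

```typescript
// Hypothetical shapes for illustration.
interface LocatedEntry {
  id: number; parent: number;
  line1: number; col1: number; line2: number; col2: number;
  token: string; text: string;
}
interface ParseTreeNode { entry: LocatedEntry; children: ParseTreeNode[] }

// Order two nodes by start position, then by end position.
function byLocation(a: ParseTreeNode, b: ParseTreeNode): number {
  return a.entry.line1 - b.entry.line1 || a.entry.col1 - b.entry.col1
      || a.entry.line2 - b.entry.line2 || a.entry.col2 - b.entry.col2;
}

function buildTree(entries: LocatedEntry[]): ParseTreeNode[] {
  const nodes = new Map<number, ParseTreeNode>();
  for (const entry of entries) {
    nodes.set(entry.id, { entry, children: [] });
  }
  const roots: ParseTreeNode[] = [];
  for (const entry of entries) {
    const node = nodes.get(entry.id) as ParseTreeNode;
    const parent = nodes.get(entry.parent);
    // An entry whose parent id is not in the table (0 in the sample) is a root.
    if (parent === undefined) {
      roots.push(node);
    } else {
      parent.children.push(node);
    }
  }
  // Sort siblings by source location, independent of the table order.
  for (const entry of entries) {
    (nodes.get(entry.id) as ParseTreeNode).children.sort(byLocation);
  }
  roots.sort(byLocation);
  return roots;
}

// Collect the text of all leaves from left to right.
function leafTexts(node: ParseTreeNode): string[] {
  if (node.children.length === 0) {
    return [node.entry.text];
  }
  const texts: string[] = [];
  for (const child of node.children) {
    texts.push(...leafTexts(child));
  }
  return texts;
}
```

For the `2 * x` table above, the single root is the `expr` with id 7, and its leaves read `2`, `*`, `x` regardless of the order in which the rows arrive.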

Please keep the ids alive so that we

  1. obtain a mapping from R's tokens to what we normalize
  2. can replace the location in normalized tokens with a set of ids to these R tokens (reducing the memory footprint of trees significantly)
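One way to picture the second point is the sketch below, where a normalized node stores only the ids of the R tokens it was built from, and its location is derived on demand from a shared id-to-token map. `RToken`, `SlimNode`, `locationOf` and `lexemeOf` are hypothetical names chosen for this illustration and do not reflect flowR's actual types.

```typescript
// Hypothetical shapes for illustration.
interface SourceRange { line1: number; col1: number; line2: number; col2: number }
interface RToken extends SourceRange { id: number; text: string }

// A normalized node that stores R token ids instead of its own location.
interface SlimNode { type: string; rIds: readonly number[]; children: SlimNode[] }

// Derive the location as the smallest range covering all referenced R tokens.
function locationOf(node: SlimNode, tokens: Map<number, RToken>): SourceRange | undefined {
  let range: SourceRange | undefined;
  for (const id of node.rIds) {
    const t = tokens.get(id);
    if (t === undefined) {
      continue; // unknown id: ignore it in this sketch
    }
    if (range === undefined) {
      range = { line1: t.line1, col1: t.col1, line2: t.line2, col2: t.col2 };
      continue;
    }
    if (t.line1 < range.line1 || (t.line1 === range.line1 && t.col1 < range.col1)) {
      range.line1 = t.line1;
      range.col1 = t.col1;
    }
    if (t.line2 > range.line2 || (t.line2 === range.line2 && t.col2 > range.col2)) {
      range.line2 = t.line2;
      range.col2 = t.col2;
    }
  }
  return range;
}

// The token text can be recovered through the same ids.
function lexemeOf(node: SlimNode, tokens: Map<number, RToken>): string {
  return node.rIds
    .map(id => tokens.get(id))
    .filter((t): t is RToken => t !== undefined)
    .map(t => t.text)
    .join(' ');
}
```

The trade-off in this sketch is a map lookup per query in exchange for not duplicating four numbers (and possibly the text) on every normalized node.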

Furthermore, we may consider optimizing some things during the re-parsing already (e.g., we could drop expr layers, map multiplications, etc.), but this may be of interest for future issues (important issues :3).
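As a rough idea of what dropping expr layers could look like, the sketch below collapses every `expr` node that wraps exactly one child and remembers the ids of the dropped wrappers so that the R ids stay alive. `AstNode`, `collapsedIds` and `dropSingleChildExpr` are hypothetical names, and the rule "collapse only single-child expr nodes" is an assumption made for this illustration, not a decided design.

```typescript
// Hypothetical node shape for illustration.
interface AstNode {
  id: number;
  token: string;
  text: string;
  children: AstNode[];
  // ids of expr wrappers that were collapsed into this node
  collapsedIds?: number[];
}

// Replace every expr node that has exactly one child by that child.
function dropSingleChildExpr(node: AstNode): AstNode {
  const children = node.children.map(dropSingleChildExpr);
  if (node.token === 'expr' && children.length === 1) {
    const only = children[0];
    // Keep the wrapper's id so the mapping to R's tokens is not lost.
    return { ...only, collapsedIds: [node.id, ...(only.collapsedIds || [])] };
  }
  return { ...node, children };
}
```

Applied to the `2 * x` tree, the root `expr` keeps its three children, but they become `NUM_CONST`, `'*'` and `SYMBOL` directly, with the ids 2 and 6 recorded on the collapsed leaves.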

Please implement it so that we can use the new behavior with the RShell and the RShellExecutor.

EagleoutIce commented 7 months ago

Maybe a speed comparison with the old xmlparsedata would be interesting ;) (+ we should include Unicode tests from the failed MSR suite).

EagleoutIce commented 7 months ago

Additionally, we can kill the lexeme, as it is linked to the R id as well!