DeNepo / corpus-analysis-notes

Different artifacts #3

Open lpmi-13 opened 2 years ago

lpmi-13 commented 2 years ago

We probably want to be able to deterministically output multiple views over the raw data, rather than mutating it, so that we don't need to download an entirely new copy every time we run analysis.

With this in mind, some of the outputs (whether local flat files or database entries) could be:

colevandersWands commented 2 years ago

the counting-things PR i'll send today already avoids mutating the data: it creates a report of how many times different things show up in each directory and each file. So kind of like a lookup table.
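
A minimal sketch of what a non-mutating count report like this could look like. Everything here is an assumption for illustration, not the PR's actual code: the `PATTERNS` table, the `build_report` name, the `.js` filter and the report shape (`files` and `directories` keyed by absolute path) are all hypothetical, and plain regex counts stand in for real AST-node counting.

```python
import re
from collections import Counter
from pathlib import Path

# Hypothetical "things" to count; a real report would count AST nodes,
# comments, directives, etc. using a proper parser.
PATTERNS = {
    "line_comment": re.compile(r"//"),
    "use_strict": re.compile(r"use strict"),
}


def count_file(path):
    """Count each pattern in one file. The file is only read, never written."""
    text = path.read_text(encoding="utf-8", errors="replace")
    return Counter({name: len(rx.findall(text)) for name, rx in PATTERNS.items()})


def build_report(root):
    """Build a lookup table of counts per file and per directory under root."""
    root = Path(root).resolve()
    files = {}
    directories = {}
    # sorted() keeps the traversal order stable from run to run
    for path in sorted(root.rglob("*.js")):
        if not path.is_file():
            continue
        counts = count_file(path)
        files[str(path)] = dict(counts)
        # roll the file's counts up into every ancestor directory, stopping at root
        for parent in path.parents:
            directories.setdefault(str(parent), Counter()).update(counts)
            if parent == root:
                break
    return {
        "files": files,
        "directories": {d: dict(c) for d, c in directories.items()},
    }
```

Serializing the result with `json.dumps(report, indent=2, sort_keys=True)` gives the same bytes for the same input, so the report can be regenerated at any time without touching or re-downloading the raw data.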

since the report includes absolute paths you can search it for files that include the AST nodes (or comments, or directives, ...) that you're interested in, then access the file using the path, then do any sort of extraction or analysis you like
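
The search-then-open workflow described above could be sketched roughly like this. The report shape (a `files` mapping from absolute path to per-thing counts) and the helper names `files_with` and `matching_lines` are assumptions made up for the example; the real report format may differ.

```python
from pathlib import Path


def files_with(report, thing, minimum=1):
    """Return absolute paths of files whose count for `thing` is at least `minimum`."""
    return sorted(
        path
        for path, counts in report["files"].items()
        if counts.get(thing, 0) >= minimum
    )


def matching_lines(path, needle):
    """Open a file found via the report and pull out the lines containing `needle`."""
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    return [line for line in text.splitlines() if needle in line]


def extract(report, thing, needle):
    """Look up candidate files in the report, then extract from each one (read-only)."""
    return {path: matching_lines(path, needle) for path in files_with(report, thing)}
```

The report acts only as an index here: the raw files are read through their paths and never modified, so any number of extractions can run against the same copy of the data.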

is this the sort of thing you had in mind? everything you mention up there is already counted and referenced, but it doesn't count structural things like nesting yet

lpmi-13 commented 2 years ago

the above list was mostly just to capture some ideas, I'll get on that PR soon(ish)!