brave / brave-browser

Brave browser for Android, iOS, Linux, macOS, Windows.
https://brave.com
Mozilla Public License 2.0
17.74k stars 2.32k forks source link

Add PageGraph Structure Documentation #41343

Open FHantke opened 2 weeks ago

FHantke commented 2 weeks ago

Platforms

all

Description

Hi, not sure where to put this so I selected a "Feature request". I added some documentation about the pagegraph structure to the wiki which, I believe, is very helpful for someone analyzing the pagegraph for the first time. The documentation might be incomplete and miss new added features of the pagegraph, but can deal as a good initial introduction.

https://github.com/FHantke/brave-browser-wiki/blob/main/PageGraph-Structure.md

CC @pes10k

pes10k commented 5 days ago

Hi @FHantke ,

Just wanted to echo the conversation we had on Signal here. First, thank you for preparing all this! Second though, Im a little unsure about this, bc I dont want to make any promises that the graph structure will be fixed across versions, and (more importantly) if folks start trying to deal with the graphs directly, they're going to make a lot of mistakes.

I'd rather encourage folks to use something like pagegraph-query which handles a bunch of the correct-but-unexpected behaviors that browsers do, but which will confuse a lot of efforts to parse the graphs directly.

However, i think the documentation work you've done is terrific, and important to add and get committed. Instead of committing it as independent documentation of the graph though, what do you think of adding it as documentation of the PageGraph C++ code then? There will be one-to-one mappings between PageGraph edge and node types and C++ classes in brave-core, and PageGraph edge and node attributes with C++ properties in brave-core.

For example, HTML Element nodes in PageGraph graphs are implemented with the brave_page_graph::NodeHTMLElement class, and the script id attributes for Script nodes in the GraphML are controlled by brave_page_graph::NodeScript::script_id_ properties.

Anyway, so suggestion is to move the documentation you've written up in to the C++ source, to make it clear that the implementation is being documented, and folks who want to query the graphs would be better served by using a higher level library that abstracts away the graph's complexities. Curious what you think?