AnacletoLAB / grape

🍇 GRAPE is a Rust/Python Graph Representation Learning library for Predictions and Evaluations
MIT License
539 stars 38 forks source link

Grape documentation #31

Open fxd24 opened 1 year ago

fxd24 commented 1 year ago

Hi, I've been exploring this repository to work on graphs. The report in text format of the graphs is very useful, anyhow I find it difficult to find the metadata of the graph, except the weight. The documentation also does not make it clear, because it provides basically a description of the function name and the data types, but not what these data types represent and what is the intuition behind.

One example is a when a Graph is created with the parameter additional_graph_kwargs. Where do I find which kwargs I can set?

There are other examples that are not only regarding the dataset, but graph processing in general. It is not clear what the intuition behind certain functions is.

Is there any plan on improving the documentation so that it is more likely that users use your framework?

LucaCappelletti94 commented 1 year ago

Hello @fxd24, when loading a graph, you can use the Graph.from_csv, which is documented in this tutorial.

The documentation website is admittedly behind the documentation integrated within the package, as it is very hard to create automatically documentation for packages based on Rust bindings such as this one. I suggest you use the documentation integrated within the package, as explained below.

If you have doubts about the integrated documentation, please just point them out, and I will iterate on them to try and make it more straightforward.

Could you clarify what do you mean by I find it difficult to find the metadata of the graph, except the weight.?

If you are referring to the additional kwargs that can be provided to the retrievable datasets, those are the same as the from CSV. I cannot actually document them in the docstring because any additional character I add in those docstrings is multiplied by the number of graphs, leading to a considerable increase in wheel size.

For any question, I am available on the discord server for a chat.

For all methods, if you run:

from grape import Graph
help(Graph.from_csv)

you will get:

Help on built-in function from_csv:

from_csv(directed, node_type_path, node_type_list_separator, node_types_column_number, node_types_column, node_types_ids_column_number, node_types_ids_column, node_types_number, numeric_node_type_ids, minimum_node_type_id, node_type_list_header, node_type_list_support_balanced_quotes, node_type_list_rows_to_skip, node_type_list_is_correct, node_type_list_max_rows_number, node_type_list_comment_symbol, load_node_type_list_in_parallel, node_path, node_list_separator, node_list_header, node_list_support_balanced_quotes, node_list_rows_to_skip, node_list_is_correct, node_list_max_rows_number, node_list_comment_symbol, default_node_type, nodes_column_number, nodes_column, node_types_separator, node_list_node_types_column_number, node_list_node_types_column, node_ids_column, node_ids_column_number, nodes_number, minimum_node_id, numeric_node_ids, node_list_numeric_node_type_ids, skip_node_types_if_unavailable, load_node_list_in_parallel, edge_type_path, edge_types_column_number, edge_types_column, edge_types_ids_column_number, edge_types_ids_column, edge_types_number, numeric_edge_type_ids, minimum_edge_type_id, edge_type_list_separator, edge_type_list_header, edge_type_list_support_balanced_quotes, edge_type_list_rows_to_skip, edge_type_list_is_correct, edge_type_list_max_rows_number, edge_type_list_comment_symbol, load_edge_type_list_in_parallel, edge_path, edge_list_separator, edge_list_header, edge_list_support_balanced_quotes, edge_list_rows_to_skip, sources_column_number, sources_column, destinations_column_number, destinations_column, edge_list_edge_types_column_number, edge_list_edge_types_column, default_edge_type, weights_column_number, weights_column, default_weight, edge_ids_column, edge_ids_column_number, edge_list_numeric_edge_type_ids, edge_list_numeric_node_ids, skip_weights_if_unavailable, skip_edge_types_if_unavailable, edge_list_is_complete, edge_list_may_contain_duplicates, edge_list_is_sorted, edge_list_is_correct, edge_list_max_rows_number, edge_list_comment_symbol, edges_number, load_edge_list_in_parallel, verbose, may_have_singletons, may_have_singleton_with_selfloops, name)
    Return graph renderized from given CSVs or TSVs-like files.

    Parameters
    ----------
    node_type_path: Optional[str]
        The path to the file with the unique node type names.
    node_type_list_separator: Optional[str]
        The separator to use for the node types file. Note that if this is not provided, one will be automatically detected among the following`: comma, semi-column, tab and space.
    node_types_column_number: Optional[int]
        The number of the column of the node types file from where to load the node types.
    node_types_column: Optional[str]
        The name of the column of the node types file from where to load the node types.
    node_types_number: Optional[int]
        The number of the unique node types. This will be used in order to allocate the correct size for the data structure.
    numeric_node_type_ids: Optional[bool]
        Whether the node type names should be loaded as numeric values, i.e. casted from string to a numeric representation.
    minimum_node_type_id: Optional[int]
        The minimum node type ID to be used when using numeric node type IDs.
    node_type_list_header: Optional[bool]
        Whether the node type file has an header.
    node_type_list_support_balanced_quotes: Optional[bool]
        Whether to support balanced quotes.
    node_type_list_rows_to_skip: Optional[int]
        The number of lines to skip in the node types file`: the header is already skipped if it has been specified that the file has an header.
    node_type_list_is_correct: Optional[bool]
        Whether the node types file can be assumed to be correct, i.e. does not have something wrong in it. If this parameter is passed as true on a malformed file, the constructor will crash.
    node_type_list_max_rows_number: Optional[int]
        The maximum number of lines to be loaded from the node types file.
    node_type_list_comment_symbol: Optional[str]
        The comment symbol to skip lines in the node types file. Lines starting with this symbol will be skipped.
    load_node_type_list_in_parallel: Optional[bool]
        Whether to load the node type list in parallel. Note that when loading in parallel, the internal order of the node type IDs may result changed across different iterations. We are working to get this to be stable.
    node_path: Optional[str]
        The path to the file with the unique node names.
    node_list_separator: Optional[str]
        The separator to use for the nodes file. Note that if this is not provided, one will be automatically detected among the following`: comma, semi-column, tab and space.
    node_list_header: Optional[bool]
        Whether the nodes file has an header.
    node_list_support_balanced_quotes: Optional[bool]
        Whether to support balanced quotes.
    node_list_rows_to_skip: Optional[int]
        Number of rows to skip in the node list file.
    node_list_is_correct: Optional[bool]
        Whether the nodes file can be assumed to be correct, i.e. does not have something wrong in it. If this parameter is passed as true on a malformed file, the constructor will crash.
    node_list_max_rows_number: Optional[int]
        The maximum number of lines to be loaded from the nodes file.
    node_list_comment_symbol: Optional[str]
        The comment symbol to skip lines in the nodes file. Lines starting with this symbol will be skipped.
    default_node_type: Optional[str]
        The node type to be used when the node type for a given node in the node file is None.
    nodes_column_number: Optional[int]
        The number of the column of the node file from where to load the node names.
    nodes_column: Optional[str]
        The name of the column of the node file from where to load the node names.
    node_types_separator: Optional[str]
        The node types separator.
    node_list_node_types_column_number: Optional[int]
        The number of the column of the node file from where to load the node types.
    node_list_node_types_column: Optional[str]
        The name of the column of the node file from where to load the node types.
    node_ids_column: Optional[str]
        The name of the column of the node file from where to load the node IDs.
    node_ids_column_number: Optional[int]
        The number of the column of the node file from where to load the node IDs
    nodes_number: Optional[int]
        The expected number of nodes. Note that this must be the EXACT number of nodes in the graph.
    minimum_node_id: Optional[int]
        The minimum node ID to be used, when loading the node IDs as numerical.
    numeric_node_ids: Optional[bool]
        Whether to load the numeric node IDs as numeric.
    node_list_numeric_node_type_ids: Optional[bool]
        Whether to load the node types IDs in the node file to be numeric.
    skip_node_types_if_unavailable: Optional[bool]
        Whether to skip the node types without raising an error if these are unavailable.
    load_node_list_in_parallel: Optional[bool]
        Whether to load the node list in parallel. When loading in parallel, without node IDs, the nodes may not be loaded in a deterministic order.
    edge_type_path: Optional[str]
        The path to the file with the unique edge type names.
    edge_types_column_number: Optional[int]
        The number of the column of the edge types file from where to load the edge types.
    edge_types_column: Optional[str]
        The name of the column of the edge types file from where to load the edge types.
    edge_types_number: Optional[int]
        The number of the unique edge types. This will be used in order to allocate the correct size for the data structure.
    numeric_edge_type_ids: Optional[bool]
        Whether the edge type names should be loaded as numeric values, i.e. casted from string to a numeric representation.
    minimum_edge_type_id: Optional[int]
        The minimum edge type ID to be used when using numeric edge type IDs.
    edge_type_list_separator: Optional[str]
        The separator to use for the edge type list. Note that, if None is provided, one will be attempted to be detected automatically between ';', ',', tab or space.
    edge_type_list_header: Optional[bool]
        Whether the edge type file has an header.
    edge_type_list_support_balanced_quotes: Optional[bool]
        Whether to support balanced quotes while reading the edge type list.
    edge_type_list_rows_to_skip: Optional[int]
        Number of rows to skip in the edge type list file.
    edge_type_list_is_correct: Optional[bool]
        Whether the edge types file can be assumed to be correct, i.e. does not have something wrong in it. If this parameter is passed as true on a malformed file, the constructor will crash.
    edge_type_list_max_rows_number: Optional[int]
        The maximum number of lines to be loaded from the edge types file.
    edge_type_list_comment_symbol: Optional[str]
        The comment symbol to skip lines in the edge types file. Lines starting with this symbol will be skipped.
    load_edge_type_list_in_parallel: Optional[bool]
        Whether to load the edge type list in parallel. When loading in parallel, without edge type IDs, the edge types may not be loaded in a deterministic order.
    edge_path: Optional[str]
        The path to the file with the edge list.
    edge_list_separator: Optional[str]
        The separator to use for the edge list. Note that, if None is provided, one will be attempted to be detected automatically between ';', ',', tab or space.
    edge_list_header: Optional[bool]
        Whether the edges file has an header.
    edge_list_support_balanced_quotes: Optional[bool]
        Whether to support balanced quotes while reading the edge list.
    edge_list_rows_to_skip: Optional[int]
        Number of rows to skip in the edge list file.
    sources_column_number: Optional[int]
        The number of the column of the edges file from where to load the source nodes.
    sources_column: Optional[str]
        The name of the column of the edges file from where to load the source nodes.
    destinations_column_number: Optional[int]
        The number of the column of the edges file from where to load the destinaton nodes.
    destinations_column: Optional[str]
        The name of the column of the edges file from where to load the destinaton nodes.
    edge_list_edge_types_column_number: Optional[int]
        The number of the column of the edges file from where to load the edge types.
    edge_list_edge_types_column: Optional[str]
        The name of the column of the edges file from where to load the edge types.
    default_edge_type: Optional[str]
        The edge type to be used when the edge type for a given edge in the edge file is None.
    weights_column_number: Optional[int]
        The number of the column of the edges file from where to load the edge weights.
    weights_column: Optional[str]
        The name of the column of the edges file from where to load the edge weights.
    default_weight: Optional[float]
        The edge weight to be used when the edge weight for a given edge in the edge file is None.
    edge_ids_column: Optional[str]
        The name of the column of the edges file from where to load the edge IDs.
    edge_ids_column_number: Optional[int]
        The number of the column of the edges file from where to load the edge IDs.
    edge_list_numeric_edge_type_ids: Optional[bool]
        Whether to load the edge type IDs as numeric from the edge list.
    edge_list_numeric_node_ids: Optional[bool]
        Whether to load the edge node IDs as numeric from the edge list.
    skip_weights_if_unavailable: Optional[bool]
        Whether to skip the weights without raising an error if these are unavailable.
    skip_edge_types_if_unavailable: Optional[bool]
        Whether to skip the edge types without raising an error if these are unavailable.
    edge_list_is_complete: Optional[bool]
        Whether to consider the edge list as complete, i.e. the edges are presented in both directions when loading an undirected graph.
    edge_list_may_contain_duplicates: Optional[bool]
        Whether the edge list may contain duplicates. If the edge list surely DOES NOT contain duplicates, a validation step may be skipped. By default, it is assumed that the edge list may contain duplicates.
    edge_list_is_sorted: Optional[bool]
        Whether the edge list is sorted. Note that a sorted edge list has the minimal memory peak, but requires the nodes number and the edges number.
    edge_list_is_correct: Optional[bool]
        Whether the edges file can be assumed to be correct, i.e. does not have something wrong in it. If this parameter is passed as true on a malformed file, the constructor will crash.
    edge_list_max_rows_number: Optional[int]
        The maximum number of lines to be loaded from the edges file.
    edge_list_comment_symbol: Optional[str]
        The comment symbol to skip lines in the edges file. Lines starting with this symbol will be skipped.
    edges_number: Optional[int]
        The expected number of edges. Note that this must be the EXACT number of edges in the graph.
    load_edge_list_in_parallel: Optional[bool]
        Whether to load the edge list in parallel. Note that, if the edge IDs indices are not given, it is NOT possible to load a sorted edge list. Similarly, when loading in parallel, without edge IDs, the edges may not be loaded in a deterministic order.
    verbose: Optional[bool]
        Whether to show a loading bar while reading the files. Note that, if parallel loading is enabled, loading bars will not be showed because they are a synchronization bottleneck.
    may_have_singletons: Optional[bool]
        Whether the graph may be expected to have singleton nodes. If it is said that it surely DOES NOT have any, it will allow for some speedups and lower mempry peaks.
    may_have_singleton_with_selfloops: Optional[bool]
        Whether the graph may be expected to have singleton nodes with selfloops. If it is said that it surely DOES NOT have any, it will allow for some speedups and lower mempry peaks.
    directed: bool
        Whether to load the graph as directed or undirected.
    name: Optional[str]
        The name of the graph to be loaded.
fxd24 commented 1 year ago

Thanks @LucaCappelletti94 for the fast response, this is already helpful information.

Let me rephrase/change the question. Given a graph with properties on nodes and edges, I would like to explore what the properties are and what they mean (what their encoding is). Can I get such properties? I had the impression (from functions like this) that in GRAPE you can have metadata. What is this metadata?

LucaCappelletti94 commented 1 year ago

Yes, nodes can have multiple labels and multigraphs are supported. There are many methods to do queries on these properties. I am not sure what you mean, if you try graph_instance.node_type, you'll get several methods regarding node types and the same for graph_instance.edge_type.

Are you using a graph loading from CSV or the datasets?

Could you specify more in detail what the specific issue in question is? As said before, I am on Discord if you'd prefer to clarify the issue better.