apache / incubator-graphar

An open source, standard data file format for graph data storage and retrieval.
https://graphar.apache.org/
Apache License 2.0
213 stars, 46 forks

feat: Check the metadata and the file #548

Open jasinliu opened 1 month ago

jasinliu commented 1 month ago

Describe the enhancement requested

Currently, our metadata and storage files are kept separately, and the metadata can be modified on its own. This is very convenient, but it becomes troublesome when an error occurs.

We need to provide a checking tool that verifies whether the metadata is valid and whether the storage files are consistent with the metadata.

Component(s)

Format, Other

SemyonSinchenko commented 1 month ago

What do you think about making such a tool as a part of the planned GraphAr CLI?

https://github.com/apache/incubator-graphar/issues/463

jasinliu commented 1 month ago

Yes, please let me fix this.

yecol commented 1 month ago

Hi @jasinliu, I don't think this problem comes from storing the metadata and the files separately. The situation you mentioned, changing only the metadata, is allowed and is even by design:

E.g., a user has a graph G with edges labeled A/B/C and vertices labeled D/E. They can easily generate a graph G' with edges labeled A and vertices labeled D just by copying and modifying the metadata into M'.

Hence, I suggest that the validation tool should not check whether the pairing of metadata and files is unmodified, but should instead validate the following:

  • The modified metadata is self-consistent. E.g., in the example above, in the M' for G', the edges labeled A must connect ONLY vertices labeled D; otherwise G' lacks vertices.
  • For the storage files, I suggest that the metadata record each file's location and its digest (e.g., MD5), to ensure there has been no modification since the last archive. When loading, check the digest to ensure the payload of the data is what you intend to read.

jasinliu commented 1 month ago

Thank you very much, this is a very good suggestion. It also implies that one set of storage files can correspond to multiple different graphs.
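
As a rough illustration of the two checks suggested in this thread (metadata self-consistency and per-file digests), a minimal Python sketch might look like the following. It does not use GraphAr's actual API or metadata schema: the dictionary layout (`vertices`, `edges` with `src`/`dst`, and `files` mapping a relative path to a digest) and all function names are assumptions made up for this example. SHA-256 is used by default instead of the MD5 mentioned above; any algorithm name accepted by `hashlib.new` can be passed.

```python
import hashlib
from pathlib import Path


def check_self_consistency(meta: dict) -> list[str]:
    """Return a list of problems; an empty list means every edge
    endpoint refers to a vertex label declared in the same metadata."""
    vertex_labels = set(meta["vertices"])
    problems = []
    for edge in meta["edges"]:
        for role in ("src", "dst"):
            if edge[role] not in vertex_labels:
                problems.append(
                    f"edge {edge['label']!r}: {role} vertex label "
                    f"{edge[role]!r} is not declared"
                )
    return problems


def file_digest(path: Path, algorithm: str = "sha256") -> str:
    """Hash a file in 1 MiB chunks so large payloads are not loaded at once."""
    h = hashlib.new(algorithm)
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def check_digests(meta: dict, root: Path, algorithm: str = "sha256") -> list[str]:
    """Compare each recorded digest (hypothetical `files` mapping of
    relative path -> hex digest) against the file currently on disk."""
    problems = []
    for rel_path, expected in meta["files"].items():
        path = root / rel_path
        if not path.is_file():
            problems.append(f"{rel_path}: missing")
        elif file_digest(path, algorithm) != expected:
            problems.append(f"{rel_path}: digest mismatch")
    return problems


# Hypothetical M' for G': edges labeled A, vertices labeled D only.
m_prime = {
    "vertices": ["D"],
    "edges": [{"label": "A", "src": "D", "dst": "E"}],  # E was dropped from M'
    "files": {},
}
for problem in check_self_consistency(m_prime):
    print(problem)
```

In this sketch the derived metadata M' is rejected because an edge labeled A still points at a vertex label (E) that M' no longer declares, which is the "G' lacks vertices" case. The digest check is independent of that and only confirms that the files referenced by the metadata have not changed since the digests were recorded.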