datalad / datalad-registry


Add file level metadata extraction #244

Open yarikoptic opened 1 year ago

yarikoptic commented 1 year ago

One of the possible use cases would be to establish/feed the datalad-catalog as outlined in https://github.com/datalad/datalad-registry/issues/123#issuecomment-1462470777 . For that we should start with metalad_core file-level extraction. But I think we might need some guidance, or even new functionality at the metalad level, to make it feasible in a reasonable time. One relevant issue, with a comment there, is https://github.com/datalad/datalad-metalad/issues/379 ; I am also inquiring now with @christian-monch on Element.

yarikoptic commented 1 year ago

Here is the issue which describes it: https://github.com/datalad/datalad-metalad/issues/395 . It would be nice to first run it across a good sample of datasets to assess how large those dumps would be: would PostgreSQL scale up to storing that many JSONB records?
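To get a rough feel for the on-disk cost, one option is to feed a sample metalad_core dump through PostgreSQL's pg_column_size(). A minimal sketch, assuming a local Postgres, psycopg2, and a hypothetical sample dump file (none of this is registry code):

```python
# Minimal sketch: compare the size of a raw metalad_core JSON dump with the
# size PostgreSQL reports for the same value stored as JSONB.
# The file name and DSN are hypothetical placeholders.
import json
import psycopg2

with open("sample_metalad_core_dump.json") as f:  # hypothetical sample dump
    raw = json.dumps(json.load(f))

print(f"raw JSON text: {len(raw.encode('utf-8'))} bytes")

conn = psycopg2.connect("dbname=datalad_registry")  # adjust DSN as needed
with conn, conn.cursor() as cur:
    cur.execute("SELECT pg_column_size(%s::jsonb)", (raw,))
    print(f"stored as JSONB: {cur.fetchone()[0]} bytes")
```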

candleindark commented 1 year ago

As far as I know, JSONB is the best type for storing JSON data in Postgres in most respects.

As for how well it scales, below is an answer from ChatGPT.

The JSONB data type in PostgreSQL is highly flexible and efficient for storing JSON-formatted data. Below are some points that discuss how well it scales:

Storage

  • Binary Format: JSONB is stored in a binary format, which allows for more efficient storage and retrieval but uses slightly more storage space than plain text JSON.

Query Performance

  • Indexing: One of the main advantages of JSONB is its support for specialized JSON indexing, through GIN (Generalized Inverted Index) or GiST (Generalized Search Tree), which can dramatically improve query performance.

  • Nested Queries: If your application requires complex nested queries within JSON structures, JSONB can offer a performance advantage.
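To make the indexing point concrete, a GIN index over the JSONB column lets containment queries (@>) avoid full table scans. A minimal sketch; the table and column names (file_metadata, meta) are made up for illustration and are not the registry's actual schema:

```python
# Illustrative only: create a GIN index over a JSONB column and run a
# containment query that can use it.  Table/column names are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=datalad_registry")  # adjust DSN as needed
with conn, conn.cursor() as cur:
    cur.execute(
        "CREATE INDEX IF NOT EXISTS file_metadata_meta_gin "
        "ON file_metadata USING gin (meta)"
    )
    cur.execute(
        "SELECT path FROM file_metadata WHERE meta @> %s::jsonb",
        ('{"type": "file"}',),
    )
    print(cur.fetchall())
```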

Write Performance

  • Overhead: JSONB incurs a small overhead during the insert or update operations as it parses and rewrites the JSON data into a binary format.

Space Utilization

  • Compression: JSONB data is more efficiently compressed than plain JSON or text data types, but the difference is usually marginal.

Search Operations

  • Advanced Operators: JSONB supports complex query expressions and operators that are not available in the JSON type, providing more flexibility and power.

Scalability

  • Partitioning: PostgreSQL supports table partitioning, and you can partition JSONB columns as well. This can help in scaling the database horizontally.

  • Read Replicas: For read-heavy workloads, you can use read replicas to distribute the read operations and improve performance.
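For reference, declarative partitioning of a table that carries a JSONB column could look roughly like the sketch below (hash-partitioned by a dataset id, requiring PostgreSQL 11+). All names are illustrative, not an agreed registry schema:

```python
# Illustrative DDL for a hash-partitioned table holding per-file JSONB records.
# All table and column names are hypothetical.
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS file_metadata (
    ds_id  uuid  NOT NULL,
    path   text  NOT NULL,
    meta   jsonb NOT NULL
) PARTITION BY HASH (ds_id);

CREATE TABLE IF NOT EXISTS file_metadata_p0
    PARTITION OF file_metadata FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE IF NOT EXISTS file_metadata_p1
    PARTITION OF file_metadata FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE IF NOT EXISTS file_metadata_p2
    PARTITION OF file_metadata FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE IF NOT EXISTS file_metadata_p3
    PARTITION OF file_metadata FOR VALUES WITH (MODULUS 4, REMAINDER 3);
"""

conn = psycopg2.connect("dbname=datalad_registry")  # adjust DSN as needed
with conn, conn.cursor() as cur:
    cur.execute(DDL)
```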

Compatibility and Interoperability

  • Compatibility with Other Data Types: JSONB can be easily cast to other data types and manipulated using a range of built-in PostgreSQL functions.

Recommendations:

  1. Caching: If your JSONB fields are often queried but rarely updated, consider caching the results to avoid hitting the database repeatedly.

  2. Selective Indexing: Use indexes judiciously. Create them only for the keys that are frequently used in queries.

  3. Batch Processing: If possible, batch insert or update operations to reduce the per-operation overhead of JSONB conversion.

  4. Examine Query Plans: Use tools like EXPLAIN to understand how PostgreSQL is handling your JSONB queries and optimize accordingly.
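On the query-plan point, checking whether a JSONB query actually uses a GIN index rather than a sequential scan is straightforward; this sketch reuses the same hypothetical table/column names as above:

```python
# Illustrative only: print the plan Postgres chooses for a JSONB containment
# query, to verify it uses the GIN index rather than a sequential scan.
import psycopg2

conn = psycopg2.connect("dbname=datalad_registry")  # adjust DSN as needed
with conn, conn.cursor() as cur:
    cur.execute(
        "EXPLAIN (ANALYZE, BUFFERS) "
        "SELECT path FROM file_metadata WHERE meta @> %s::jsonb",
        ('{"type": "file"}',),
    )
    for (line,) in cur.fetchall():
        print(line)
```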


yarikoptic commented 1 year ago

> Scalability
>
> Partitioning: PostgreSQL supports table partitioning, and you can partition JSONB columns as well. This can help in scaling the database horizontally.

sounds relevant.

I guess we will see how well (or badly) it works after we fill this one up with some initial dumps of per-file metadata. Then we might want to look into partitioning etc.

candleindark commented 1 year ago

I think JSONB is the best option Postgres offers for JSON data for our use. If we want anything better, we would have to look into NoSQL databases: CouchDB, Elasticsearch, etc. Many applications use both relational and NoSQL databases.

yarikoptic commented 11 months ago

OK, let's plan to tackle this one soonish after we merge #257 . For that we will enable dandi:files, and we would also need to create a "pipeline" to extract e.g. metalad_core:files as described in https://github.com/datalad/datalad-metalad/issues/395 -- starting with just that single metalad_core extractor.
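A very rough sketch of what such a per-file extraction step could look like, shelling out to datalad-metalad's meta-extract for each file and collecting the JSON records for later ingestion. The exact file-level invocation for metalad_core is precisely what datalad-metalad#395 is about, so the extractor name and call below are assumptions, not the agreed interface:

```python
# Hypothetical sketch, not registry code: run a metalad extractor on every
# file of a dataset via the `datalad meta-extract` CLI and yield the JSON
# records it prints.  Extractor name and output handling are assumptions.
import json
import subprocess
from pathlib import Path

def extract_file_records(ds_path, extractor="metalad_core"):
    ds = Path(ds_path)
    for path in ds.rglob("*"):
        if not path.is_file() or ".git" in path.parts or ".datalad" in path.parts:
            continue
        rel = path.relative_to(ds)
        proc = subprocess.run(
            ["datalad", "meta-extract", "-d", str(ds), extractor, str(rel)],
            capture_output=True,
            text=True,
        )
        if proc.returncode == 0 and proc.stdout.strip():
            yield json.loads(proc.stdout)

# usage: records = list(extract_file_records("/path/to/dataset"))
```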

Later, after we see how badly this scales for various uses (e.g. search, datalad-catalog, etc.), we might want to establish a dedicated table with per-dataset/path metadata records.
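For later reference, one possible shape for such a dedicated table, sketched with plain SQLAlchemy; all names and types are illustrative only and do not reflect an agreed registry schema:

```python
# Hypothetical sketch of a per-dataset/path metadata table using SQLAlchemy.
from sqlalchemy import Column, Index, Integer, String
from sqlalchemy.dialects.postgresql import JSONB
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class FileMetadata(Base):
    __tablename__ = "file_metadata"

    id = Column(Integer, primary_key=True)
    ds_id = Column(String, nullable=False, index=True)  # dataset identifier
    path = Column(String, nullable=False)                # path within the dataset
    extractor = Column(String, nullable=False)           # e.g. "metalad_core"
    meta = Column(JSONB, nullable=False)                  # extracted metadata record

    __table_args__ = (
        # GIN index so JSONB containment queries stay usable as the table grows
        Index("ix_file_metadata_meta", meta, postgresql_using="gin"),
    )
```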