Clean dummy data and make it usable for the other issues as well

Bubblbu commented 2 years ago

Create a dummy data set that contains all required data as well as metadata to describe citation data and its provenance.

Bubblbu commented 2 years ago

Citation Data

Citation data typically consists of tables (usually CSVs) that describe one of two broad forms of data: (1) the direct outputs of indexing processes, which we call traces, or (2) a processed (usually aggregated) form of those traces, which we call metrics for now.

Traces

Mentions (aka citation statements or citations-in-context)

Some data providers parse the full texts of the content to identify statements that cite other documents. This form of citation data is often called citations-in-context (I know... great job on my side picking the exact same name for a different meaning), citation statements, or mentions.

+----+--------+--------+--------------+
| id | source | target |   context    |
+----+--------+--------+--------------+
|  1 | doi1   | doi2   | introduction |
|  2 | doi1   | doi2   | introduction |
|  3 | doi1   | doi2   | methods      |
|  4 | doi1   | doi3   | discussion   |
|  5 | doi2   | doi3   | conclusion   |
+----+--------+--------+--------------+

References (aka citations)

The most common level of granularity for the raw data indexed by data providers is the reference, which simply records that one article cited another. Most often the articles themselves are not processed for this data; instead, publishers provide reference lists and standardized descriptions of bibliographic information.

+----+--------+--------+
| id | source | target |
+----+--------+--------+
|  1 | doi1   | doi2   |
|  2 | doi1   | doi3   |
|  3 | doi2   | doi3   |
+----+--------+--------+

Metrics

Aggregate counts (aka citation counts)

Both forms of traces (mentions and references) can be aggregated at various levels to produce summarized statements about an article, author, journal, institution, or even country. These aggregations rely on the disambiguation and identification of those entities.

By combining different traces and processing methods we can now produce metrics at the same level of aggregation to compare entities and individuals.

+---------+----------+------------+
| article | mentions | references |
+---------+----------+------------+
| doi1    |        0 |          0 |
| doi2    |        3 |          1 |
| doi3    |        2 |          2 |
+---------+----------+------------+
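
As a rough sketch of how the aggregate table above follows from the two trace tables, the snippet below counts how often each article appears as a target. The function names (`count_by_target`, `aggregate_counts`) are made up for illustration and are not part of any existing tool; the rows are copied from the dummy tables in this comment.

```python
from collections import Counter

# (source, target, context) rows copied from the mentions table above
mentions = [
    ("doi1", "doi2", "introduction"),
    ("doi1", "doi2", "introduction"),
    ("doi1", "doi2", "methods"),
    ("doi1", "doi3", "discussion"),
    ("doi2", "doi3", "conclusion"),
]

# (source, target) rows copied from the references table above
references = [
    ("doi1", "doi2"),
    ("doi1", "doi3"),
    ("doi2", "doi3"),
]


def count_by_target(rows):
    """Count how many rows point at each target article (second field)."""
    return Counter(row[1] for row in rows)


def aggregate_counts(mentions, references):
    """Build one row per article with its mention and reference counts."""
    mention_counts = count_by_target(mentions)
    reference_counts = count_by_target(references)
    # Every article that appears as a source or a target gets a row,
    # so articles that are never cited show up with zero counts.
    articles = sorted({doi for row in mentions + references for doi in row[:2]})
    return [
        {
            "article": article,
            "mentions": mention_counts[article],
            "references": reference_counts[article],
        }
        for article in articles
    ]


counts = aggregate_counts(mentions, references)
for row in counts:
    print(row)
```

Counting by target is only one possible aggregation; author, journal, or institution level counts would additionally need the disambiguation step mentioned above.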

Bubblbu commented 2 years ago

Metadata

Each type of citation data (traces and metrics) requires its own type of metadata schema to describe each column that is used.

Traces

Trace metadata describes the indexing process including the indexed content and methods of extraction.

name:
description:
citation_index_profile:

Metrics

Metric metadata schemas will be as complicated as the metric being captured, and it is quite possible that no information is available about how a metric was processed. At a high level, the schema attempts to capture the various processes that were applied to a trace to create the final metric. These processes can be classified as aggregation, normalization, filtering, and classification.

aggregation: pivots the table
filter: reduces number of rows
classification: adds column
normalization: changes values
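
To make the four one-liners above a bit more concrete, here is a minimal sketch that applies each process to the dummy mentions table. All function names, the `section_group` column, and the choice to drop introduction mentions are assumptions made up for this example, not a proposed implementation.

```python
from collections import Counter

# Dummy mentions as dict rows (same data as the mentions table)
mentions = [
    {"source": "doi1", "target": "doi2", "context": "introduction"},
    {"source": "doi1", "target": "doi2", "context": "introduction"},
    {"source": "doi1", "target": "doi2", "context": "methods"},
    {"source": "doi1", "target": "doi3", "context": "discussion"},
    {"source": "doi2", "target": "doi3", "context": "conclusion"},
]


def filter_rows(rows, predicate):
    """filter: reduces the number of rows."""
    return [row for row in rows if predicate(row)]


def classify(rows, column, fn):
    """classification: adds a column derived from each row."""
    return [{**row, column: fn(row)} for row in rows]


def aggregate(rows, key):
    """aggregation: pivots the table to one row per distinct key value."""
    counts = Counter(row[key] for row in rows)
    return [{key: value, "count": n} for value, n in sorted(counts.items())]


def normalize(rows, column):
    """normalization: changes values, here to shares of the column total."""
    total = sum(row[column] for row in rows)
    if total == 0:
        return [dict(row) for row in rows]
    return [{**row, column: row[column] / total} for row in rows]


# Hypothetical pipeline: drop introduction mentions, label the rest,
# count per target, then turn counts into shares.
filtered = filter_rows(mentions, lambda r: r["context"] != "introduction")
classified = classify(
    filtered,
    "section_group",
    lambda r: "framing" if r["context"] == "conclusion" else "body",
)
aggregated = aggregate(classified, "target")
normalized = normalize(aggregated, "count")
```

The ordered list of steps applied here (filter, classification, aggregation, normalization) is the kind of information the provenance part of a metric schema would need to record.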

- name
- description
- provenance
  - function
  -

Bubblbu commented 2 years ago

Citation Index Profiles

Finally, citation index profiles are collections of metadata schemas that are identified for a single data provider. Typically, each data provider will offer one kind of trace and a multitude of metrics, but it also happens that a data provider offers multiple traces.
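
One way to picture this grouping is a small set of dataclasses, sketched below. The class names, the `provider` field, and the example values ("ExampleIndex" and its schemas) are all hypothetical; only `name`, `description`, and `provenance` come from the schema drafts in the earlier comments, and the real schemas are still undecided.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TraceSchema:
    """Describes one kind of trace (e.g. mentions or references)."""
    name: str
    description: str


@dataclass
class MetricSchema:
    """Describes one metric and the processes that produced it."""
    name: str
    description: str
    # Ordered process names; may stay empty when nothing is known
    # about how the metric was processed.
    provenance: List[str] = field(default_factory=list)


@dataclass
class CitationIndexProfile:
    """All metadata schemas identified for a single data provider."""
    provider: str
    traces: List[TraceSchema] = field(default_factory=list)
    metrics: List[MetricSchema] = field(default_factory=list)


# Hypothetical provider with one trace and two metrics
profile = CitationIndexProfile(
    provider="ExampleIndex",
    traces=[TraceSchema("references", "Article-to-article reference links")],
    metrics=[
        MetricSchema("reference_count", "References per article", ["aggregation"]),
        MetricSchema("opaque_score", "Score with undocumented processing"),
    ],
)
```

A provider that offers multiple traces would simply carry more than one entry in `traces`.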
