Modules roadmap data onboarding until subset creation

MariellaCC commented 9 months ago

@lorellav @caro401

Here is a proposed list of modules for the first steps, up to and including subset creation.

This list is based on the work done in the following repos: https://github.com/DHARPA-Project/TopicModelling- and https://github.com/DHARPA-Project/kiara_plugin.dh_benelux_2023.

Please let me know if you have comments/questions and/or feedback.

| Pipeline step | Module name | Module scope | Foreseen inputs | Foreseen outputs | Code example / as seen in | Planned for Jupyter notebook version | Past UI suggestions/discussions |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Data onboarding | get zenodo text files | Get a zip file from zenodo, check for text files within the zip, and create a table containing two columns: file names and file contents. Append two columns containing the number of words and the number of chars. | string: doi (example: 4596345), string: file name (example: "ChroniclItaly_3.0_original.zip") | table | this JDH article imports directly from zenodo: https://journalofdigitalhistory.org/en/article/WBqfZzfi7nHK?idx=76&layer=hermeneutics&lh=674&pidx=76&pl=narrative&y=158 | 1 | Preview of the table; maybe in the future, if/when possible: visualization of the character/word distribution across documents |
| Data onboarding | get url text files | Get files from a direct link | string: url such as kiara.examples-main/examples/workshops/dh_benelux_2023/data | table | module download.file_bundle from https://github.com/DHARPA-Project/kiara_plugin.tabular | 1 | Preview of the table; maybe in the future, if/when possible: visualization of the character/word distribution across documents |
| Corpus table preparation | get lccn metadata | Get metadata from strings that comply with the LCCN pattern '/sn86069873/1900-01-05/' to extract the publication ids and the dates, and add that information as two new columns. Optionally, publication names can be mapped by users. | table: the corpus for which we want to get metadata from file names, string: the name of the column containing metadata, list (optional): list of lists of unique publication references and publication names in the collection, provided in the same order | table | https://github.com/DHARPA-Project/kiara_plugin.dh_benelux_2023/blob/develop/src/kiara_plugin/dh_benelux_2023/modules/metadata.py | 1 | |
| Corpus table preparation | | More options for metadata from file names | | | | | |
| Corpus table preparation | get corpus distribution | Prepare data for visualization of the distribution of the corpus' documents | string: desired periodicity (day/month/year), string: column_name, table: corpus table | list: list of dicts to visualize (output type and format can be modified to fit front-end input type) | https://github.com/DHARPA-Project/kiara_plugin.dh_benelux_2023/blob/develop/src/kiara_plugin/dh_benelux_2023/modules/visualization.py | 1 | visualization prototype: https://observablehq.com/@dharpa-project/timestamped-corpus |
| Corpus table preparation | filter table | Create a subset of the dataset | table: corpus table to query, query: sql query to filter table | table: corpus table subset | query.table module from | 1 | Would there be a way to facilitate the query creation for users who may not be familiar with sql? |
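To make the two "Corpus table preparation" steps more concrete, here is a minimal sketch of "get lccn metadata" followed by "get corpus distribution". It is not the code from the linked plugin modules. It assumes the corpus table is a plain list of dicts, that the regular expression below (inferred only from the single example '/sn86069873/1900-01-05/') is good enough, and the function names, the output keys and the id `sn00000001` are hypothetical.

```python
import re
from collections import Counter

# Assumed pattern: up to three lowercase letters plus 8-10 digits (e.g. "sn86069873"),
# a "/" or "_" separator, then an ISO date. Inferred from one example only.
LCCN_PATTERN = re.compile(r"(?P<pub_id>[a-z]{0,3}\d{8,10})[/_](?P<date>\d{4}-\d{2}-\d{2})")


def extract_lccn_metadata(rows, column, publication_names=None):
    """Return copies of `rows` with 'publication_id' and 'date' columns added.

    `publication_names` is the optional list of [publication id, name] pairs.
    """
    names = dict(publication_names or [])
    result = []
    for row in rows:
        match = LCCN_PATTERN.search(row[column])
        new_row = dict(row)
        new_row["publication_id"] = match.group("pub_id") if match else None
        new_row["date"] = match.group("date") if match else None
        if names:
            new_row["publication_name"] = names.get(new_row["publication_id"])
        result.append(new_row)
    return result


def corpus_distribution(rows, periodicity="month", column_name="date"):
    """Count documents per period and publication, as a list of dicts."""
    # ISO dates can be truncated to get the period: YYYY, YYYY-MM or YYYY-MM-DD.
    width = {"year": 4, "month": 7, "day": 10}[periodicity]
    counts = Counter(
        (row[column_name][:width], row["publication_id"])
        for row in rows
        if row.get(column_name)  # skip rows where no metadata was found
    )
    return [
        {"date": period, "publication_id": pub_id, "count": count}
        for (period, pub_id), count in sorted(counts.items())
    ]


corpus = [
    {"file_name": "/sn86069873/1900-01-05/"},
    {"file_name": "/sn86069873/1900-01-19/"},
    {"file_name": "/sn00000001/1900-02-02/"},  # hypothetical publication id
]
with_metadata = extract_lccn_metadata(corpus, "file_name")
distribution = corpus_distribution(with_metadata, periodicity="month")
```

Keeping the dates as ISO strings means the periodicity is just a prefix length, and the resulting list of dicts can be reshaped later to whatever the front-end expects.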

Roadmap updates: 2023/12/04: added onboarding module to handle urls
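Both onboarding modules are meant to end with the same table shape (file name, file content, number of words, number of chars). A minimal sketch of that shared step, assuming the zip archive has already been downloaded into memory as bytes (the download itself is left out), that text files are recognised by a `.txt` extension, and that words are whitespace-separated tokens. The function and column names are hypothetical.

```python
import io
import zipfile


def text_files_to_table(zip_bytes, encoding="utf-8"):
    """Build one row per .txt file found in a zip archive held in memory."""
    rows = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            # Assumption: only files ending in .txt count as text files.
            if info.is_dir() or not info.filename.lower().endswith(".txt"):
                continue
            content = archive.read(info).decode(encoding, errors="replace")
            rows.append(
                {
                    "file_name": info.filename,
                    "content": content,
                    "words": len(content.split()),  # whitespace-separated tokens
                    "chars": len(content),
                }
            )
    return rows
```

A list of dicts like this converts easily into a table type later on, and the word and char columns are what a distribution preview would read from.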

caro401 commented 9 months ago

> Would there be a way to facilitate the query creation for users who may not be familiar with sql?

Yes, but it's a hard UI problem, so I'd rather not do it for the first version, if that's ok? Unless you can scope it to a few very specific kinds of queries (e.g. you can pick a date range without SQL, but anything else is SQL).

MariellaCC commented 9 months ago

@caro401

I renamed the "version" column to "Jupyter Notebook version" to make clear which prototype version is meant. It records the goals on my side: the modules planned on the plugin side for version 1 of the related Jupyter Notebook. It does not assume anything about UI plans or versions. The UI column is an attempt to reflect past discussions, requests and comments that I could remember (I renamed that column accordingly too, to clarify what it is), and it is to be checked with @lorellav.

On the subset creation module side, for the first version I plan to scope it to filtering by date and/or by publication name/id.
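One way this scoping could work without asking users for SQL is to generate the query from a few optional filters. This is a rough sketch only: it uses `sqlite3` from the Python standard library as a stand-in for the query step, and the table name `corpus`, the column names `date` and `publication_id`, and the function name are assumptions. Whether the actual query module accepts bound parameters would need to be checked.

```python
import sqlite3


def build_subset_query(date_from=None, date_to=None, publication_ids=None):
    """Return (sql, params) for the filters that were actually provided."""
    clauses, params = [], []
    if date_from:
        clauses.append("date >= ?")
        params.append(date_from)
    if date_to:
        clauses.append("date <= ?")
        params.append(date_to)
    if publication_ids:
        placeholders = ", ".join("?" for _ in publication_ids)
        clauses.append(f"publication_id IN ({placeholders})")
        params.extend(publication_ids)
    sql = "SELECT * FROM corpus"  # assumed table name
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params


# Usage with an in-memory database standing in for the corpus table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE corpus (file_name TEXT, publication_id TEXT, date TEXT)")
conn.executemany(
    "INSERT INTO corpus VALUES (?, ?, ?)",
    [
        ("a.txt", "sn86069873", "1900-01-05"),
        ("b.txt", "sn86069873", "1900-02-10"),
        ("c.txt", "sn00000001", "1900-01-20"),  # hypothetical publication id
    ],
)
sql, params = build_subset_query(
    date_from="1900-01-01", date_to="1900-01-31", publication_ids=["sn86069873"]
)
subset = conn.execute(sql, params).fetchall()
```

Dates stored as ISO strings (YYYY-MM-DD) sort correctly as text, so a date range needs no date parsing. User input only ever reaches the query through placeholders, and a free-form SQL input can still be offered separately for anything beyond these filters.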

DHARPA-Project / kiara_plugin.topic_modelling

Modules roadmap data onboarding until subset creation #1