DHARPA-Project / kiara_plugin.topic_modelling

Mozilla Public License 2.0

Modules roadmap: data onboarding until subset creation #1

Open MariellaCC opened 9 months ago

MariellaCC commented 9 months ago

@lorellav @caro401

Here is a proposed list of modules covering the first steps, up to subset creation.

This list is based on the work done in the following repos: https://github.com/DHARPA-Project/TopicModelling- and https://github.com/DHARPA-Project/kiara_plugin.dh_benelux_2023.

Please let me know if you have comments/questions and/or feedback.

| Pipeline step | Module name | Module scope | Foreseen inputs | Foreseen outputs | Code example / as seen in | Planned for Jupyter notebook version | Past UI suggestions/discussions |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Data onboarding | get zenodo text files | Get a zip file from Zenodo, check for text files within the zip, and create a table containing two columns: file names and file contents. Append two columns containing the number of words and the number of chars. | string: doi (example: 4596345), string: file name (example: "ChroniclItaly_3.0_original.zip") | table | this JDH article imports directly from Zenodo: https://journalofdigitalhistory.org/en/article/WBqfZzfi7nHK?idx=76&layer=hermeneutics&lh=674&pidx=76&pl=narrative&y=158 | 1 | Preview of the table; maybe in the future, if/when possible: visualization of the distribution of characters/words across documents |
| Data onboarding | get url text files | Get files from a direct link | string: url such as kiara.examples-main/examples/workshops/dh_benelux_2023/data | table | module download.file_bundle from https://github.com/DHARPA-Project/kiara_plugin.tabular | 1 | Preview of the table; maybe in the future, if/when possible: visualization of the distribution of characters/words across documents |
| Corpus table preparation | get lccn metadata | Get metadata from strings that comply with the LCCN pattern '/sn86069873/1900-01-05/' to obtain the publication ids and the dates, and add that information as two new columns. Optionally, publication names can be mapped by users. | table: the corpus for which we want to get metadata from file names, string: the name of the column containing metadata, list (optional): list of lists of unique publication references and publication names in the collection, provided in the same order | table | https://github.com/DHARPA-Project/kiara_plugin.dh_benelux_2023/blob/develop/src/kiara_plugin/dh_benelux_2023/modules/metadata.py | 1 | |
| Corpus table preparation | more options for metadata from file names | | | | | | |
| Corpus table preparation | get corpus distribution | Prepare data for visualization of the distribution of the corpus' documents | string: desired periodicity (day/month/year), string: column_name, table: corpus table | list: list of dicts to visualize (output type and format can be modified to fit front-end input type) | https://github.com/DHARPA-Project/kiara_plugin.dh_benelux_2023/blob/develop/src/kiara_plugin/dh_benelux_2023/modules/visualization.py | 1 | visualization prototype: https://observablehq.com/@dharpa-project/timestamped-corpus |
| Corpus table preparation | filter table | Create a subset of the dataset | table: corpus table to query, query: sql query to filter table | table: corpus table subset | query.table module from | 1 | Would there be a way to facilitate the query creation for users who may not be familiar with sql? |
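
To make the two corpus table preparation steps more concrete, here is a minimal sketch in plain Python. It is not the code from `metadata.py` or `visualization.py`: the regular expression, the function names, the row layout (`publication_id`, `date`, `publication_name`) and the "Example Title" publication name are all assumptions made for illustration, based only on the '/sn86069873/1900-01-05/' pattern and the inputs listed above.

```python
import re
from collections import Counter

# Assumption: the metadata sits between slashes as '/<publication id>/<YYYY-MM-DD>/'.
LCCN_PATTERN = re.compile(r"/(\w+)/(\d{4}-\d{2}-\d{2})/")


def extract_lccn_metadata(file_name, publication_names=None):
    """Return publication id, date and (optionally) a mapped publication name.

    publication_names is a list of [id, name] pairs, as in the optional
    'list of lists' input; returns None when the pattern is not found.
    """
    match = LCCN_PATTERN.search(file_name)
    if match is None:
        return None
    publication_id, date = match.groups()
    names = dict(publication_names or [])
    return {
        "publication_id": publication_id,
        "date": date,
        "publication_name": names.get(publication_id),
    }


def corpus_distribution(rows, periodicity="month"):
    """Count documents per publication and per period (day, month or year)."""
    # ISO dates can be truncated to get the period: YYYY, YYYY-MM, YYYY-MM-DD.
    width = {"year": 4, "month": 7, "day": 10}[periodicity]
    counts = Counter((row["publication_id"], row["date"][:width]) for row in rows)
    return [
        {"publication_id": pub, "period": period, "count": count}
        for (pub, period), count in sorted(counts.items())
    ]


# Hypothetical file names and publication name, for illustration only.
file_names = [
    "/sn86069873/1900-01-05/",
    "/sn86069873/1900-01-19/",
    "/sn86069873/1900-02-02/",
]
rows = [extract_lccn_metadata(f, [["sn86069873", "Example Title"]]) for f in file_names]
distribution = corpus_distribution(rows, periodicity="month")
```

The list of dicts returned by `corpus_distribution` is only one possible shape; as noted in the table, the output type and format can be changed to fit what the front end expects.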

Roadmap updates: 2023/12/04: added onboarding module to handle urls
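
Whichever onboarding route is used (Zenodo or direct url), the table-building part could look roughly like the sketch below. It only covers the step after the download: reading a zip archive, keeping the text files, and adding the word and character counts. The function name, the column names, the `.txt` filter and the whitespace-based word count are assumptions, not the plugin's actual implementation.

```python
import io
import zipfile


def text_files_to_rows(zip_bytes):
    """Build one row per .txt file found in a zip archive given as bytes."""
    rows = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in sorted(archive.namelist()):
            if not name.lower().endswith(".txt"):
                continue  # skip directories and non-text files
            content = archive.read(name).decode("utf-8", errors="replace")
            rows.append(
                {
                    "file_name": name,
                    "content": content,
                    "words": len(content.split()),  # whitespace-separated tokens
                    "chars": len(content),
                }
            )
    return rows


# Build a small in-memory archive to try the function without any download.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("doc_a.txt", "hello world")
    demo.writestr("cover.png", b"not a text file")
rows = text_files_to_rows(buffer.getvalue())
```

In this sketch a row is a plain dict; in the plugin the same four columns would end up in a table value.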

caro401 commented 9 months ago

> Would there be a way to facilitate the query creation for users who may not be familiar with sql?

Yes, but it's a hard UI problem, so I'd rather not do it for the first version if that's ok? Unless you can scope it to a few very specific kinds of queries (e.g. you can pick a date range without sql, but anything else is sql).

MariellaCC commented 9 months ago

@caro401

I renamed the "version" column to "Jupyter Notebook version" to make clear which prototype version is meant. It reflects the goals on my side: the modules planned on the plugin side for version 1 of the related Jupyter notebook. It does not assume anything about UI plans or versions. The UI column is an attempt to reflect past discussions, requests and comments that I could remember (I renamed that column accordingly too, to clarify what it is), and it is to be checked with @lorellav.

For the subset creation module, I plan to scope the first version to filtering by date and/or by publication name/id.
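
As a rough illustration of that scoping, the sketch below builds the sql query from a date range and/or a list of publication ids, so that users would not have to write sql themselves. Everything in it is an assumption: the table name `corpus`, the column names, the function name and the second publication id are made up, and `sqlite3` only stands in for whatever engine the query.table module uses.

```python
import sqlite3


def build_subset_query(date_from=None, date_to=None, publication_ids=None):
    """Return (sql, params) filtering a 'corpus' table by date and/or publication."""
    clauses, params = [], []
    if date_from:
        clauses.append("date >= ?")
        params.append(date_from)
    if date_to:
        clauses.append("date <= ?")
        params.append(date_to)
    if publication_ids:
        placeholders = ", ".join("?" for _ in publication_ids)
        clauses.append(f"publication_id IN ({placeholders})")
        params.extend(publication_ids)
    where = f" WHERE {' AND '.join(clauses)}" if clauses else ""
    return f"SELECT * FROM corpus{where} ORDER BY date", params


# Hypothetical corpus table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE corpus (publication_id TEXT, date TEXT, content TEXT)")
conn.executemany(
    "INSERT INTO corpus VALUES (?, ?, ?)",
    [
        ("sn86069873", "1900-01-05", "a"),
        ("sn86069873", "1900-02-02", "b"),
        ("sn00000000", "1900-01-10", "c"),
    ],
)

# Documents from one publication published in January 1900.
sql, params = build_subset_query("1900-01-01", "1900-01-31", ["sn86069873"])
subset = conn.execute(sql, params).fetchall()
```

User-supplied values are passed as query parameters rather than pasted into the sql string, and with no filters set the query simply returns the whole table.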