OpenPecha / Requests

RFWs and RFCs for all OpenPecha repositories

[RFC0060]: Filter works with both umed and OCRed modern publication #235

Open gangagyatso4364 opened 1 year ago

gangagyatso4364 commented 1 year ago

RFC0006: Filter works with both umed and OCRed modern publication

Named Concepts

RDFLib: Python library for parsing and working with RDF data.
GitPython: Python library for interacting with Git repositories; used here to access the data hosted on GitLab.
requests: Python library for HTTP communication, including authentication, error handling, and data retrieval.

Summary

The task is to filter a given dataset for works that have both UMED and Optical Character Recognition (OCR'ed) modern publications. The result is a list of the works that meet these criteria, together with the URLs for downloading their eText files.

Dependencies

Python
RDFLib
GitPython
requests

Infrastructures

GitLab repository containing the RDF dataset.

BDRC website: queried over HTTP with the requests library to retrieve additional information about each work.
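As a rough illustration of the HTTP lookup, the sketch below builds a URL for an instance ID and fetches it with requests. The base URL, the `.ttl` suffix, and the function names `build_instance_url` and `fetch_instance_data` are placeholders invented for this example; the RFC does not specify the actual BDRC endpoint.

```python
from urllib.parse import quote

# Placeholder base URL (assumption): replace with the real BDRC endpoint.
BASE_URL = "https://example.org/resource"


def build_instance_url(instance_id: str, base_url: str = BASE_URL) -> str:
    """Build the URL for one instance ID (hypothetical URL layout)."""
    return f"{base_url.rstrip('/')}/{quote(instance_id, safe='')}.ttl"


def fetch_instance_data(instance_id: str, timeout: float = 30.0) -> str:
    """Download the raw data for one instance and return it as text."""
    # requests is third-party; importing it here keeps URL building
    # usable even when requests is not installed.
    import requests

    response = requests.get(build_instance_url(instance_id), timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return response.text
```

Setting an explicit `timeout` and calling `raise_for_status()` keeps a slow or failing request from silently producing empty work data.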

Design Illustrations

Flowcharts_page-0001

Explanation:

  1. Authenticate with GitLab: obtain authorized access to the GitLab data.
  2. Parse work data, 1st phase: parse the work data to get the instance ID of each work.
  3. Dictionary: store the details of the work data in a nested dictionary of at most three levels.
  4. Make search engine request: look up further details of the work data on the BDRC website.
  5. Parse work data, 2nd phase: parse the data returned from the web for a given instance ID to retrieve its print method and script style.
  6. Filter dictionary for required work data: create a list of the works that match the criteria, with their corresponding URLs.
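The dictionary and filtering steps could look roughly like the sketch below. It assumes a three-level layout (work ID, then instance ID, then attributes) and reads the criterion as "the work has at least one UMED instance and at least one OCR'ed modern publication". The attribute names (`script`, `print_method`, `etext_source`, `etext_url`), their values, and the sample IDs and URLs are all hypothetical and not taken from the BDRC data model.

```python
from typing import Dict, List

# work ID -> instance ID -> attribute name -> value (three levels at most)
WorkData = Dict[str, Dict[str, Dict[str, str]]]


def filter_works(work_data: WorkData) -> List[dict]:
    """Return works that have both a UMED instance and an OCR'ed modern publication."""
    results = []
    for work_id, instances in work_data.items():
        # Collect the script styles seen across all instances of this work.
        scripts = {attrs.get("script") for attrs in instances.values()}
        # Instances that are modern publications with OCR'ed eText.
        ocr_modern = [
            attrs
            for attrs in instances.values()
            if attrs.get("print_method") == "modern"
            and attrs.get("etext_source") == "ocr"
        ]
        if "ume" in scripts and ocr_modern:
            urls = [attrs["etext_url"] for attrs in ocr_modern if "etext_url" in attrs]
            results.append({"work": work_id, "etext_urls": urls})
    return results


# Made-up sample data: W1 has both kinds of instance, W2 has only the OCR'ed one.
SAMPLE: WorkData = {
    "W1": {
        "I1": {"script": "ume", "print_method": "manuscript"},
        "I2": {
            "script": "uchen",
            "print_method": "modern",
            "etext_source": "ocr",
            "etext_url": "https://example.org/etext/I2",
        },
    },
    "W2": {
        "I3": {
            "script": "uchen",
            "print_method": "modern",
            "etext_source": "ocr",
            "etext_url": "https://example.org/etext/I3",
        },
    },
}

matches = filter_works(SAMPLE)
```

With this sample, only `W1` is kept, because `W2` has no UMED instance.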

Justification

The proposed design was selected as it provides a straightforward and effective way to identify works with UMED and OCR'ed modern publications. Alternatives, such as manual inspection, would be time-consuming and error-prone. This automated approach will greatly improve efficiency.

Without this approach, the filtering would require more manual work and could miss relevant works that have both UMED and OCR'ed versions.

Testing

Implementation Steps

List all the steps involved during implementation.

Reviewed By

@ta4tsering

ta4tsering commented 1 year ago

put in the estimated time