RFC0006: Filter works with both UMED and OCR'ed modern publications
Named Concepts
RDFLib: Python library for parsing and working with RDF data.
GitPython: Python library for interacting with Git repositories; used here to access the repository hosted on GitLab.
requests: Python library for HTTP communication, including authentication, error handling, and data retrieval.
Summary
This task filters a given dataset for works that have both a UMED version and an OCR'ed (Optical Character Recognition) modern publication. The result is a list of the works that meet these criteria, including URLs for downloading the eText files.
Dependencies
Python
RDFLib
GitPython
requests
Infrastructures
GitLab repository containing the RDF dataset.
The requests library is a popular Python library for making HTTP requests to web services, APIs, and websites. We will use it to retrieve data from the BDRC website.
Design Illustrations
Explanation:
Authenticate GitLab: obtain authorized access to the GitLab data.
Parse work data, first phase: parse the work data to get the instance ID of each work.
Dictionary: stores the details of the work data as a nested dictionary of at most three levels.
Make search engine request: query the BDRC website for more details about each work.
Parse work data, second phase: parse the data returned by the website for a given instance ID to retrieve its print method and script style.
Filter dictionary for required work data: create a list of the works that match the criteria, with their corresponding URLs.
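The dictionary and filtering steps above can be sketched as follows. This is only a minimal illustration, not the actual implementation: it assumes the two parsing phases have already filled the nested dictionary, and every ID, attribute name ("script", "print_method", "ocr", "url"), attribute value, and URL shown is a hypothetical placeholder rather than a real BDRC label.

```python
from typing import Dict, List

# Three-level nested dictionary: work ID -> instance ID -> attribute -> value.
# All IDs, attribute names, values, and URLs are hypothetical placeholders.
WorkData = Dict[str, Dict[str, Dict[str, str]]]

work_data: WorkData = {
    "W_A": {
        "I_A1": {"script": "umed", "print_method": "manuscript",
                 "url": "https://example.org/I_A1"},
        "I_A2": {"script": "uchen", "print_method": "modern", "ocr": "yes",
                 "url": "https://example.org/I_A2"},
    },
    "W_B": {
        # Only an OCR'ed modern publication, no UMED version.
        "I_B1": {"script": "uchen", "print_method": "modern", "ocr": "yes",
                 "url": "https://example.org/I_B1"},
    },
}


def is_umed(info: Dict[str, str]) -> bool:
    """True if the instance is written in UMED script (assumed label)."""
    return info.get("script") == "umed"


def is_ocred_modern(info: Dict[str, str]) -> bool:
    """True if the instance is a modern publication with OCR (assumed labels)."""
    return info.get("print_method") == "modern" and info.get("ocr") == "yes"


def filter_works(data: WorkData) -> Dict[str, Dict[str, List[str]]]:
    """Keep works that have at least one UMED and one OCR'ed modern instance."""
    result: Dict[str, Dict[str, List[str]]] = {}
    for work_id, instances in data.items():
        umed_urls = [i["url"] for i in instances.values()
                     if is_umed(i) and "url" in i]
        ocr_urls = [i["url"] for i in instances.values()
                    if is_ocred_modern(i) and "url" in i]
        if umed_urls and ocr_urls:
            result[work_id] = {"umed": umed_urls, "ocr_modern": ocr_urls}
    return result


matches = filter_works(work_data)
print(matches)
```

In this sketch a work is kept only when both lists of URLs are non-empty, so "W_B" is dropped because it has no UMED instance.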
Justification
The proposed design was selected because it provides a straightforward, automated way to identify works with both UMED and OCR'ed modern publications. Alternatives such as manual inspection would be time-consuming and error-prone.
Without this approach, the filtering would have to be done manually, with a risk of missing relevant works that have both UMED and OCR'ed versions.
Testing
Manual testing of the script to ensure it correctly extracts the required data from the RDF dataset.
Verification of the accuracy of the URLs for the UMED and OCR'ed versions.
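Part of the URL verification could be supported by a small format check before the manual review. The sketch below is an assumption about how that might look, using only the Python standard library; the helper name is_well_formed_url and the sample URLs are hypothetical. It checks only that a URL is syntactically plausible, not that it resolves or points to the right eText.

```python
from urllib.parse import urlparse


def is_well_formed_url(url: str) -> bool:
    """Return True if the URL has an http(s) scheme and a non-empty host."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


# Hypothetical sample values; real URLs would come from the filtered list.
candidates = ["https://example.org/etext/I_A2", "example.org/no-scheme", ""]
malformed = [u for u in candidates if not is_well_formed_url(u)]
print(malformed)
```

Any URL reported as malformed would still need to be inspected by hand, as would the correctness of the well-formed ones.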
Implementation Steps
List all the steps involved during implementation.
[ ] OpenPecha/filtering_work_data#1
Estimated time: 1 hour
Actual time:
[ ] OpenPecha/filtering_work_data#2
Estimated time: 1 hour
Actual time:
[ ] OpenPecha/filtering_work_data#3
Estimated time: 1 hour
Actual time:
[ ] OpenPecha/filtering_work_data#4
Estimated time: 1 hour
Actual time:
[ ] OpenPecha/filtering_work_data#5
Estimated time: 1 hour
Actual time:
[ ] OpenPecha/filtering_work_data#6
Estimated time: 1 hour
Actual time:
[ ] OpenPecha/filtering_work_data#7
Estimated time: 1 hour
Actual time:
[ ] OpenPecha/filtering_work_data#8
Estimated time: 1 hour
Actual time:
[ ] OpenPecha/filtering_work_data#9
Estimated time: 1 hour
Actual time:
Reviewed By
@ta4tsering