Open DaChad1523 opened 2 months ago
Task 1 Completed
import os, csv
csv_D=r'C:\Study\WINT01 Bachelor of IT\First Year Sem2\HIT137 Software now\Assignment 2'
output_file = 'output.txt'
with open(output_file, 'w', encoding='utf-8') as txt_file:
for csv_filename in os.listdir(csv_D):
if csv_filename.endswith('.csv'):
csv_path=os.path.join(csv_D, csv_filename)
with open(csv_path, 'r', encoding='utf-8') as csv_file:
reader=csv.reader(csv_file)
for row in reader:
txt_file.write(' '.join(row)+'\n')
This question consists of multiple CSV files (In the Zipped Folder) with ‘large texts’ in one of the columns in each file. Your job is to use the open-source NLP (Natural Language Processing) libraries and perform various tasks.
Task 1:
Extract the ‘text’ in all the CSV files and store them into a single ‘.txt file’.
Task 2: Research
Install the libraries(SpaCy – scispaCy – ‘en_core_sci_sm’/’en_ner_bc5cdr_md’). Install the libraries (Transformers (Hugging Face) - and any bio-medical model (BioBert) that can detect drugs, diseases, etc from the text).
Task 3: Programming and Research 3.1:
Using any in-built library present in Python, count the occurrences of the words in the text (.txt) and give the ‘Top 30’ most common words.
And store the ‘Top 30’ common words and their counts into a CSV file.
3.2: Using the ‘Auto Tokenizer’ function in the ‘Transformers’ library, write a ‘function’ to count unique tokens in the text (.txt) and give the ‘Top 30’ words. Task 4: Named-Entity Recognition (NER) Extract the ‘diseases’, and ‘drugs’ entities in the ‘.txt file’ separately using ‘en_core_sci_sm’/’en_ner_bc5cdr_md’ and biobert. And compare the differences between the two models (Example: Total entities detected by both of them, what’s the difference, check for most common words, and check the difference.)