Open Niharikajo opened 8 months ago
Hi, thanks for your interest. I think this code is correct. If you hit any error, let me know and I will check. Also, do you know how to install and use the OpenPrompt library?
Thank you for replying,
I cloned the OpenPrompt GitHub repo. I am using Google Colab for execution. After installing the requirements, I tried to run the RoBERTa prompt code, but data_utils.text_classification_dataset does not contain a resumeProcessor class, so I got a ModuleNotFoundError.
Could you please let me know what changes I need to make?
Thank you
OK, I see. I customized a few classes and functions in the OpenPrompt source code so that OpenPrompt can read the resume dataset format. I will upload that part of the code. I need some time to find it because it has been a long time. Please wait half an hour. If you are able to customize the OpenPrompt framework source code yourself, you can also do that.
resume-IE-via-prompt /OpenPromtCustomSourceCode
Hi Niharikajo. I uploaded a new code file. Follow the instructions in that file to replace the OpenPrompt framework's source code so that it handles the seven-class resume dataset format.
If you get it working, let me know. If you encounter another error, you can also ask me.
Thank you for updating the file so soon. I made the changes you mentioned.
I'm getting an error:
ValueError: invalid literal for int() with base 10: 'PI'
I tried to encode the labels into integers, but the error remains.
Can you please let me know how to resolve this error?
Thank you
Sorry for the late response. I think this error comes from the PI label in the dataset. You should check the dataset format and the code for the input step, or locate the line that raises the error and check it line by line.
I'm using the resume-seven-class dataset. The line that raises the error is: example = InputExample(guid=str(idx), text_a=text_a, label=int(label)-1)
Any suggestions on what I should try?
Thank you
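To locate the rows behind this error, one option is to scan the first CSV column for values that int() cannot parse, since the failing line calls int(label) on that column. The following is only a minimal sketch: the helper name find_non_integer_labels and the two-row sample are hypothetical and are not part of OpenPrompt or the dataset.

```python
import csv
import io


def find_non_integer_labels(rows):
    """Return (row_number, label) pairs whose first column int() cannot parse."""
    bad = []
    for row_number, row in enumerate(rows, start=1):
        if not row:
            continue  # csv.reader yields [] for blank lines
        label = row[0].strip()
        try:
            int(label)
        except ValueError:
            bad.append((row_number, label))
    return bad


# Hypothetical two-row sample; for a real file, pass
# csv.reader(open(path, encoding="utf-8-sig")) instead.
sample = io.StringIO('PI,"Name: Jane Doe"\n1,Responsibilities:\n')
problems = find_non_integer_labels(csv.reader(sample))
print(problems)  # [(1, 'PI')]
```

An empty result would suggest that every label is already numeric and that the error comes from somewhere else.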
I see. Try the following code, which replaces "Personal Information" with "PI": self.labels = ["PI", "Experience", "Summary", "Education","Qualification Certification", "Skill","Object"]
class resumeProcessor(DataProcessor):
    """
    `AG News <https://arxiv.org/pdf/1509.01626.pdf>`_ is a News Topic classification dataset
    we use dataset provided by `LOTClass <https://github.com/yumeng5/LOTClass>`_

    Examples:

    .. code-block:: python

        from openprompt.data_utils.text_classification_dataset import PROCESSORS

        base_path = "datasets/TextClassification"
        dataset_name = "agnews"
        dataset_path = os.path.join(base_path, dataset_name)
        processor = PROCESSORS[dataset_name.lower()]()
        trainvalid_dataset = processor.get_train_examples(dataset_path)
        test_dataset = processor.get_test_examples(dataset_path)

        assert processor.get_num_labels() == 4
        assert processor.get_labels() == ["World", "Sports", "Business", "Tech"]
        assert len(trainvalid_dataset) == 120000
        assert len(test_dataset) == 7600
        assert test_dataset[0].text_a == "Fears for T N pension after talks"
        assert test_dataset[0].text_b == "Unions representing workers at Turner Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul."
        assert test_dataset[0].label == 2
    """

    def __init__(self):
        super().__init__()
        self.labels = ["PI", "Experience", "Summary", "Education","Qualification Certification", "Skill","Object"]

    def get_examples(self, data_dir, split):
        path = os.path.join(data_dir, "{}.csv".format(split))
        examples = []
        with open(path, encoding="utf-8-sig") as f:
            reader = csv.reader(f, delimiter=',')
            for idx, row in enumerate(reader):
                label, headline = row
                text_a = headline.replace('\\', ' ')
                # text_b = body.replace('\\', ' ')
                # print(label)
                example = InputExample(guid=str(idx), text_a=text_a, label=int(label)-1)
                examples.append(example)
        return examples
The problem is that the dataset uses "PI" while the input code uses "Personal Information".
Yes, I had tried changing Personal Information to PI, Summary to Sum, and so on in the labels, but I am still getting the same error:
ValueError: invalid literal for int() with base 10: 'PI'
I see. I'm very sorry: I forgot that I changed the labels in the dataset directly, from label words to numbers, like this:
1,Responsibilities:
1,Met with users to generate and review business use cases. Assessed the status of the organization to determine the scope of the validation process.
1,"Prepared requirements document for Commercial Auto, Inland Marine, Crime, Worker’s Compensation, Umbrella, Business Owners Policy, Commercial Output Policy, and Commercial Property Package."
1,"Also, responsible for managing communication and expectations of system vendor, the former parent company IT and business departments, and Allied Worlds various business units (underwriting, claims, reinsurance, actuary, accounting, and IT)"
1,"Created the configuration document for custom setup for various user groups such as HR, marketing, R&D & sales, research analyst & investigators."
1,Created use cases to depict the interaction between the various actors and the system. Facilitated collection of
1,"Tested HIPAA Gateway Application Interface for all inbound and outbound messages (Healthcare Eligibility 270 and 271, Healthcare Claim Status request 276 and 277, Healthcare Claim 837 and 835)"
1,"Involved in detailing project mission, Data Process Flow Diagrams and timelines. Defined business Use Cases and activity diagrams to represent different workflows and associations."
1,Worked with the compliance group to make sure that the electronic data was CFR part 11 compliant.
1,"Gathered requirements by using interviews, requirement workshops and brainstorming sessions."
1,"Acted as a liaison between business staff and technical staff to articulate needs, issues and concerns as per GLP in LabWare LIMS & Pre-Clinical Phases (electronic laboratory notebook) & data migration issues."
1,Designed and developed project document templates based on SDLC methodology
1,"Documented all aspects of the computer system validation lifecycle, in accordance with FDA regulation which includes validation plan and protocol, Installation Qualification (IQ), Operational Qualification (OQ) and specification performance.
Worked in Healthcare HIPAA ICD 9-CM to ICD 10-CM rule set migration."
1,Responsible for analyzing the current system and followed the development of a J2EE based application through various iterations of all phases of the Rational Unified Process (RUP).
1,Validate test plans/scripts and perform final reviews of test results.
1,Used use case diagram during analysis to capture requirements. Conducted acceptance tests to verify that the validation effort was complete
1,Developed strategies with Quality Assurance group to implement Test Cases in Mercury Test Director for stress testing and UAT (User Acceptance Testing).
1,"Environment: Rational Rose, UML, Java, RUP, Windows XP, Rational RequisitePro, Microsoft Office tools, MS Project, SQL"
1,"Company: Biological.E.Ltd, IND"
1,Position: Business Analyst June 2010 – July 2011
1,"The company provides a variety of personal insurance products, including Auto insurance, Homeowners insurance, Marine Coverage’s, Personal liability insurance, and life policies (Life insurance)."
1,Project: Online Account Access system
1,"The project was to develop a web-based application relating to a comprehensive online request for auto insurance and health insurance quote processing. The system runs on Mainframe and has a web-integrated front-end that provides free auto insurance quotes to individuals and for families. This project is a web-based application which allows the customers to pay the bills online, get an online quote, report a claim, view policy, view the claim status and verify the account balances etc."
1,Responsibilities:
Here, 1 means the Experience label.
So you need to change the seven class labels to numbers manually in the dataset file, i.e. the .csv file:
['Exp','PI','Sum','Edu','QC','Skill','Obj'] -> 1 2 3 4 5 6 7
In the dataset, Exp becomes 1, PI becomes 2, and so on.
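Instead of editing the file by hand, the same relabeling could be scripted. The sketch below uses the mapping listed above and assumes the CSV's first column holds exactly those abbreviations (only 'PI' is confirmed by the error message). The helper name convert_labels and the sample rows are hypothetical.

```python
import csv
import io

# Mapping from the list above: Exp -> 1, PI -> 2, ..., Obj -> 7.
LABEL_TO_ID = {"Exp": 1, "PI": 2, "Sum": 3, "Edu": 4, "QC": 5, "Skill": 6, "Obj": 7}


def convert_labels(rows, mapping=LABEL_TO_ID):
    """Yield rows with the first column replaced by its numeric ID."""
    for row in rows:
        if not row:
            continue  # skip blank lines
        label = row[0].strip()
        if label.isdigit():
            yield row  # already numeric, keep unchanged
        elif label in mapping:
            yield [str(mapping[label])] + row[1:]
        else:
            raise KeyError(f"unknown label {label!r}")


# Hypothetical sample; for real files, open them with newline="" and
# encoding="utf-8-sig" and pass the file objects to csv.reader / csv.writer.
src = io.StringIO("Exp,Responsibilities:\nPI,Hypothetical personal-information line\n")
dst = io.StringIO()
csv.writer(dst, lineterminator="\n").writerows(convert_labels(csv.reader(src)))
print(dst.getvalue())
```

Raising on an unknown label, rather than skipping the row, is a deliberate choice in this sketch so that spelling variants in the dataset are noticed early.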
Thank you so much, now I'm able to load the data. I made the changes directly in the CSV file.
For the next steps, I have a few questions:
Should we do the train/test split on the dataset beforehand? I could not find the code for the train/test split.
While loading the PLM, pytorch_model/pytorch_roberta_large is not present on Hugging Face, so I'm getting an error:
OSError: pytorch_model/pytorch_roberta_large is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models' If this is a private repository, make sure to pass a token having permission to this repo either by logging in with huggingface-cli login or by passing token=<your_token>
Where can I download this from?
Can you help me with this?
Thank you, Niharika
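On the train/test split question: get_examples in the processor above reads "{split}.csv" from data_dir, so one plausible setup is to split the labeled CSV beforehand into train.csv and test.csv. The sketch below assumes that setup. The 80/20 ratio, the seed, the helper names, and the sample rows are all hypothetical, and the split names "train" and "test" are an assumption about what OpenPrompt's DataProcessor passes in.

```python
import csv
import os
import random


def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle rows reproducibly and return (train_rows, test_rows)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]


def write_csv(path, rows):
    """Write rows without a header, matching the label,text layout."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)


# Hypothetical rows standing in for the full labeled CSV.
rows = [[str(i % 7 + 1), f"hypothetical line {i}"] for i in range(10)]
train_rows, test_rows = split_rows(rows)

# With a real dataset directory (data_dir is a placeholder name):
# write_csv(os.path.join(data_dir, "train.csv"), train_rows)
# write_csv(os.path.join(data_dir, "test.csv"), test_rows)
```

A fixed seed keeps the split reproducible between runs. A stratified split per label may be preferable if the seven classes are imbalanced.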
Thanks for the clarification on the train/test split.
I used the roberta-base model directly from Hugging Face.
I'm getting an error:
ValueError: list.remove(x): x not in list
at this line
What could be my mistake here?
Sorry, I tried to figure out the issue but failed. Maybe you could try asking ChatGPT?
Hello,
Thank you for releasing the code. I am trying to recreate your results.
In prompt roberta, the resumeProcessor code is not given.
Could you please provide this code?
Thank you, Niharika Joshi