ganchengguang / resume-IE-via-prompt


resumeProcessor code #1

Open Niharikajo opened 8 months ago

Niharikajo commented 8 months ago

Hello,

Thank you for releasing the code. I am trying to recreate your results.

In the RoBERTa prompt code, the resumeProcessor code is not given.

Could you please provide this code?

Thank you, Niharika Joshi

ganchengguang commented 8 months ago

Hi, thanks for your attention. I think the code is correct. If you get any error, let me know and I will check. Also, do you know how to install and use the OpenPrompt library?

Niharikajo commented 8 months ago

Thank you for replying,

I cloned the OpenPrompt GitHub repo. I am using Google Colab for execution. After installing the requirements, I tried to run the RoBERTa prompt code, but data_utils.text_classification_dataset does not contain the resumeProcessor class, so I was getting a ModuleNotFoundError.

Could you please let me know what changes I need to make?

Thank you

ganchengguang commented 8 months ago

OK, I got it. I customized a few classes and functions in the OpenPrompt source code so that OpenPrompt can work with the resume dataset format. I will upload that part of the code; I need some time to find it, because it has been quite a while. Please wait about half an hour. If you can customize the OpenPrompt framework source code yourself, you can also do that on your own.

ganchengguang commented 8 months ago

resume-IE-via-prompt/OpenPromtCustomSourceCode

Hi Niharikajo, I uploaded a new code file. You can follow the instructions in the file to replace the OpenPrompt framework's source code so that it can handle the seven-class resume dataset format.

ganchengguang commented 8 months ago

If you have gone through it, please let me know. Or if you encounter another error, you can also ask me.

Niharikajo commented 8 months ago

Thank you for updating the file so soon. I made the changes you mentioned.

I'm getting an error: ValueError: invalid literal for int() with base 10: 'PI'

I tried to encode the labels into integers, but the error remains.

Can you please let me know how to resolve this error?

Thank you

ganchengguang commented 8 months ago

Sorry for the late response. I think this error is caused by the dataset's 'PI' label. Maybe you should check the dataset format and the input code, or locate the line that raises the error and check it line by line.

Niharikajo commented 8 months ago

I'm using the resume seven-class dataset. The line raising the error is: example = InputExample(guid=str(idx), text_a=text_a, label=int(label)-1)

Any suggestions on what I should try?

Thank you

ganchengguang commented 8 months ago

I got it. Try the following: replace "Personal Information" with "PI" in the label list:

    self.labels = ["PI", "Experience", "Summary", "Education", "Qualification Certification", "Skill", "Object"]

ganchengguang commented 8 months ago

Replace the following code in site-packages\openprompt\data_utils\text_classification_dataset.py:

class resumeProcessor(DataProcessor):
    """
    Processor for the seven-class resume dataset, adapted from the AG News
    processor (`AG News <https://arxiv.org/pdf/1509.01626.pdf>`_, dataset
    provided by `LOTClass <https://github.com/yumeng5/LOTClass>`_).

    Examples:

    ..  code-block:: python

        from openprompt.data_utils.text_classification_dataset import PROCESSORS

        base_path = "datasets/TextClassification"

        dataset_name = "resume"  # the key this processor is registered under in PROCESSORS
        dataset_path = os.path.join(base_path, dataset_name)
        processor = PROCESSORS[dataset_name.lower()]()
        trainvalid_dataset = processor.get_train_examples(dataset_path)
        test_dataset = processor.get_test_examples(dataset_path)

        assert processor.get_num_labels() == 7
    """

    def __init__(self):
        super().__init__()
        self.labels = ["PI", "Experience", "Summary", "Education", "Qualification Certification", "Skill", "Object"]

    def get_examples(self, data_dir, split):
        # expects {split}.csv with rows of the form <numeric label>,<text>
        path = os.path.join(data_dir, "{}.csv".format(split))
        examples = []
        with open(path, encoding="utf-8-sig") as f:
            reader = csv.reader(f, delimiter=',')
            for idx, row in enumerate(reader):
                label, headline = row
                text_a = headline.replace('\\', ' ')
                # text_b = body.replace('\\', ' ')
                # print(label)
                # labels are stored as 1-7 in the CSV and shifted to 0-6 here
                example = InputExample(guid=str(idx), text_a=text_a, label=int(label)-1)
                examples.append(example)
        return examples

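One extra note, not from the original instructions: OpenPrompt's text_classification_dataset.py keeps a PROCESSORS dict at the bottom of the file, so if your script looks the processor up by name, you may also need to register the new class there. The "resume" key below is only an assumed name:

    # register the new processor; "resume" is an assumed key, use whatever name your script looks up
    PROCESSORS["resume"] = resumeProcessor
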
ganchengguang commented 8 months ago

This problem happens because the dataset uses 'PI', but the input code was using 'Personal Information'.

Niharikajo commented 8 months ago

Yes, I had tried changing 'Personal Information' to 'PI', 'Summary' to 'Sum', and so on in the labels, but I am still getting the same error: ValueError: invalid literal for int() with base 10: 'PI'

ganchengguang commented 8 months ago

I got it. I'm very sorry, I forgot that I directly changed the dataset labels from label words to numbers, like this:

    1,Responsibilities:
    1,Met with users to generate and review business use cases. Assessed the status of the organization to determine the scope of the validation process.
    1,"Prepared requirements document for Commercial Auto, Inland Marine, Crime, Worker’s Compensation, Umbrella, Business Owners Policy, Commercial Output Policy, and Commercial Property Package."
    1,"Also, responsible for managing communication and expectations of system vendor, the former parent company IT and business departments, and Allied Worlds various business units (underwriting, claims, reinsurance, actuary, accounting, and IT)"
    1,"Created the configuration document for custom setup for various user groups such as HR, marketing, R&D & sales, research analyst & investigators."
    1,Created use cases to depict the interaction between the various actors and the system. Facilitated collection of
    1,"Tested HIPAA Gateway Application Interface for all inbound and outbound messages (Healthcare Eligibility 270 and 271, Healthcare Claim Status request 276 and 277, Healthcare Claim 837 and 835)"
    1,"Involved in detailing project mission, Data Process Flow Diagrams and timelines. Defined business Use Cases and activity diagrams to represent different workflows and associations."
    1,Worked with the compliance group to make sure that the electronic data was CFR part 11 compliant.
    1,"Gathered requirements by using interviews, requirement workshops and brainstorming sessions."
    1,"Acted as a liaison between business staff and technical staff to articulate needs, issues and concerns as per GLP in LabWare LIMS & Pre-Clinical Phases (electronic laboratory notebook) & data migration issues."
    1,Designed and developed project document templates based on SDLC methodology
    1,"Documented all aspects of the computer system validation lifecycle, in accordance with FDA regulation which includes validation plan and protocol, Installation Qualification (IQ), Operational Qualification (OQ) and specification performance. Worked in Healthcare HIPAA ICD 9-CM to ICD 10-CM rule set migration."
    1,Responsible for analyzing the current system and followed the development of a J2EE based application through various iterations of all phases of the Rational Unified Process (RUP).
    1,Validate test plans/scripts and perform final reviews of test results.
    1,Used use case diagram during analysis to capture requirements. Conducted acceptance tests to verify that the validation effort was complete
    1,Developed strategies with Quality Assurance group to implement Test Cases in Mercury Test Director for stress testing and UAT (User Acceptance Testing).
    1,"Environment: Rational Rose, UML, Java, RUP, Windows XP, Rational RequisitePro, Microsoft Office tools, MS Project, SQL"
    1,"Company: Biological.E.Ltd, IND"
    1,Position: Business Analyst June 2010 – July 2011
    1,"The company provides a variety of personal insurance products, including Auto insurance, Homeowners insurance, Marine Coverage’s, Personal liability insurance, and life policies (Life insurance)."
    1,Project: Online Account Access system
    1,"The project was to develop a web-based application relating to a comprehensive online request for auto insurance and health insurance quote processing. The system runs on Mainframe and has a web-integrated front-end that provides free auto insurance quotes to individuals and for families. This project is a web-based application which allows the customers to pay the bills online, get an online quote, report a claim, view policy, view the claim status and verify the account balances etc."
    1,Responsibilities:

The 1 means the Experience label.

ganchengguang commented 8 months ago

So you need to change the seven class labels to numbers manually in the dataset file, i.e. the .csv.

ganchengguang commented 8 months ago

['Exp','PI','Sum','Edu','QC','Skill','Obj'] map to 1 2 3 4 5 6 7 respectively.

ganchengguang commented 8 months ago

In the dataset: Exp to 1, PI to 2, ... like this.
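
If it helps, here is a minimal sketch of that conversion; the file names and the exact label spellings in the mapping are assumptions based on the list above:

    import csv

    # assumed mapping from label word to number, following the list above
    LABEL_TO_ID = {"Exp": 1, "PI": 2, "Sum": 3, "Edu": 4, "QC": 5, "Skill": 6, "Obj": 7}

    # hypothetical file names; point these at your own resume CSV files
    with open("resume_labeled.csv", encoding="utf-8-sig", newline="") as fin, \
         open("train.csv", "w", encoding="utf-8", newline="") as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        for label, text in reader:
            # replace the label word with its number so that int(label) works in resumeProcessor
            writer.writerow([LABEL_TO_ID[label], text])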

Niharikajo commented 8 months ago

Thank you so much, now I'm able to load the data. I made the changes directly in the CSV file.

In the next steps, I have a few doubts:

  1. Should we do the train/test split on the dataset beforehand? I could not find the code for the train/test split.

  2. While loading the PLM, pytorch_model/pytorch_roberta_large is not present on Hugging Face, so I'm getting an error: OSError: pytorch_model/pytorch_roberta_large is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'. If this is a private repository, make sure to pass a token having permission to this repo either by logging in with huggingface-cli login or by passing token=<your_token>. Where can I download this from?

Can you help me with this?

Thank you, Niharika

ganchengguang commented 8 months ago
  1. Yes, you should manually split the dataset into two CSV files (in my work, I split it into two sets: the train set is 50,000 samples and the test set is 30,000 samples). Then you should randomly sample 50 shots from the 50k train set, like this: sampler = FewShotSampler(num_examples_total=50, also_sample_dev=False)
  2. You should download all of the model files into a local folder from the Hugging Face website, then load from that folder with from_pretrained('model_folder_name'): plm, tokenizer, model_config, WrapperClass = load_plm("roberta", "pytorch_model/pytorch_roberta_large"). See the sketch after this list.
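
Putting both points together, here is a rough sketch; the folder layout, the "resume" processor key, and the sampling seed are assumptions, while FewShotSampler and load_plm are the OpenPrompt utilities mentioned above:

    from openprompt.plms import load_plm
    from openprompt.data_utils.data_sampler import FewShotSampler
    from openprompt.data_utils.text_classification_dataset import PROCESSORS

    # assumed layout: train.csv (50k samples) and test.csv (30k samples) in this folder
    dataset_path = "datasets/TextClassification/resume"
    processor = PROCESSORS["resume"]()  # assumes the processor was registered under "resume"
    train_dataset = processor.get_train_examples(dataset_path)
    test_dataset = processor.get_test_examples(dataset_path)

    # randomly sample 50 examples in total from the training set
    sampler = FewShotSampler(num_examples_total=50, also_sample_dev=False)
    train_dataset = sampler(train_dataset, seed=42)

    # pass either a local folder with the downloaded model files or a Hugging Face model id such as "roberta-large"
    plm, tokenizer, model_config, WrapperClass = load_plm("roberta", "pytorch_model/pytorch_roberta_large")
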
Niharikajo commented 8 months ago

Thanks for the clarification on the train/test split.

I used the roberta-base model directly from Hugging Face.

I'm getting an error: ValueError: list.remove(x): x not in list at this line:

[screenshot of the error line]

What could my mistake be here?

ganchengguang commented 8 months ago

Sorry, I tried to figure out the issue but failed. Maybe you can ask ChatGPT?