TimeLessLing / DPSG-code


Update instruction to use the code #1

Open cyr19 opened 1 year ago

cyr19 commented 1 year ago

Hello,

Thanks for your great work!

I'm very interested in your approach and would like to run it on the UD treebanks. Could you add some more instructions to the README file?

Best,

raypretam commented 10 months ago

Thanks for the great work. I'm quite impressed by the idea of treating dependency parsing as a seq2seq task. I am working on related work, but for low-resource languages. Could you add some instructions to the README file for the UD datasets?

Thanks and Regards.

TimeLessLing commented 10 months ago

Thank you very much for your attention and recognition of our work. I am also very sorry: since my mailbox receives many emails, I missed the notification email from GitHub in May, and I only noticed this issue after receiving the reminder email for the comment from @raypretam.

For UD, the basic processing is exactly the same as for PTB. You only need to replace plm_path in run_train.sh with the path to your mT5 checkpoint, and set model_type, model_name, and tokenizer_name to mt5-base (see the sketch below). As for the specific configuration file, I have uploaded an example for Bulgarian (bg), located at sydp/ud_bg.py. You can refer to this example and make the corresponding modifications for whichever UD language you choose. Note that the part-of-speech tags used in this example are UPOS tags, which are universal and should therefore work for all UD data.
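
For concreteness, here is a minimal sketch (assuming the HuggingFace transformers library, not the repo's actual training code) of what those settings correspond to; the checkpoint path is a placeholder:

```python
# Minimal sketch of the mT5 settings described above, using HuggingFace
# transformers; NOT the repo's actual code. The path is a placeholder.
from transformers import AutoTokenizer, MT5ForConditionalGeneration

plm_path = "/path/to/mt5-base"  # the value to put in plm_path in run_train.sh

# corresponds to model_name / tokenizer_name = mt5-base in the configuration
tokenizer = AutoTokenizer.from_pretrained(plm_path)
model = MT5ForConditionalGeneration.from_pretrained(plm_path)
```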

As for data preprocessing, it follows the same procedure as for PTB: the treebank is converted into the sequence format shown in the paper. A rough sketch of such a conversion is given below.
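
Since the repo's own preprocessing scripts are the authority here, the following is only a hypothetical sketch of converting one CoNLL-U sentence into an (input, output) pair; the actual linearization in the paper may differ, and the output format below (token, UPOS, head index, relation) is an assumption:

```python
# Hypothetical sketch of turning one CoNLL-U sentence into an
# input/output pair; the real linearization in the paper may differ.

def conllu_sentence_to_example(lines):
    """lines: the token lines of one CoNLL-U sentence (comments removed)."""
    tokens, targets = [], []
    for line in lines:
        cols = line.strip().split("\t")
        if "-" in cols[0] or "." in cols[0]:  # skip multiword/empty tokens
            continue
        # CoNLL-U columns: 0=ID, 1=FORM, 3=UPOS, 6=HEAD, 7=DEPREL
        form, upos, head, deprel = cols[1], cols[3], cols[6], cols[7]
        tokens.append(form)
        # assumed target format: token, UPOS tag, head index, relation
        targets.append(f"{form} {upos} {head} {deprel}")
    return {"input": " ".join(tokens), "output": " ; ".join(targets)}
```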

TimeLessLing commented 10 months ago

> Thanks for the great work. I'm quite impressed by the idea of treating dependency parsing as a seq2seq task. I am working on related work, but for low-resource languages. Could you add some instructions to the README file for the UD datasets?

Thank you for your attention; I have replied to this issue in the main reply above.

raypretam commented 10 months ago

The dataset used in the repo is a JSON file with 'input' and 'output' attributes, not a CoNLL-U file. How do you generate it? This paper builds on the previous works of Ma and Gan, and they never used JSON files as their inputs; their preprocessed inputs were always in CoNLL-U format. Am I missing something? Please help me with this. Could you provide a sample JSON file?

Thanks and Regards.

TimeLessLing commented 10 months ago

> The dataset used in the repo is a JSON file with 'input' and 'output' attributes, not a CoNLL-U file. How do you generate it? This paper builds on the previous works of Ma and Gan, and they never used JSON files as their inputs; their preprocessed inputs were always in CoNLL-U format. Am I missing something? Please help me with this. Could you provide a sample JSON file?

Hi, I have uploaded a data file for bg_btb-dev.conllu from UD 2.2, along with the related data-processing files.
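
For later readers, here is a hedged sketch of a driver that applies a per-sentence converter (such as the hypothetical conllu_sentence_to_example from the earlier sketch) over a .conllu file and writes JSON records with "input"/"output" keys; the file names are placeholders, and the uploaded processing files in the repo should be treated as the reference:

```python
# Hypothetical driver: read a UD .conllu file and write one JSON record
# per sentence with "input"/"output" keys. File names are placeholders.
import json

def read_conllu_sentences(path):
    with open(path, encoding="utf-8") as f:
        sent = []
        for line in f:
            line = line.rstrip("\n")
            if not line:                    # blank line ends a sentence
                if sent:
                    yield sent
                    sent = []
            elif not line.startswith("#"):  # skip comment lines
                sent.append(line)
        if sent:
            yield sent

examples = [conllu_sentence_to_example(s)
            for s in read_conllu_sentences("bg_btb-dev.conllu")]
with open("bg_btb-dev.json", "w", encoding="utf-8") as f:
    json.dump(examples, f, ensure_ascii=False, indent=2)
```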