calzada / PARLAMINT-ES-MC


seg ID parents #14

Closed calzada closed 3 years ago

calzada commented 3 years ago

Dear Tomaz, I have asked for help from two very talented NLP experts. Luciana asks me this:

"How are the seg ID parents created (as in )? I've been trying to use an adapted version of classlisize.py for Spanish, and I noticed it requires a parent ID. Are these IDs created randomly, or are they listed somewhere?

Sorry if this information is already in the documentation."


Whenever you can answer, I will forward the information to them, or I can give both of them rights to work here. Best for now, mc

TomazErjavec commented 3 years ago

It seems like the example is missing from your question. If the question was "how to create IDs of segments", I've done this now in 6d9d23e (should've done it before!). If the question was "how are the IDs of u elements created", well, they just go one after the other, e.g.

<u xml:id="ParlaMint-ES_2020-12-15-CD201215.u1" ...>
...
 <u xml:id="ParlaMint-ES_2020-12-15-CD201215.u2" ...>

but that much is obvious by just looking at some file.
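
For anyone scripting this, the sequential numbering above could be sketched roughly as follows. The function name `utterance_ids` is hypothetical and not part of any ParlaMint script; the document ID is the one from the example, and numbering from 1 follows what the example shows.

```python
def utterance_ids(doc_id, count):
    """Return sequential utterance IDs: doc_id.u1, doc_id.u2, ..., doc_id.u<count>.

    Hypothetical helper for illustration only; it just mirrors the
    pattern visible in the example above.
    """
    return [f"{doc_id}.u{n}" for n in range(1, count + 1)]


ids = utterance_ids("ParlaMint-ES_2020-12-15-CD201215", 2)
print(ids)
```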

Pls. note that we are not quite finished with the corpus; there are still some things to improve. I haven't found the time yet, as I have had to look after the corpora for lots of other languages. Of course, they can already set up the annotation pipeline, and it should be simple to run it again once we have the base corpus ready.

calzada commented 3 years ago

Excellent, I will circulate your answer, and do not worry. We will wait as long as you need, since we realize you are extremely busy. I am so embarrassed to be giving you so much work that I am trying to take some of the load off you. Please let me know if you need anything from us. Have a nice weekend, great Tomaz https://www.youtube.com/watch?v=92wf6LM8wh8

Best for now, mc

calzada commented 3 years ago

Dear Tomaz, another email by Luciana:

Hi Tomaž,

I actually meant the id that every seg has. For example, from the sample: https://github.com/clarin-eric/ParlaMint/blob/a1110008eae5bc837d111bf46aa405671948fd13/ParlaMint-PL/ParlaMint-PL_2015-11-12-senat-01-1.ana.xml#L1735

I noticed that the script needs one parent ID to create the following word IDs so that we can display the dependency relations https://github.com/clarin-eric/ParlaMint/blob/a1110008eae5bc837d111bf46aa405671948fd13/ParlaMint-PL/ParlaMint-PL_2015-11-12-senat-01-1.ana.xml#L1779.

I'd like to know how this id="segXXXXXX" is created.

I intend to use an adaptation of https://github.com/clarin-eric/ParlaMint/blob/a1110008eae5bc837d111bf46aa405671948fd13/Scripts/classlisize.py to create the other child ids. For instance, , , and so on.

Thank you so much again, Luciana.

Best mc

TomazErjavec commented 3 years ago

"I'd like to know how this id="segXXXXXX" is created."

Well, it doesn't really matter, as long as the seg id is unique in the corpus. But, for ES, I just appended .$n to the u ID: https://github.com/calzada/PARLAMINT-ES-MC/blob/6d9d23e8d1fc88cfc3e8db9065954c1b6e19e7cc/ParlaMint/ParlaMint-ES_2015-01-20-CD150120.xml#L101
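
A minimal sketch of that scheme, in case it helps with the pipeline. It is not the script actually used for the ES corpus: the function name `add_seg_ids`, the sample text, and the choice to start `$n` at 1 are all assumptions, as is the layout with `seg` elements as direct children of `u` in the TEI namespace.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def add_seg_ids(root):
    """Give each <seg> the ID of its parent <u> plus '.<n>'.

    Hypothetical helper; n starts at 1 here, which is an assumption.
    """
    for u in root.iter(f"{{{TEI_NS}}}u"):
        u_id = u.get(XML_ID)
        if u_id is None:
            continue  # no parent ID to build on
        # Only direct <seg> children of this <u> are numbered.
        for n, seg in enumerate(u.findall(f"{{{TEI_NS}}}seg"), start=1):
            seg.set(XML_ID, f"{u_id}.{n}")
    return root


# Invented sample content, for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <u xml:id="ParlaMint-ES_2020-12-15-CD201215.u1"><seg>Uno.</seg><seg>Dos.</seg></u>
  <u xml:id="ParlaMint-ES_2020-12-15-CD201215.u2"><seg>Tres.</seg></u>
</TEI>"""

root = add_seg_ids(ET.fromstring(SAMPLE))
seg_ids = [seg.get(XML_ID) for seg in root.iter(f"{{{TEI_NS}}}seg")]
print(seg_ids)
```

Because every u ID is already unique in the corpus, IDs built this way stay unique too, which is the only real requirement.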

@calzada, if you have XML tags in the text of the issue, you need to put them in backticks (inverted apostrophes), like `<u>`; if you just write the tag, as in , strange things happen, and I then don't see the examples. Have a look at the Markdown guide, Inline code.

calzada commented 3 years ago

OK, Tomaz. Noted. And thanks for your help.

Best for now, mc

TomazErjavec commented 3 years ago

And this has been settled as well.