calzada / PARLAMINT-ES-MC


CD files added #7

Closed calzada closed 3 years ago

calzada commented 3 years ago

I have now added the final CD (Spanish Congress) files for 2016-2020. I hope this is what you wanted. I made sure they are valid with xmllint -valid -noout *.xml.
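For anyone repeating this check over many files, here is a minimal sketch of how the same validation could be scripted so that only the failing files are listed. It assumes `xmllint` is on the PATH and uses the long-option spelling `--valid --noout`; the function names `xmllint_command` and `invalid_files` are hypothetical and are not part of this repository.

```python
import subprocess


def xmllint_command(path):
    """Build the xmllint call that validates one file against its DTD."""
    return ["xmllint", "--valid", "--noout", str(path)]


def invalid_files(paths, run=subprocess.run):
    """Return the paths for which xmllint exits with a non-zero status.

    `run` is injectable so the logic can be exercised without xmllint installed.
    """
    failed = []
    for path in paths:
        result = run(xmllint_command(path), capture_output=True, text=True)
        if result.returncode != 0:
            failed.append(str(path))
    return failed
```

Called as `invalid_files(sorted(pathlib.Path("CD").glob("*.xml")))`, this would return an empty list when every file validates; the `CD` directory name is taken from this thread, the rest is an assumption.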

TomazErjavec commented 3 years ago

Thanks, yes, this is exactly what I wanted. So, if I understand correctly, this is the final set of files you will have for now? I've already added them to the generation procedure, more on this shortly.

PLEASE GIVE ME AN ADDITIONAL (MAYBE EVEN LESS) DAY AND MAYBE I CAN INCLUDE 2015. THEN I WILL DO THE SAME AS THE REST OF THE TEAM.

TomazErjavec commented 3 years ago

OK, sure. Even now I have to modify my scripts to deal with the new data, e.g. who would've thought that you use Mª for Maria, or 00000000 for an unknown birth date (in addition to the standard UNKNOWN), but I am coping so far!

PS: pls. don't edit my comment, but write a new one; otherwise I am not notified. It was only by chance that I saw this.
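A minimal sketch of the kind of normalization described here, not the actual conversion scripts: it expands the abbreviation Mª and treats both 00000000 and UNKNOWN as a missing birth date. The function names, the accented expansion "María", and the choice of `None` for a missing date are assumptions made for illustration.

```python
import re
from typing import Optional

# Markers that stand for "birth date not known" in the source data.
UNKNOWN_BIRTH_MARKERS = {"", "UNKNOWN", "00000000"}


def normalize_name(name: str) -> str:
    """Expand the abbreviation 'Mª' (or 'M.ª') to 'María' (assumed spelling)."""
    return re.sub(r"\bM\.?ª", "María", name)


def normalize_birth_date(value: str) -> Optional[str]:
    """Return None for any unknown-date marker, else the value unchanged."""
    value = value.strip()
    if value in UNKNOWN_BIRTH_MARKERS:
        return None
    return value
```

Mapping every unknown marker to a single value keeps the later conversion steps from having to know about each variant.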

calzada commented 3 years ago

Excellent, Tomaz. Thank you sooooo much. And from now on I will write new comments. You are teaching me so much that I have already started calling you Sensei!! I think (keep your fingers crossed) I will have 2015 ready as well. Then Spain will be a normal country!!! ;-) Best for now

TomazErjavec commented 3 years ago

OK, the latest commit, 4d5106a538e6, has ParlaMint files on the basis of the latest CD, and the previous one, 0997b15df3db, has lots of changes to the conversion. Now it (almost) validates, but there are still things that need to be taken into account, fixed and also manually added - but more about that later, for now I am happy that it validates as much as it does!

calzada commented 3 years ago

Excellent. I am working hard to add 2015. I am getting there. Thanks for all your work!! mc P.S. A small gift for all your help: https://www.youtube.com/watch?v=wDjeBNv6ip0

TomazErjavec commented 3 years ago

I am working hard to add 2015. I am getting there.

very good!

A small gift

Thank you, nice one!

calzada commented 3 years ago

Tomaz, could you please give me until Thursday morning? I think I will finish by then. I am very tired now and am going to bed. But I will finish tomorrow by 15.00 hours. Good night!! mc

calzada commented 3 years ago

TOMAAAAAAAZZZZZ, JUST IN TIME!!!

All files ready!! From 2015 to 2020, like the rest of the teams. And I have done everything from the end of January to now. I feel proud. You can now proceed with TEI conversion and the rest.

ALL FILES IN PARLAMINT-ES-MC/CD directory.

Could you please put a handful of these converted files in the Parlamint-ES repo (together with the TEI root and all that)? By the way, I can see my files in calzada/Parlamint, but I cannot see anything in clarin/Parlamint. Maybe this is the way it should be.

At any rate, we still have work to do to be ready for 24th March; UD and NER (and I really do not know what to do). I hope you can guide me.

Thanks for all this. BIIIIG THANKS. mc

TomazErjavec commented 3 years ago

All files ready!!

Great, you are a hero (actually, heroine:)!

I ran the conversion again, and there are only 6 files that have validation problems: https://github.com/calzada/PARLAMINT-ES-MC/blob/f33f4d4bd7fced40c3900cd600a0822ae0e0fad2/log.txt#L735-L740

I think the reason is the same in all the cases: at the end of the original file there is a heading, but no intervention follows it. This gets converted to a div with a head but without any speech, which is illegal.

Could you have a look at the original CD files and fix them? I think that you can either delete the final heading (if it contains junk), or change it to omit (if it contains some sort of note), and it should be ok. Let me know when you have done this, and then we close this issue and go on to the remaining things - not many, I think.
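To locate such cases in the converted output, a minimal sketch along these lines could list every div that has a head but no speech. It assumes the converted files use the TEI namespace and encode speeches as `u` elements, as is usual in ParlaMint; the function name `headed_divs_without_speech` is hypothetical and this is not the check used by the conversion itself.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def headed_divs_without_speech(xml_text):
    """Return the heading text of each <div> that has a <head> but no <u>."""
    root = ET.fromstring(xml_text)
    problems = []
    for div in root.iter(f"{{{TEI_NS}}}div"):
        head = div.find(f"{{{TEI_NS}}}head")
        # Look for a speech anywhere below this div, not only as a direct child.
        has_speech = div.find(f".//{{{TEI_NS}}}u") is not None
        if head is not None and not has_speech:
            problems.append((head.text or "").strip())
    return problems
```

The returned heading texts make it easier to find the matching trailing heading in the original CD file and then delete it or change it as described above.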

Could you please put a handful of these converted files in Parlamint-ES repo

I would wait a bit, until we have the basic corpus finished, so I don't need to do it too many times.

But I cannot see anything in clarin/Parlamint. Maybe this is the way it should be.

You mean you cannot see https://github.com/clarin-eric/ParlaMint/tree/main/ParlaMint-ES ? Yes, because it isn't there yet.

At any rate, we still have work to do to be ready for 24th March; UD and NER

Yes, but let's get the base corpus ok first. Then we will see how to do the linguistic annotation.

calzada commented 3 years ago

Have a look now, Tomaz. Sorry I could not do it before. Please let me know if I have done it properly. I am a bit tired now. Also, since I have not mastered UD and NER, I am trying to find help. When things are ready with TEI, do you think I can get an extension (maybe an extra week or something like this)? It is proving a bit difficult to find someone, but I will. Best for now and a BIG thank you. This is especially to thank you: https://www.youtube.com/watch?v=xHK17lJcXtM mc