calzada closed 3 years ago
Dear @calzada, nice to hear you are happy! To make you even more so (I hope), I have now put a sample of your files in our official ParlaMint GitHub repository; you can see it at https://github.com/clarin-eric/ParlaMint/tree/main/ParlaMint-ES
As for your questions:
You told us to beware because the Parlamint-ES corpus was not totally finished the way you want to finish it. So we are in no hurry of course but as soon as you are ready let us know.
Well, the corpus structure won't change, maybe just some stuff with metadata, which does not affect the linguistic annotation. So, if they start now (maybe first with the sample) and they develop the pipeline, it will be a simple matter to just re-run the complete pipeline on the finished corpus. So, no need to wait for me!
Also, I must admit that there is so much work piled up, that I am not sure if I will be able to do more work on your corpus - I hope I will, but can't promise. Anyway, I think the corpus is "correct", except that we are losing some information that you have in the original, but I didn't manage to convert it to the ParlaMint format. Hopefully, more about this later.
Are we right in thinking we have to produce the .ana.xml version of documents? I seem to have understood that you would do the .vert files.
Yes, you are right: you have to produce the .ana.xml version; all the other derived file formats I take care of.
What else do you need to do with the Spanish corpus (after the .ana.xml version)?
Validate it and convert it to derived formats, but that is all automatic.
We were also wondering about the validation scripts, described in the schema folder? Do we need to do those as well. Which ones for the .ana.xml version?
What you need to do on your side is to have a good look at how the .ana version differs from the "plain" version (also in the teiHeaders) and make your .ana the same as the current samples. You also need to validate it with the XML schemas, as described in the readme at https://github.com/clarin-eric/ParlaMint/tree/main/Schema.
There is also the content validation with Scripts/validate-parlamint.pl, but that needs some skill to modify for your platform and run, so I can do this validation and send you the results, so that you can correct any mistakes.
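For the "have a good look" step, a small script can list which elements appear in an .ana file but not in its plain counterpart. The following is only a rough sketch, not part of the ParlaMint tooling: the function names and the two inline fragments are made up for illustration, and it does not replace validation with the XML schemas.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def element_counts(xml_text: str) -> Counter:
    """Count how often each element name occurs in an XML document."""
    root = ET.fromstring(xml_text)
    return Counter(local_name(el.tag) for el in root.iter())


def added_elements(plain_xml: str, ana_xml: str) -> dict:
    """Return elements (with counts) present in the .ana text but absent from the plain text."""
    plain = element_counts(plain_xml)
    ana = element_counts(ana_xml)
    return {name: n for name, n in ana.items() if name not in plain}


# Hypothetical, heavily simplified fragments (not taken from the real corpus).
PLAIN = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
    "<u><seg>Buenos días.</seg></u>"
    "</body></text></TEI>"
)
ANA = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
    '<u><seg><s><w lemma="bueno">Buenos</w><w lemma="día">días</w>'
    "<pc>.</pc></s></seg></u>"
    "</body></text></TEI>"
)

if __name__ == "__main__":
    for name, n in sorted(added_elements(PLAIN, ANA).items()):
        print(f"{name}: {n}")
```

To compare real files, the same functions could be fed the file contents instead of the inline strings, once for a sample pair from the repository and once for your own pair, so that the two lists can be checked against each other.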
As soon as we start, it might take us (we think) 15 days to accomplish the task. Would this be fine? (Also Luciana is about to deliver a baby. Let's hope we can finish before she is due. :-)
Let's hope! But, as I say, you can start immediately.
Dear Tomaz, Thank you so much for this. It is really informative and we are now on the right track with your instructions. Hopefully, we will deliver and fulfil your expectations. We will certainly do our best. Please, do not worry so much about our corpus. If it is what you needed at PARLAMINT, this should do the trick. I am worried you are working so hard. A minor question is, if ever in the future we wanted to modify what we have (metadata, etc.), would that be possible. Just asking (in case we want to add more files or add metadata we do not have like occupation, etc.).
And yeessssss you made me very happy today.
Best for now
mc
if ever in the future we wanted to modify what we have (metadata, etc.), would that be possible.
Well, you have the XSLT scripts that convert your encoding to ParlaMint here in the repo, so, yes, I guess so!
We will not touch anything in the near future. Maybe for the summer, we will try to improve the corpus and add more files. Also, if we have other parliaments, provided I understand the order of the work-flow, we will consider creating new corpora. :-)
Best
mc
I think we can close this issue too!
Dear Tomaz, I feel so full of energy today. I have just had a videoconference with Luciana de Macedo and Andressa Gomide, who are going to produce the annotations (to get a bit of the load off your shoulders) and I think they are greeeeeaaaat. So I now have some questions:
1) You told us to beware because the Parlamint-ES corpus was not totally finished the way you want to finish it. So we are in no hurry of course but as soon as you are ready let us know.
2) Are we right in thinking we have to produce the .ana.xml version of documents? I seem to have understood that you would do the .vert files.
3) What else do you need to do with the Spanish corpus (after the .ana.xml version)?
4) We were also wondering about the validation scripts, described in the schema folder? Do we need to do those as well. Which ones for the .ana.xml version?
5) As soon as we start, it might take us (we think) 15 days to accomplish the task. Would this be fine? (Also Luciana is about to deliver a baby. Let's hope we can finish before she is due. :-)
Anyway, Tomaz, Luciana was telling me that you have put so much work recently on our corpus that I feel a bit embarrassed and I would like to thank you for everything.
Best for now,
mc